Publications
Targeted Multilingual Adaptation for Low-resource Language Families, EMNLP Findings 2024, C.M. Downey, Terra Blevins, Dhwani Serai, Dwija Parikh, Shane Steinert-Threlkeld
- Adapted XLM-R to low-resource language families through targeted multilingual training and hyperparameter evaluation, improving POS tagging and dependency parsing accuracy across 15+ languages
- Identified key hyperparameters through regression analysis, establishing best practices for up-sampling low-resource languages without compromising high-resource language performance
Normalization and Back-transliteration for Code-Switched Text, CALCS (NAACL 2021), Dwija Parikh and Thamar Solorio
- Developed a preprocessing module for code-switched data using a hybrid approach that combined rule-based phonemic transcription with an LSTM-based seq2seq model, reaching 78.6% accuracy
- Engineered a novel grapheme-to-phoneme (G2P) conversion technique for romanized Hindi, improving the processing and analysis of code-switched social media text
- Released a dataset of script-corrected Hindi-English code-switched sentences, labeled for named entity recognition and part-of-speech tagging, to support further code-switching research in NLP
Page Design