Authors: Professor Smita Chunamari, Sahil Tejam, Bhavesh Sonawane, Yash Daund, Janhvi Pawar
Abstract: Paraphrase detection is a core task in Natural Language Processing (NLP): it determines whether two sentences mean the same thing even when they are phrased differently. While the task has been explored extensively in English and a few other widely spoken languages, regional languages, rich in diversity and nuance, remain significantly underrepresented. In this study, we explore the challenges and opportunities of building paraphrase detection systems for regional languages, focusing on linguistic features such as dialect variation, code-mixing, and syntactic differences. We develop a multilingual model trained on both parallel and non-parallel regional datasets, enhanced with data augmentation techniques and semantic similarity measures. We also introduce a small but diverse paraphrase corpus for select Indian languages as a benchmark. Our results show that transformer-based models fine-tuned on language-specific data outperform traditional approaches, highlighting the importance of contextual embeddings in low-resource settings. This work advances NLP for regional languages and opens the door to more inclusive and accessible language technologies, ranging from intelligent search systems to educational tools that understand the linguistic richness of everyday users.