Deep Language Detection for Indian Code-Mixed Text: A Context-Aware Deep Learning Approach to Multilingual Content Classification

Authors: Shraddha Yadav, Esha Srivastava and Amit Kumar Srivastava

Abstract:

India's multilingual character is visible in online discussions, where Hindi, Gujarati, and English are frequently mixed within a single sentence. This phenomenon, known as code-mixing, poses serious challenges to standard Natural Language Processing (NLP) systems, particularly language identification. Conventional models, typically trained on monolingual, standardized data, struggle with the informal, transliterated, and mixed-script text common on social media. This paper proposes a deep learning-based language detection model tailored to Indian code-mixed text. The model combines context-sensitive word representations from multilingual transformers (mBERT) with bidirectional LSTM sequence decoders and attention mechanisms to identify and tag languages at both the word and sentence levels. A linguistically annotated, manually curated dataset of Hindi-English-Gujarati Reddit and Twitter posts was constructed for training and evaluation. Experiments show that the proposed model significantly outperforms baseline approaches, particularly on highly mixed and informal content. Moreover, a script-sensitive pre-processing pipeline improves handling of Roman, Devanagari, and Gujarati scripts. These results support the development of inclusive language technologies for India, such as chatbots, content moderation systems, and multilingual information retrieval. By addressing the problems of low-resource and code-mixed language processing, this study bridges a gap in Indian NLP research and lays the groundwork for further progress on complex, multilingual, informal online text.
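To illustrate the idea behind a script-sensitive pre-processing step, the sketch below tags each token with its dominant script (Roman/Latin, Devanagari, or Gujarati) using Unicode code-point ranges, before any language identification runs. This is a minimal illustration assuming a majority-vote heuristic; the function names and the voting rule are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of script-sensitive pre-processing for
# Hindi-English-Gujarati code-mixed text. Tags each token with its
# dominant script via Unicode block ranges; names are illustrative.

def char_script(ch: str) -> str:
    """Map a single character to a script label via its code point."""
    cp = ord(ch)
    if 0x0900 <= cp <= 0x097F:      # Devanagari block
        return "Devanagari"
    if 0x0A80 <= cp <= 0x0AFF:      # Gujarati block
        return "Gujarati"
    if ("a" <= ch <= "z") or ("A" <= ch <= "Z"):  # basic Latin letters
        return "Latin"
    return "Other"                   # digits, punctuation, emoji, ...

def token_script(token: str) -> str:
    """Majority vote over a token's characters, ignoring 'Other'."""
    counts: dict[str, int] = {}
    for ch in token:
        s = char_script(ch)
        if s != "Other":
            counts[s] = counts.get(s, 0) + 1
    return max(counts, key=counts.get) if counts else "Other"

def tag_scripts(text: str) -> list[tuple[str, str]]:
    """Whitespace-tokenize and pair each token with its script tag."""
    return [(tok, token_script(tok)) for tok in text.split()]

# Example on a code-mixed sentence (Romanized Hindi/Gujarati plus
# native-script words):
print(tag_scripts("yaar मुझे kem cho સમજાયું nahi"))
```

A downstream word-level tagger can then use these script tags as features: Devanagari or Gujarati script strongly constrains the language label, while Latin-script tokens (which may be transliterated Hindi, Gujarati, or English) are the genuinely ambiguous cases the contextual model must resolve.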

Keywords:

Multilingual Language Detection, Hindi–Gujarati Code-Mixed Text, Natural Language Processing (NLP), Deep Learning, Transformer Models, BERT, IndicBERT, BiLSTM, Self-Attention, Conditional Random Fields (CRF), Transliteration Challenges.