CMLFormer: A Dual Decoder Transformer with Switching Point Learning for Code-Mixed Language Modeling

Baral, Aditeya, Ajith, Allen George, Nayak, Roshan, Bhanja, Mrityunjay Abhijeet

May-20-2025–arXiv.org Artificial Intelligence

Code-mixed languages, characterized by frequent within-sentence language transitions, present structural challenges that standard language models fail to address. In this work, we propose CMLFormer, an enhanced multi-layer dual-decoder Transformer with a shared encoder and synchronized decoder cross-attention, designed to model the linguistic and semantic dynamics of code-mixed text. CMLFormer is pre-trained on an augmented Hinglish corpus annotated with switching points and translations, using multiple new objectives specifically aimed at capturing switching behavior, cross-lingual structure, and code-mixing complexity. Our experiments show that, under select pre-training setups, CMLFormer improves F1 score, precision, and accuracy over other approaches on the HASOC-2021 benchmark. Attention analyses further show that it can identify and attend to switching points, validating its sensitivity to code-mixed structure. These results demonstrate the effectiveness of CMLFormer's architecture and multi-task pre-training strategy for modeling code-mixed languages.
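To make the notion of a switching point concrete, the following is a minimal sketch that assumes each token carries a language tag and treats a switching point as any position where the tag differs from the previous token's tag. The function name `switching_points`, the tag values `"hi"` and `"en"`, and the example sentence are all hypothetical illustrations; the abstract does not specify the paper's annotation scheme, which may differ.

```python
from typing import List


def switching_points(lang_tags: List[str]) -> List[int]:
    """Return each index i where token i's language tag differs from token i-1's.

    Assumption: one language tag per token (hypothetical scheme, not taken
    from the paper).
    """
    return [
        i
        for i in range(1, len(lang_tags))
        if lang_tags[i] != lang_tags[i - 1]
    ]


# Hypothetical Hinglish example: "mujhe yeh movie bahut pasand hai"
tokens = ["mujhe", "yeh", "movie", "bahut", "pasand", "hai"]
tags = ["hi", "hi", "en", "hi", "hi", "hi"]

print(switching_points(tags))  # [2, 3]
```

Here the sentence switches from Hindi to English at index 2 ("movie") and back to Hindi at index 3 ("bahut"), so both positions are reported.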



Country:
- Asia
  - India (0.04)
  - Middle East > UAE
    - Abu Dhabi Emirate > Abu Dhabi (0.14)
- Europe
  - France > Provence-Alpes-Côte d'Azur
    - Bouches-du-Rhône > Marseille (0.04)
  - Ukraine > Kyiv Oblast
    - Kyiv (0.04)
- North America > Mexico
  - Mexico City > Mexico City (0.04)

Genre:
- Research Report > New Finding (0.88)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.93)
  - Natural Language
    - Chatbot (0.84)
    - Large Language Model (0.94)