AITopics | code-switching data

Collaborating Authors

code-switching data

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training

Wang, Zhijun, Li, Jiahuan, Zhou, Hao, Weng, Rongxiang, Wang, Jingang, Huang, Xin, Han, Xue, Feng, Junlan, Deng, Chao, Huang, Shujian

arXiv.org Artificial IntelligenceApr-23-2025

Large language models (LLMs) exhibit remarkable multilingual capabilities despite the extreme language imbalance in the pre-training data. In this paper, we closely examine the reasons behind this phenomenon, focusing on the pre-training corpus. We find that the existence of code-switching, alternating between different languages within a context, is key to multilingual capabilities. We conduct an analysis to investigate code-switching in the pre-training corpus, examining its presence and categorizing it into four types within two quadrants. We then assess its impact on multilingual performance. These types of code-switching data are unbalanced in proportions and demonstrate different effects on facilitating language transfer. To better explore the power of code-switching for language alignment during pre-training, we investigate the strategy of synthetic code-switching. We continuously scale up the synthetic code-switching data and observe remarkable improvements in both benchmarks and representation space. Extensive experiments indicate that incorporating synthetic code-switching data enables better language alignment and generalizes well to high, medium, and low-resource languages with pre-training corpora of varying qualities.

computational linguistic, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2504.01801

Country:

Europe (0.68)
North America > United States (0.46)
Asia > China (0.28)
North America > Mexico (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)

Add feedback

Data Augmentation for End-to-end Code-switching Speech Recognition

Du, Chenpeng, Li, Hao, Lu, Yizhou, Wang, Lan, Qian, Yanmin

arXiv.org Artificial IntelligenceNov-3-2024

Training a code-switching end-to-end automatic speech recognition (ASR) model normally requires a large amount of data, while code-switching data is often limited. In this paper, three novel approaches are proposed for code-switching data augmentation. Specifically, they are audio splicing with the existing code-switching data, and TTS with new code-switching texts generated by word translation or word insertion. Our experiments on 200 hours Mandarin-English code-switching dataset show that all the three proposed approaches yield significant improvements on code-switching ASR individually. Moreover, all the proposed approaches can be combined with recent popular SpecAugment, and an addition gain can be obtained. WER is significantly reduced by relative 24.0% compared to the system without any data augmentation, and still relative 13.0% gain compared to the system with only SpecAugment

augmentation, speech recognition, utterance, (12 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/slt48900.2021.9383620

2011.0216

Country:

Asia > China > Shanghai > Shanghai (0.05)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Asia > Nepal > Bagmati Province > Kathmandu District > Kathmandu (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Improving Zero-Shot Cross-Lingual Transfer via Progressive Code-Switching

Li, Zhuoran, Hu, Chunming, Chen, Junfan, Chen, Zhijun, Guo, Xiaohui, Zhang, Richong

arXiv.org Artificial IntelligenceJun-19-2024

Code-switching is a data augmentation scheme mixing words from multiple languages into source lingual text. It has achieved considerable generalization performance of cross-lingual transfer tasks by aligning cross-lingual contextual word representations. However, uncontrolled and over-replaced code-switching would augment dirty samples to model training. In other words, the excessive code-switching text samples will negatively hurt the models' cross-lingual transferability. To this end, we propose a Progressive Code-Switching (PCS) method to gradually generate moderately difficult code-switching examples for the model to discriminate from easy to hard. The idea is to incorporate progressively the preceding learned multilingual knowledge using easier code-switching data to guide model optimization on succeeding harder code-switching data. Specifically, we first design a difficulty measurer to measure the impact of replacing each word in a sentence based on the word relevance score. Then a code-switcher generates the code-switching data of increasing difficulty via a controllable temperature variable. In addition, a training scheduler decides when to sample harder code-switching data for model training. Experiments show our model achieves state-of-the-art results on three different zero-shot cross-lingual transfer tasks across ten languages.

code-switching data, code-switching sentence, proc, (15 more...)

arXiv.org Artificial Intelligence

2406.13361

Country:

Asia > India > Haryana (0.05)
Asia > India > Uttar Pradesh (0.05)
Asia > China > Beijing > Beijing (0.04)
(8 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.63)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.46)

Add feedback

The Effect of Alignment Objectives on Code-Switching Translation

Anwar, Mohamed

arXiv.org Artificial IntelligenceSep-10-2023

One of the things that need to change when it comes to machine translation is the models' ability to translate code-switching content, especially with the rise of social media and user-generated content. In this paper, we are proposing a way of training a single machine translation model that is able to translate monolingual sentences from one language to another, along with translating code-switched sentences to either language. This model can be considered a bilingual model in the human sense. For better use of parallel data, we generated synthetic code-switched (CSW) data along with an alignment loss on the encoder to align representations across languages. Using the WMT14 English-French (En-Fr) dataset, the trained model strongly outperforms bidirectional baselines on code-switched translation while maintaining quality for non-code-switched (monolingual) data.

arxiv preprint arxiv, code-switched data, translation, (14 more...)

arXiv.org Artificial Intelligence

2309.05044

Country:

Africa (0.05)
North America > United States > Tennessee (0.04)
North America > Canada > Ontario (0.04)
(2 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Code-Switching Text Generation and Injection in Mandarin-English ASR

Yu, Haibin, Hu, Yuxuan, Qian, Yao, Jin, Ma, Liu, Linquan, Liu, Shujie, Shi, Yu, Qian, Yanmin, Lin, Edward, Zeng, Michael

arXiv.org Artificial IntelligenceMar-20-2023

Code-switching speech refers to a means of expression by mixing two or more languages within a single utterance. Automatic Speech Recognition (ASR) with End-to-End (E2E) modeling for such speech can be a challenging task due to the lack of data. In this study, we investigate text generation and injection for improving the performance of an industry commonly-used streaming model, Transformer-Transducer (T-T), in Mandarin-English code-switching speech recognition. We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces. Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models, i.e., 16% relative Token-based Error Rate (TER) reduction averaged on three evaluation sets, and the approach of tying speech and text latent spaces is superior to that of TTS conversion on the evaluation set which contains more homogeneous data with the training set.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2303.10949

Country:

North America > United States > Washington > King County > Seattle (0.04)
Asia > East Asia (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.91)

Add feedback