AITopics | unseen language

Collaborating Authors

unseen language

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Happiness is Sharing a Vocabulary: A Study of Transliteration Methods

Jung, Haeji, Kim, Jinju, Kim, Kyungjin, Roh, Youjeong, Mortensen, David R.

arXiv.org Artificial IntelligenceOct-14-2025

Transliteration has emerged as a promising means to bridge the gap between various languages in multilingual NLP, showing promising results especially for languages using non-Latin scripts. We investigate the degree to which shared script, overlapping token vocabularies, and shared phonology contribute to performance of multilingual models. To this end, we conduct controlled experiments using three kinds of transliteration (romanization, phonemic transcription, and substitution ciphers) as well as orthography. We evaluate each model on two downstream tasks -- named entity recognition (NER) and natural language inference (NLI) -- and find that romanization significantly outperforms other input types in 7 out of 8 evaluation settings, largely consistent with our hypothesis that it is the most effective approach. We further analyze how each factor contributed to the success, and suggest that having longer (subword) tokens shared with pre-trained languages leads to better utilization of the model.

artificial intelligence, computational linguistic, natural language, (18 more...)

arXiv.org Artificial Intelligence

2510.10827

Country:

North America > United States (1.00)
Europe (1.00)
Asia (1.00)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)

Add feedback

Code-Switching In-Context Learning for Cross-Lingual Transfer of Large Language Models

Yoo, Haneul, Jin, Jiho, Cho, Kyunghyun, Oh, Alice

arXiv.org Artificial IntelligenceOct-8-2025

While large language models (LLMs) exhibit strong multilingual abilities, their reliance on English as latent representations creates a translation barrier, where reasoning implicitly depends on internal translation into English. When this process fails, performance in non-English languages deteriorates sharply, limiting the inclusiveness of LLM-based applications. Existing cross-lingual in-context learning (X-ICL) methods primarily leverage monolingual demonstrations, often failing to mitigate this barrier and instead reinforcing it. In this work, we introduce code-switching in-context learning (CSICL), a simple yet effective prompting strategy that progressively transitions from a target language to English within demonstrations and instruction to facilitate their latent reasoning in English. By explicitly scaffolding the reasoning process through controlled code-switching, CSICL acts as an implicit linguistic bridge that enhances cross-lingual alignment and reduces reliance on the translation barrier. We conduct extensive experiments across 4 LLMs, 6 datasets, and 10 languages, spanning both knowledge-intensive and reasoning-oriented domains. Our results demonstrate that CSICL consistently outperforms X-ICL baselines, achieving gains of 3.1%p and 1.9%p in both target and unseen languages, respectively. The improvement is even more pronounced in low-resource settings, with gains of 14.7% in target and 5.3% in unseen languages. These findings establish code-switching as a principled and robust approach for overcoming the translation barrier during inference, moving LLMs toward more equitable and effective multilingual systems.

computational linguistic, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2510.05678

Country:

Asia (0.93)
North America > United States (0.93)
Europe > United Kingdom > England (0.28)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Multi Language Models for On-the-Fly Syntax Highlighting

Palma, Marco Edoardo, Rani, Pooja, Gall, Harald C.

arXiv.org Artificial IntelligenceOct-7-2025

Syntax highlighting is a critical feature in modern software development environments, enhancing code readability and developer productivity. However, delivering accurate highlighting in real time remains challenging for online and web-based development tools due to strict time and memory constraints on backend services. These systems must serve highlights rapidly and frequently, even when code is partially valid or invalid. This has led to on-the-fly syntax highlighting, where visual annotations are generated just before content is served, often at high request rates and under incomplete input conditions. To meet these demands efficiently, state-of-the-art models use deep learning to learn the behavior of brute-force syntax highlighting resolvers, tools that are easy to implement but too slow for production. Through the Deep Abstraction process, brute-force strategies are encoded into fast statistical models that achieve both high accuracy and low-latency inference. Despite their success, such models face key challenges: they support only one programming language per model, require large datasets from slow brute-force generators, and involve resource-intensive training. In multi-language environments, this means maintaining multiple independent models, increasing system complexity and operational cost. This work addresses these issues by introducing a unified model capable of highlighting up to six mainstream programming languages, reducing deployment complexity by a factor of six and improving performance on unseen languages. A novel normalization technique significantly enhances model generalization, while few-shot learning experiments show that a small number of oracle samples can replace large datasets, minimizing dependence on brute-force generators. Combined, these innovations enable efficient, scalable, and cost-effective syntax highlighting across diverse programming languages.

accuracy, machine learning, programming language, (19 more...)

arXiv.org Artificial Intelligence

2510.04166

Country: Europe > Switzerland (0.28)

Genre:

Research Report > New Finding (0.68)
Research Report > Promising Solution (0.48)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Adapting Language Models to Indonesian Local Languages: An Empirical Study of Language Transferability on Zero-Shot Settings

Putri, Rifki Afina

arXiv.org Artificial IntelligenceJul-3-2025

--In this paper, we investigate the transferability of pre-trained language models to low-resource Indonesian local languages through the task of sentiment analysis. We evaluate both zero-shot performance and adapter-based transfer on ten local languages using models of different types: a monolingual Indonesian BERT, multilingual models such as mBERT and XLM-R, and a modular adapter-based approach called MAD-X. T o better understand model behavior, we group the target languages into three categories: seen (included during pre-training), partially seen (not included but linguistically related to seen languages), and unseen (absent and unrelated in pre-training data). Our results reveal clear performance disparities across these groups: multilingual models perform best on seen languages, moderately on partially seen ones, and poorly on unseen languages. We find that MAD-X significantly improves performance, especially for seen and partially seen languages, without requiring labeled data in the target language. Additionally, we conduct a further analysis on tokenization and show that while subword fragmentation and vocabulary overlap with Indonesian correlate weakly with prediction quality, they do not fully explain the observed performance. Instead, the most consistent predictor of transfer success is the model's prior exposure to the language, either directly or through a related language.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2507.01645

Country: Asia > Indonesia (0.05)

Genre: Research Report > New Finding (0.89)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.65)

Add feedback

One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers

Abagyan, Diana, Salamanca, Alejandro R., Cruz-Salinas, Andres Felipe, Cao, Kris, Lin, Hangyu, Locatelli, Acyr, Fadaee, Marzieh, Üstün, Ahmet, Hooker, Sara

arXiv.org Artificial IntelligenceJun-13-2025

Pretraining massively multilingual Large Language Models (LLMs) for many languages at once is challenging due to limited model capacity, scarce high-quality data, and compute constraints. Moreover, the lack of language coverage of the tokenizer makes it harder to address the gap for new languages purely at the post-training stage. In this work, we study what relatively cheap interventions early on in training improve "language plasticity", or adaptation capabilities of the model post-training to new languages. We focus on tokenizer design and propose using a universal tokenizer that is trained for more languages than the primary pretraining languages to enable efficient adaptation in expanding language coverage after pretraining. Our systematic experiments across diverse groups of languages and different training strategies show that a universal tokenizer enables significantly higher language adaptation, with up to 20.2% increase in win rates compared to tokenizers specific to pretraining languages. Furthermore, a universal tokenizer also leads to better plasticity towards languages that are completely unseen in the tokenizer and pretraining, by up to 5% win rate gain. We achieve this adaptation to an expanded set of languages with minimal compromise in performance on the majority of languages included in pre-training.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2506.10766

Country:

North America > United States (1.00)
Europe (1.00)
Asia > Middle East > UAE (0.46)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

EZ-VC: Easy Zero-shot Any-to-Any Voice Conversion

Joglekar, Advait, Singh, Divyanshu, Bhatia, Rooshil Rohit, Umesh, S.

arXiv.org Artificial IntelligenceMay-26-2025

Voice Conversion research in recent times has increasingly focused on improving the zero-shot capabilities of existing methods. Despite remarkable advancements, current architectures still tend to struggle in zero-shot cross-lingual settings. They are also often unable to generalize for speakers of unseen languages and accents. In this paper, we adopt a simple yet effective approach that combines discrete speech representations from self-supervised models with a non-autoregressive Diffusion-Transformer based conditional flow matching speech decoder. We show that this architecture allows us to train a voice-conversion model in a purely textless, self-supervised fashion. Our technique works without requiring multiple encoders to disentangle speech features. Our model also manages to excel in zero-shot cross-lingual settings even for unseen languages. For Demo: https://ez-vc.github.io/EZ-VC-Demo/

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2505.16691

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

A Post-trainer's Guide to Multilingual Training Data: Uncovering Cross-lingual Transfer Dynamics

Shimabucoro, Luisa, Ustun, Ahmet, Fadaee, Marzieh, Ruder, Sebastian

arXiv.org Artificial IntelligenceApr-24-2025

In order for large language models to be useful across the globe, they are fine-tuned to follow instructions on multilingual data. Despite the ubiquity of such post-training, a clear understanding of the dynamics that enable cross-lingual transfer remains elusive. This study examines cross-lingual transfer (CLT) dynamics in realistic post-training settings. We study two model families of up to 35B parameters in size trained on carefully controlled mixtures of multilingual data on three generative tasks with varying levels of complexity (summarization, instruction following, and mathematical reasoning) in both single-task and multi-task instruction tuning settings. Overall, we find that the dynamics of cross-lingual transfer and multilingual performance cannot be explained by isolated variables, varying depending on the combination of post-training settings. Finally, we identify the conditions that lead to effective cross-lingual transfer in practice.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2504.16677

Country:

Europe (0.93)
North America > Mexico (0.28)

Genre: Research Report > Experimental Study (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)

Add feedback

Continual Learning with Embedding Layer Surgery and Task-wise Beam Search using Whisper

Kwok, Chin Yuen, Yip, Jia Qi, Chng, Eng Siong

arXiv.org Artificial IntelligenceJan-14-2025

Current Multilingual ASR models only support a fraction of the world's languages. Continual Learning (CL) aims to tackle this problem by adding new languages to pre-trained models while avoiding the loss of performance on existing languages, also known as Catastrophic Forgetting (CF). However, existing CL methods overlook the adaptation of the token embedding lookup table at the decoder, despite its significant contribution to CF. We propose Embedding Layer Surgery where separate copies of the token embeddings are created for each new languages, and one of the copies is selected to replace the old languages embeddings when transcribing the corresponding new language. Unfortunately, this approach means LID errors also cause incorrect ASR embedding selection. Our Task-wise Beam Search allows self-correction for such mistakes. By adapting Whisper to 10 hours of data for each of 10 unseen languages from Common Voice, results show that our method reduces the Average WER (AWER) of pre-trained languages from 14.2% to 11.9% compared with Experience Replay, without compromising the AWER of the unseen languages.

arxiv, lookup table, new language, (11 more...)

arXiv.org Artificial Intelligence

2501.07875

Country:

Asia > Singapore (0.04)
Europe > Czechia > South Moravian Region > Brno (0.04)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.64)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.48)

Add feedback

The Impact of Model Scaling on Seen and Unseen Language Performance

Pokharel, Rhitabrat, Nezhad, Sina Bagheri, Agrawal, Ameeta, Singh, Suresh

arXiv.org Artificial IntelligenceJan-9-2025

The rapid advancement of Large Language Models (LLMs), particularly those trained on multilingual corpora, has intensified the need for a deeper understanding of their performance across a diverse range of languages and model sizes. Our research addresses this critical need by studying the performance and scaling behavior of multilingual LLMs in text classification and machine translation tasks across 204 languages. We systematically examine both seen and unseen languages across three model families of varying sizes in zero-shot and few-shot settings. Our findings show significant differences in scaling behavior between zero-shot and two-shot scenarios, with striking disparities in performance between seen and unseen languages. Model scale has little effect on zero-shot performance, which remains mostly flat. However, in two-shot settings, larger models show clear linear improvements in multilingual text classification. For translation tasks, however, only the instruction-tuned model showed clear benefits from scaling. Our analysis also suggests that overall resource levels, not just the proportions of pretraining languages, are better predictors of model performance, shedding light on what drives multilingual LLM effectiveness.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2501.05629

Country:

North America > Canada > Ontario > Toronto (0.04)
Asia > Singapore (0.04)
Europe > Middle East > Malta > Eastern Region > Northern Harbour District > St. Julian's (0.04)
(7 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

LAMA-UT: Language Agnostic Multilingual ASR through Orthography Unification and Language-Specific Transliteration

Lee, Sangmin, Chung, Woo-Jin, Kang, Hong-Goo

arXiv.org Artificial IntelligenceDec-22-2024

Building a universal multilingual automatic speech recognition (ASR) model that performs equitably across languages has long been a challenge due to its inherent difficulties. To address this task we introduce a Language-Agnostic Multilingual ASR pipeline through orthography Unification and language-specific Transliteration (LAMA-UT). LAMA-UT operates without any language-specific modules while matching the performance of state-of-the-art models trained on a minimal amount of data. Our pipeline consists of two key steps. First, we utilize a universal transcription generator to unify orthographic features into Romanized form and capture common phonetic characteristics across diverse languages. Second, we utilize a universal converter to transform these universal transcriptions into language-specific ones. In experiments, we demonstrate the effectiveness of our proposed method leveraging universal transcriptions for massively multilingual ASR. Our pipeline achieves a relative error reduction rate of 45% when compared to Whisper and performs comparably to MMS, despite being trained on only 0.1% of Whisper's training data. Furthermore, our pipeline does not rely on any language-specific modules. However, it performs on par with zero-shot ASR approaches which utilize additional language-specific lexicons and language models. We expect this framework to serve as a cornerstone for flexible multilingual ASR systems that are generalizable even to unseen languages.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2412.15299

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback