AITopics | romanization

Collaborating Authors

romanization

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

RomanLens: The Role Of Latent Romanization In Multilinguality In LLMs

Saji, Alan, Husain, Jaavid Aktar, Jayakumar, Thanmay, Dabre, Raj, Kunchukuttan, Anoop, Puduppully, Ratish

arXiv.org Artificial IntelligenceFeb-16-2025

Large Language Models (LLMs) exhibit remarkable multilingual generalization despite being predominantly trained on English-centric corpora. A fundamental question arises: how do LLMs achieve such robust multilingual capabilities? We take the case of non-Roman script languages, we investigate the role of Romanization - the representation of non-Roman scripts using Roman characters - as a bridge in multilingual processing. Using mechanistic interpretability techniques, we analyze next-token generation and find that intermediate layers frequently represent target words in Romanized form before transitioning to native script, a phenomenon we term Latent Romanization. Further, through activation patching experiments, we demonstrate that LLMs encode semantic concepts similarly across native and Romanized scripts, suggesting a shared underlying representation. Additionally, for translation into non-Roman script languages, our findings reveal that when the target language is in Romanized form, its representations emerge earlier in the model's layers compared to native script. These insights contribute to a deeper understanding of multilingual representation in LLMs and highlight the implicit role of Romanization in facilitating language transfer.

hindi, large language model, natural language, (18 more...)

arXiv.org Artificial Intelligence

2502.07424

Country:

Asia (0.93)
North America > United States (0.28)
North America > Mexico (0.28)
Europe > Austria (0.28)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

LAMA-UT: Language Agnostic Multilingual ASR through Orthography Unification and Language-Specific Transliteration

Lee, Sangmin, Chung, Woo-Jin, Kang, Hong-Goo

arXiv.org Artificial IntelligenceDec-22-2024

Building a universal multilingual automatic speech recognition (ASR) model that performs equitably across languages has long been a challenge due to its inherent difficulties. To address this task we introduce a Language-Agnostic Multilingual ASR pipeline through orthography Unification and language-specific Transliteration (LAMA-UT). LAMA-UT operates without any language-specific modules while matching the performance of state-of-the-art models trained on a minimal amount of data. Our pipeline consists of two key steps. First, we utilize a universal transcription generator to unify orthographic features into Romanized form and capture common phonetic characteristics across diverse languages. Second, we utilize a universal converter to transform these universal transcriptions into language-specific ones. In experiments, we demonstrate the effectiveness of our proposed method leveraging universal transcriptions for massively multilingual ASR. Our pipeline achieves a relative error reduction rate of 45% when compared to Whisper and performs comparably to MMS, despite being trained on only 0.1% of Whisper's training data. Furthermore, our pipeline does not rely on any language-specific modules. However, it performs on par with zero-shot ASR approaches which utilize additional language-specific lexicons and language models. We expect this framework to serve as a cornerstone for flexible multilingual ASR systems that are generalizable even to unseen languages.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2412.15299

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

AyutthayaAlpha: A Thai-Latin Script Transliteration Transformer

Lauc, Davor, Rutherford, Attapol, Wongwarawipatr, Weerin

arXiv.org Artificial IntelligenceDec-5-2024

This study introduces AyutthayaAlpha, an advanced transformer-based machine learning model designed for the transliteration of Thai proper names into Latin script. Our system achieves state-of-the-art performance with 82.32% first-token accuracy and 95.24% first-three-token accuracy, while maintaining a low character error rate of 0.0047. The complexity of Thai phonology, including tonal features and vowel length distinctions, presents significant challenges for accurate transliteration, which we address through a novel two-model approach: AyutthayaAlpha-Small, based on the ByT5 architecture, and AyutthayaAlpha-VerySmall, a computationally efficient variant that unexpectedly outperforms its larger counterpart. Our research combines linguistic rules with deep learning, training on a carefully curated dataset of 1.2 million Thai-Latin name pairs, augmented through strategic upsampling to 2.7 million examples. Extensive evaluations against existing transliteration methods and human expert benchmarks demonstrate that AyutthayaAlpha not only achieves superior accuracy but also effectively captures personal and cultural preferences in name romanization. The system's practical applications extend to cross-lingual information retrieval, international data standardization, and identity verification systems, with particular relevance for government databases, academic institutions, and global business operations. This work represents a significant advance in bridging linguistic gaps between Thai and Latin scripts, while respecting the cultural and personal dimensions of name transliteration.

dataset, romanization, transliteration, (16 more...)

arXiv.org Artificial Intelligence

2412.03877

Country:

Europe > Croatia > Zagreb County > Zagreb (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology > Security & Privacy (0.54)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.67)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.67)

Add feedback

Romanization Encoding For Multilingual ASR

Ding, Wen, Jia, Fei, Xu, Hainan, Xi, Yu, Lai, Junjie, Ginsburg, Boris

arXiv.org Artificial IntelligenceJul-5-2024

We introduce romanization encoding for script-heavy languages to optimize multilingual and code-switching Automatic Speech Recognition (ASR) systems. By adopting romanization encoding alongside a balanced concatenated tokenizer within a FastConformer-RNNT framework equipped with a Roman2Char module, we significantly reduce vocabulary and output dimensions, enabling larger training batches and reduced memory consumption. Our method decouples acoustic modeling and language modeling, enhancing the flexibility and adaptability of the system. In our study, applying this method to Mandarin-English ASR resulted in a remarkable 63.51% vocabulary reduction and notable performance gains of 13.72% and 15.03% on SEAME code-switching benchmarks. Ablation studies on Mandarin-Korean and Mandarin-Japanese highlight our method's strong capability to address the complexities of other script-heavy languages, paving the way for more versatile and effective multilingual ASR systems.

mandarin, romanization, speech recognition, (12 more...)

arXiv.org Artificial Intelligence

2407.04368

Country:

Asia > China (0.05)
Asia > Singapore (0.04)
Asia > Malaysia (0.04)
(4 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.58)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)

Add feedback

CORI: CJKV Benchmark with Romanization Integration -- A step towards Cross-lingual Transfer Beyond Textual Scripts

Nguyen, Hoang H., Zhang, Chenwei, Liu, Ye, Parde, Natalie, Rohrbaugh, Eugene, Yu, Philip S.

arXiv.org Artificial IntelligenceApr-19-2024

Naively assuming English as a source language may hinder cross-lingual transfer for many languages by failing to consider the importance of language contact. Some languages are more well-connected than others, and target languages can benefit from transferring from closely related languages; for many languages, the set of closely related languages does not include English. In this work, we study the impact of source language for cross-lingual transfer, demonstrating the importance of selecting source languages that have high contact with the target language. We also construct a novel benchmark dataset for close contact Chinese-Japanese-Korean-Vietnamese (CJKV) languages to further encourage in-depth studies of language contact. To comprehensively capture contact between these languages, we propose to integrate Romanized transcription beyond textual scripts via Contrastive Learning objectives, leading to enhanced cross-lingual representations and effective zero-shot cross-lingual transfer.

representation, source language, target language, (15 more...)

arXiv.org Artificial Intelligence

2404.12618

Country:

North America > Canada > Ontario > Toronto (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(8 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

RomanSetu: Efficiently unlocking multilingual capabilities of Large Language Models models via Romanization

Husain, Jaavid Aktar, Dabre, Raj, Kumar, Aswanth, Puduppully, Ratish, Kunchukuttan, Anoop

arXiv.org Artificial IntelligenceJan-25-2024

This study addresses the challenge of extending Large Language Models (LLMs) to non-English languages, specifically those using non-Latin scripts. We propose an innovative approach that utilizes the romanized form of text as an interface for LLMs, hypothesizing that its frequent informal use and shared tokens with English enhance cross-lingual alignment. Focusing on Hindi, we demonstrate through Hindi-to-English translation and sentiment analysis tasks that romanized text not only significantly improves inference efficiency due to its lower fertility compared to native text but also achieves competitive performance with limited pre-training. Additionally, our novel multi-script prompting approach, which combines romanized and native texts, shows promise in further enhancing task performance. These findings suggest the potential of romanization in bridging the language gap for LLM applications, with future work aimed at expanding this approach to more languages and tasks.

fine-tuning, hindi, hindi transliterated, (12 more...)

arXiv.org Artificial Intelligence

2401.1428

Country:

Asia > India (0.04)
North America > Dominican Republic (0.04)
Asia > Singapore (0.04)
(5 more...)

Genre: Research Report > New Finding (0.88)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Evaluating Self-supervised Speech Models on a Taiwanese Hokkien Corpus

Chou, Yi-Hui, Chang, Kalvin, Wu, Meng-Ju, Ou, Winston, Bi, Alice Wen-Hsin, Yang, Carol, Chen, Bryan Y., Pai, Rong-Wei, Yeh, Po-Yen, Chiang, Jo-Peng, Phoann, Iu-Tshian, Chang, Winnie, Cui, Chenxuan, Chen, Noel, Shi, Jiatong

arXiv.org Artificial IntelligenceDec-5-2023

Taiwanese Hokkien is declining in use and status due to a language shift towards Mandarin in Taiwan. This is partly why it is a low resource language in NLP and speech research today. To ensure that the state of the art in speech processing does not leave Taiwanese Hokkien behind, we contribute a 1.5-hour dataset of Taiwanese Hokkien to ML-SUPERB's hidden set. Evaluating ML-SUPERB's suite of self-supervised learning (SSL) speech representations on our dataset, we find that model size does not consistently determine performance. In fact, certain smaller models outperform larger ones. Furthermore, linguistic alignment between pretraining data and the target language plays a crucial role.

dataset, representation, syllable, (13 more...)

arXiv.org Artificial Intelligence

2312.06668

Country:

North America > United States > Illinois (0.04)
Asia > Taiwan > Takao Province > Kaohsiung (0.04)
Asia > Taiwan > Taiwan Province > Taipei (0.04)
(8 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.94)

Add feedback

Romanization-based Large-scale Adaptation of Multilingual Language Models

Purkayastha, Sukannya, Ruder, Sebastian, Pfeiffer, Jonas, Gurevych, Iryna, Vulić, Ivan

arXiv.org Artificial IntelligenceApr-18-2023

Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP. However, their large-scale deployment to many languages, besides pretraining data scarcity, is also hindered by the increase in vocabulary size and limitations in their parameter budget. In order to boost the capacity of mPLMs to deal with low-resource and unseen languages, we explore the potential of leveraging transliteration on a massive scale. In particular, we explore the UROMAN transliteration tool, which provides mappings from UTF-8 to Latin characters for all the writing systems, enabling inexpensive romanization for virtually any language. We first focus on establishing how UROMAN compares against other language-specific and manually curated transliterators for adapting multilingual PLMs. We then study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages. Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups: on languages with unseen scripts and with limited training data without any vocabulary augmentation. Further analyses reveal that an improved tokenizer based on romanized data can even outperform non-transliteration-based methods in the majority of languages.

computational linguistic, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2304.08865

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Dominican Republic (0.05)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.05)
(10 more...)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)

Add feedback

CUNI Systems for the WMT22 Czech-Ukrainian Translation Task

Popel, Martin, Libovický, Jindřich, Helcl, Jindřich

arXiv.org Artificial IntelligenceDec-1-2022

We present Charles University submissions to the WMT22 General Translation Shared Task on Czech-Ukrainian and Ukrainian-Czech machine translation. We present two constrained submissions based on block back-translation and tagged back-translation and experiment with rule-based romanization of Ukrainian. Our results show that the romanization only has a minor effect on the translation quality. Further, we describe Charles Translator, a system that was developed in March 2022 as a response to the migration from Ukraine to the Czech Republic. Compared to our constrained systems, it did not use the romanization and used some proprietary data sources.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2212.00486

Country:

Europe > Ukraine (0.26)
North America > United States > California > Los Angeles County > Long Beach (0.14)
Europe > Germany > Saxony > Leipzig (0.05)
(8 more...)

Genre: Research Report > New Finding (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Language-agnostic Multilingual Modeling

Datta, Arindrima, Ramabhadran, Bhuvana, Emond, Jesse, Kannan, Anjuli, Roark, Brian

arXiv.org Machine LearningApr-20-2020

Multilingual Automated Speech Recognition (ASR) systems allow for the joint training of data-rich and data-scarce languages in a single model. This enables data and parameter sharing across languages, which is especially beneficial for the data-scarce languages. However, most state-of-the-art multilingual models require the encoding of language information and therefore are not as flexible or scalable when expanding to newer languages. Language-independent multilingual models help to address this issue, and are also better suited for multicultural societies where several languages are frequently used together (but often rendered with different writing systems). In this paper, we propose a new approach to building a language-agnostic multilingual ASR system which transforms all languages to one writing system through a many-to-one transliteration transducer. Thus, similar sounding acoustics are mapped to a single, canonical target sequence of graphemes, effectively separating the modeling and rendering problems. We show with four Indic languages, namely, Hindi, Bengali, Tamil and Kannada, that the language-agnostic multilingual model achieves up to 10% relative reduction in Word Error Rate (WER) over a language-dependent multilingual model.

multilingual model, speech recognition, training data, (14 more...)

arXiv.org Machine Learning

2004.09571

Country: North America > United States (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.35)

Add feedback