official language


I'm a Vietnamese refugee. We are proud to speak the language of our new home as all immigrants should

FOX News

After the fall of Saigon in 1975, waves of South Vietnamese refugees fled to the United States, seeking freedom and safety. About 125,000 refugees were airlifted initially, with upwards of 800,000 refugees fleeing in the years following – many of whom ended up settling in the U.S. As of 2017, Vietnamese-Americans comprise approximately 3% of America's immigrants, and represent the sixth-largest foreign-born population. Upon resettling in the United States, many refugees encountered a language barrier which made navigating new lives in a new nation a challenge.


Column: How California helped Trump make English the official national language

Los Angeles Times

It was the spring of 1985, and Californians were waging civic war on behalf of English. Some Monterey Park residents were pushing their City Council to ban Chinese-language business signs. Voters who had passed Proposition 38 a year earlier were waiting for Gov. George Deukmejian to implement the initiative, which required that he ask the federal government to print election material only in English. S.I. Hayakawa, one of Proposition 38's co-authors, was preparing for Proposition 63, which would enshrine English as the state's official language, after Whittier-area Assemblymember Frank Hill introduced a bill proposing just that. Tiny Fillmore in Ventura County had already become one of the first cities in the country to go English-official.


'It's long past time': Colombian-born GOP senator rallies around making English official language of US

FOX News

FIRST ON FOX: Freshman GOP Sen. Bernie Moreno is introducing a bill that would declare English as the official language of the United States. The bill, named the English Language Unity Act of 2025, would "declare English as the official language of the United States" and "establish a uniform English language rule for naturalization, and to avoid misconstructions of the English language texts of the laws of the United States." Variations of the bill have been put forward in the past, including in 2023 from then-Ohio Sen. JD Vance, who said at the time that English "has been a cornerstone of American culture for over 250 years" and that it "is far past time for Congress to codify its place into law, which is exactly what this bill does." In a statement to Fox News Digital, Moreno, who was born in Colombia, said, "JD Vance was right – English is the official language of the United States and, as one of the only naturalized citizens serving in the Senate, I should know."


Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages

Tamang, S., Bora, D. J.

arXiv.org Artificial Intelligence

Large Language Models (LLMs) based on transformer architectures have revolutionized a variety of domains, with tokenization playing a pivotal role in their pre-processing and fine-tuning stages. In multilingual models, particularly those tailored for Indic languages, effective tokenization is crucial for optimizing performance. This paper presents a comprehensive evaluation of tokenizers used by 12 LLMs across all 22 official languages of India, with a focus on comparing the efficiency of their tokenization processes. We employed the Normalized Sequence Length (NSL) as a key metric in our analysis. Our findings reveal that the SUTRA tokenizer outperforms all other models, including several Indic-specific models, excelling in 14 languages. Notable insights include the SUTRA tokenizer's superior handling of Indic languages, GPT-4o's advancement over its predecessor GPT-4 in processing Indian languages, and the limited performance of Project Indus in certain languages. This study underscores the critical importance of developing targeted tokenization strategies for multilingual and Indic-centric models, laying the groundwork for future improvements in tokenizer design to enhance linguistic coverage and model efficiency.
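As a rough illustration of the NSL idea (assuming the common definition: a tokenizer's sequence length divided by a baseline tokenizer's length, averaged over a corpus; the toy tokenizers below are stand-ins, not the models the paper evaluates):

```python
def normalized_sequence_length(tokenize, tokenize_baseline, corpus):
    """Average ratio of token counts: values below 1 mean the tokenizer
    is more compact than the baseline on this corpus."""
    ratios = [
        len(tokenize(text)) / len(tokenize_baseline(text))
        for text in corpus if tokenize_baseline(text)
    ]
    return sum(ratios) / len(ratios)

# Toy stand-ins: whitespace words measured against a character baseline.
char_tok = lambda s: list(s.replace(" ", ""))
word_tok = lambda s: s.split()

corpus = ["tokenization matters", "multilingual models"]
print(round(normalized_sequence_length(word_tok, char_tok, corpus), 3))  # → 0.108
```

In practice the baseline would be another subword tokenizer and the corpus would be per-language text, so a lower NSL in, say, Hindi signals more efficient segmentation of that script.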


Probing Language Models on Their Knowledge Source

Tighidet, Zineddine, Mogini, Andrea, Mei, Jiali, Piwowarski, Benjamin, Gallinari, Patrick

arXiv.org Artificial Intelligence

Large Language Models (LLMs) often encounter conflicts between the knowledge learned during training (parametric knowledge, PK) and external knowledge provided during inference (contextual knowledge, CK). Understanding how LLMs prioritize one knowledge source over the other remains a challenge. In this paper, we propose a novel probing framework to explore the mechanisms governing the selection between PK and CK in LLMs. Using controlled prompts designed to contradict the model's PK, we demonstrate that specific model activations are indicative of the knowledge source employed. We evaluate this framework on various LLMs of different sizes and demonstrate that mid-layer activations, particularly those related to relations in the input, are crucial in predicting knowledge source selection, paving the way for more reliable models capable of handling knowledge conflicts effectively.
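A minimal sketch of the probing setup, with hypothetical toy vectors standing in for the mid-layer activations the paper extracts from real LLMs: a simple classifier is fit on activations labeled by which knowledge source the model actually used, then predicts the source for new activations. (A nearest-centroid rule is used here purely for brevity; it is not the paper's classifier.)

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def train_probe(acts_pk, acts_ck):
    """Fit a nearest-centroid probe on activations labeled PK vs. CK."""
    c_pk, c_ck = centroid(acts_pk), centroid(acts_ck)
    def predict(act):
        dist = lambda c: sum((a - b) ** 2 for a, b in zip(act, c))
        return "PK" if dist(c_pk) < dist(c_ck) else "CK"
    return predict

# Toy activations: PK-driven answers cluster away from CK-driven ones.
probe = train_probe(acts_pk=[[1.0, 0.1], [0.9, 0.0]],
                    acts_ck=[[0.1, 1.0], [0.0, 0.9]])
print(probe([0.8, 0.2]))  # → PK
```

The paper's finding would correspond to such a probe being most accurate when fed activations from middle layers, especially at relation tokens in the prompt.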


UniArk: Improving Generalisation and Consistency for Factual Knowledge Extraction through Debiasing

Yang, Yijun, He, Jie, Chen, Pinzhen, Gutiérrez-Basulto, Víctor, Pan, Jeff Z.

arXiv.org Artificial Intelligence

Several recent papers have investigated the potential of language models as knowledge bases as well as the existence of severe biases when extracting factual knowledge. In this work, we focus on factual probing performance over prompts unseen during tuning, and using a probabilistic view we show the inherent misalignment between pre-training and downstream tuning objectives in language models for probing knowledge. We hypothesize that simultaneously debiasing these objectives can be the key to generalisation over unseen prompts. We propose an adapter-based framework, UniArk, for generalised and consistent factual knowledge extraction through simple methods without introducing extra parameters. Extensive experiments show that UniArk can significantly improve the model's out-of-domain generalisation as well as consistency under various prompts. Additionally, we construct ParaTrex, a large-scale and diverse dataset for measuring the inconsistency and out-of-domain generation of models. Further, ParaTrex offers a reference method for constructing paraphrased datasets using large language models.


Truth Forest: Toward Multi-Scale Truthfulness in Large Language Models through Intervention without Tuning

Chen, Zhongzhi, Sun, Xingwu, Jiao, Xianfeng, Lian, Fengzong, Kang, Zhanhui, Wang, Di, Xu, Cheng-Zhong

arXiv.org Artificial Intelligence

Despite the great success of large language models (LLMs) in various tasks, they suffer from generating hallucinations. We introduce Truth Forest, a method that enhances truthfulness in LLMs by uncovering hidden truth representations using multi-dimensional orthogonal probes. Specifically, it creates multiple orthogonal bases for modeling truth by incorporating orthogonal constraints into the probes. Moreover, we introduce Random Peek, a systematic technique considering an extended range of positions within the sequence, reducing the gap between discerning and generating truth features in LLMs. By employing this approach, we improved the truthfulness of Llama-2-7B from 40.8% to 74.5% on TruthfulQA. Likewise, significant improvements are observed in fine-tuned models. We conducted a thorough analysis of truth features using probes. Our visualization results show that orthogonal probes capture complementary truth-related features, forming well-defined clusters that reveal the inherent structure of the dataset.
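One common way to impose an orthogonality constraint on probe directions is a Frobenius-norm penalty on the probe weight matrix, which could be added to the probes' training loss; the pure-Python sketch below (with toy matrices, not LLM hidden states) shows the penalty itself, not the paper's full training procedure:

```python
def gram(W):
    """Compute W @ W^T for a matrix given as a list of rows."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in W] for r1 in W]

def orthogonality_penalty(W):
    """Squared Frobenius norm ||W W^T - I||^2: zero exactly when the
    probe directions (rows of W) are orthonormal."""
    G = gram(W)
    return sum((G[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(len(G)) for j in range(len(G)))

orthonormal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
collapsed   = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
print(orthogonality_penalty(orthonormal), orthogonality_penalty(collapsed))  # → 0.0 2.0
```

Driving this penalty toward zero pushes the probes to span complementary directions, which is what lets them capture the complementary truth-related features the abstract describes.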


Contextualising Levels of Language Resourcedness affecting Digital Processing of Text

Keet, C. Maria, Khumalo, Langa

arXiv.org Artificial Intelligence

Application domains such as digital humanities and tools like chatbots involve some form of processing natural language, from digitising hardcopies to speech generation. The language of the content is typically characterised as either a low resource language (LRL) or high resource language (HRL), also known as resource-scarce and well-resourced languages, respectively. African languages have been characterized as resource-scarce languages (Bosch et al. 2007; Pretorius & Bosch 2003; Keet & Khumalo 2014) and English is by far the most well-resourced language. Varied language resources are used to develop software systems for these languages to accomplish a wide range of tasks. In this paper we argue that the dichotomous typology LRL and HRL for all languages is problematic. Through a clear understanding of language resources situated in a society, a matrix is developed that characterizes languages as Very LRL, LRL, RL, HRL and Very HRL. The characterization is based on the typology of contextual features for each category, rather than counting tools, and motivation is provided for each feature and each characterization. The contextualisation of resourcedness, with a focus on African languages in this paper, and an increased understanding of where on the scale the language used in a project is, may assist in, among others, better planning of research and implementation projects. We thus argue in this paper that the characterization of language resources within a given scale in a project is an indispensable component particularly in the context of low-resourced languages.


Extracting Multi-valued Relations from Language Models

Singhania, Sneha, Razniewski, Simon, Weikum, Gerhard

arXiv.org Artificial Intelligence

The widespread usage of latent language representations via pre-trained language models (LMs) suggests that they are a promising source of structured knowledge. However, existing methods focus only on a single object per subject-relation pair, even though often multiple objects are correct. To overcome this limitation, we analyze these representations for their potential to yield materialized multi-object relational knowledge. We formulate the problem as a rank-then-select task. For ranking candidate objects, we evaluate existing prompting techniques and propose new ones incorporating domain knowledge. Among the selection methods, we find that choosing objects with a likelihood above a learned relation-specific threshold gives a 49.5% F1 score. Our results highlight the difficulty of employing LMs for the multi-valued slot-filling task and pave the way for further research on extracting relational knowledge from latent language representations.
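The selection step the abstract describes — keeping every candidate object whose likelihood clears a relation-specific threshold — might look like the following sketch (the scores, relation, and threshold are hypothetical; the paper learns the threshold per relation from data):

```python
def select_objects(scored_candidates, threshold):
    """Rank-then-select: keep each candidate object whose (log-)likelihood
    under the LM meets the relation-specific threshold. Assumes the input
    is already ranked best-first, so the output stays ranked."""
    return [obj for obj, score in scored_candidates if score >= threshold]

# Toy log-likelihoods for a multi-valued relation such as
# "official languages of Switzerland".
candidates = [("German", -1.2), ("French", -1.5),
              ("Italian", -2.0), ("Dutch", -6.3)]
print(select_objects(candidates, threshold=-3.0))  # → ['German', 'French', 'Italian']
```

The threshold is what turns a ranking into a variable-size answer set: a single cut-off shared across relations would over- or under-select, which is why a learned per-relation value is needed.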


How Many Answers Should I Give? An Empirical Study of Multi-Answer Reading Comprehension

Zhang, Chen, Lin, Jiuheng, Liu, Xiao, Lai, Yuxuan, Feng, Yansong, Zhao, Dongyan

arXiv.org Artificial Intelligence

The multi-answer phenomenon, where a question may have multiple answers scattered in the document, is handled well by humans but remains challenging for machine reading comprehension (MRC) systems. Despite recent progress in multi-answer MRC, a systematic analysis of how this phenomenon arises and how to better address it has been lacking. In this work, we design a taxonomy to categorize commonly seen multi-answer MRC instances, with which we inspect three multi-answer datasets and analyze where the multi-answer challenge comes from. We further analyze how well different paradigms of current multi-answer MRC models deal with different types of multi-answer instances. We find that some paradigms capture well the key information in the questions while others better model the relationship between questions and contexts. We thus explore strategies to make the best of the strengths of different paradigms. Experiments show that generation models can be a promising platform to incorporate different paradigms. Our annotations and code are released for further research.