AITopics | Niue

Niue

An Expanded Massive Multilingual Dataset for High-Performance Language Technologies

Burchell, Laurie, de Gibert, Ona, Arefyev, Nikolay, Aulamo, Mikko, Bañón, Marta, Chen, Pinzhen, Fedorova, Mariia, Guillou, Liane, Haddow, Barry, Hajič, Jan, Helcl, Jindřich, Henriksson, Erik, Klimaszewski, Mateusz, Komulainen, Ville, Kutuzov, Andrey, Kytöniemi, Joona, Laippala, Veronika, Mæhlum, Petter, Malik, Bhavitvya, Mehryary, Farrokh, Mikhailov, Vladislav, Moghe, Nikita, Myntti, Amanda, O'Brien, Dayyán, Oepen, Stephan, Pal, Proyag, Piha, Jousia, Pyysalo, Sampo, Ramírez-Sánchez, Gema, Samuel, David, Stepachev, Pavel, Tiedemann, Jörg, Variš, Dušan, Vojtěchová, Tereza, Zaragoza-Bernabeu, Jaume

arXiv.org Artificial Intelligence, Mar-14-2025

Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.

artificial intelligence, machine translation, natural language, (18 more...)

arXiv.org Artificial Intelligence

2503.10267

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Ontario > Toronto (0.04)
Europe > Russia (0.04)
(66 more...)

Genre: Research Report > New Finding (0.67)

Industry:

Information Technology (0.67)
Education (0.46)
Media > News (0.46)
Leisure & Entertainment > Games (0.45)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Programming by Examples Meets Historical Linguistics: A Large Language Model Based Approach to Sound Law Induction

Naik, Atharva, Agrawal, Darsh, Sng, Hong, Marr, Clayton, Zhang, Kexun, Robinson, Nathaniel R, Chang, Kalvin, Byrnes, Rebecca, Mysore, Aravind, Rose, Carolyn, Mortensen, David R

arXiv.org Artificial Intelligence, Jan-27-2025

Historical linguists have long written "programs" that convert reconstructed words in an ancestor language into their attested descendants via ordered string rewrite functions (called sound laws). However, writing these programs is time-consuming, motivating the development of automated Sound Law Induction (SLI), which we formulate in this paper as Programming by Examples (PBE) with Large Language Models (LLMs). While LLMs have been effective for code generation, recent work has shown that PBE is challenging but improvable by fine-tuning, especially with training data drawn from the same distribution as evaluation data. In this paper, we create a conceptual framework of what constitutes a "similar distribution" for SLI and propose four kinds of synthetic data generation methods with varying amounts of inductive bias to investigate what leads to the best performance. Based on the results, we create a SOTA open-source model for SLI as PBE (+6% pass rate with a third of the parameters of the second-best LLM) and also highlight exciting future directions for PBE research.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2501.16524

Country:

Oceania > Niue (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > New York > New York County > New York City (0.04)
(7 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

MIRAI: Evaluating LLM Agents for Event Forecasting

Ye, Chenchen, Hu, Ziniu, Deng, Yihe, Huang, Zijie, Ma, Mingyu Derek, Zhu, Yanqiao, Wang, Wei

arXiv.org Artificial Intelligence, Jul-1-2024

Recent advancements in Large Language Models (LLMs) have empowered LLM agents to autonomously collect world information and reason over it to solve complex problems. Given this capability, there is growing interest in employing LLM agents to predict international events, which can influence decision-making and shape policy development on an international scale. Despite this growing interest, there is no rigorous benchmark of LLM agents' forecasting capability and reliability. To address this gap, we introduce MIRAI, a novel benchmark designed to systematically evaluate LLM agents as temporal forecasters in the context of international events. Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles. We refine the GDELT event database with careful cleaning and parsing to curate a series of relational prediction tasks with varying forecasting horizons, assessing LLM agents' abilities from short-term to long-term forecasting. We further implement APIs to enable LLM agents to utilize different tools via a code-based interface. In summary, MIRAI comprehensively evaluates the agents' capabilities in three dimensions: 1) autonomously sourcing and integrating critical information from large global databases; 2) writing code using domain-specific APIs and libraries for tool use; and 3) jointly reasoning over historical knowledge from diverse formats and times to accurately predict future events. Through comprehensive benchmarking, we aim to establish a reliable framework for assessing the capabilities of LLM agents in forecasting international events, thereby contributing to the development of more accurate and trustworthy models for international relations analysis.

cameocode, isocode, relation, (15 more...)

arXiv.org Artificial Intelligence

2407.01231

Country:

Asia > North Korea (0.14)
Oceania > Australia > Australian Indian Ocean Territories > Territory of Cocos (Keeling) Islands (0.14)
North America > United States > California > Los Angeles County > Los Angeles (0.14)
(234 more...)

Genre: Research Report > New Finding (0.45)

Industry:

Law (1.00)
Government > Foreign Policy (1.00)
Government > Military (0.93)
Information Technology (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

modeLing: A Novel Dataset for Testing Linguistic Reasoning in Language Models

Chi, Nathan A., Malchev, Teodor, Kong, Riley, Chi, Ryan A., Huang, Lucas, Chi, Ethan A., McCoy, R. Thomas, Radev, Dragomir

arXiv.org Artificial Intelligence, Jun-24-2024

We introduce modeLing, a novel benchmark of Linguistics Olympiad-style puzzles which tests few-shot reasoning in AI systems. Solving these puzzles necessitates inferring aspects of a language's grammatical structure from a small number of examples. Such puzzles provide a natural testbed for language models, as they require compositional generalization and few-shot inductive reasoning. Consisting solely of new puzzles written specifically for this work, modeLing has no risk of appearing in the training data of existing AI systems: this ameliorates the risk of data leakage, a potential confounder for many prior evaluations of reasoning. Evaluating several large open source language models and GPT on our benchmark, we observe non-negligible accuracy, demonstrating few-shot emergent reasoning ability which cannot merely be attributed to shallow memorization. However, imperfect model performance suggests that modeLing can be used to measure further progress in linguistic reasoning.

dataset, puzzle, reasoning, (13 more...)

arXiv.org Artificial Intelligence

2406.17038

Country:

North America > United States > Washington > King County > Seattle (0.14)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > Ukraine > Kyiv Oblast > Kyiv (0.04)
(8 more...)

Genre: Research Report (0.40)

Industry: Education > Educational Setting > K-12 Education > Secondary School (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.66)

Can Large Language Models Code Like a Linguist?: A Case Study in Low Resource Sound Law Induction

Naik, Atharva, Zhang, Kexun, Robinson, Nathaniel, Mysore, Aravind, Marr, Clayton, Byrnes, Hong Sng Rebecca, Cai, Anna, Chang, Kalvin, Mortensen, David

arXiv.org Artificial Intelligence, Jun-18-2024

Historical linguists have long written a kind of incompletely formalized "program" that converts reconstructed words in an ancestor language into words in one of its attested descendants and that consists of a series of ordered string rewrite functions (called sound laws). They do this by observing pairs of words in the reconstructed language (protoforms) and the descendant language (reflexes) and constructing a program that transforms protoforms into reflexes. However, writing these programs is error-prone and time-consuming. Prior work has successfully scaffolded this process computationally, but fewer researchers have tackled Sound Law Induction (SLI), which we approach in this paper by casting it as Programming by Examples. We propose a language-agnostic solution that utilizes the programming ability of Large Language Models (LLMs) by generating Python sound law programs from sound change examples. We evaluate the effectiveness of our approach for various LLMs, propose effective methods to generate additional language-agnostic synthetic data to fine-tune LLMs for SLI, and compare our method with existing automated SLI methods, showing that while LLMs lag behind them, they can complement some of their weaknesses.
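
To make the idea of a sound law program concrete, here is a minimal sketch of an ordered cascade of string rewrite functions checked against protoform/reflex pairs. It is an illustration only, not the authors' code: the three sound laws, the word pairs, and the helper names (`make_law`, `apply_cascade`, `pass_fraction`) are all hypothetical, and the scoring function is one plausible way to check a program against examples rather than the paper's metric.

```python
import re
from typing import Callable, List, Tuple

SoundLaw = Callable[[str], str]


def make_law(pattern: str, replacement: str) -> SoundLaw:
    """Build one rewrite function from a regex pattern (hypothetical helper)."""
    compiled = re.compile(pattern)
    return lambda word: compiled.sub(replacement, word)


# Toy, invented sound laws. Order matters: each law sees the output
# of the laws before it.
SOUND_LAWS: List[SoundLaw] = [
    make_law(r"k(?=[ie])", "tʃ"),  # k > tʃ before front vowels i, e
    make_law(r"e", "a"),           # e > a everywhere
    make_law(r"p", "f"),           # p > f everywhere
]

# Invented (protoform, reflex) pairs used as the "examples".
EXAMPLES: List[Tuple[str, str]] = [
    ("kepi", "tʃafi"),
    ("kapa", "kafa"),
    ("pike", "fitʃa"),
]


def apply_cascade(protoform: str, laws: List[SoundLaw]) -> str:
    """Apply every sound law in order to a single protoform."""
    word = protoform
    for law in laws:
        word = law(word)
    return word


def pass_fraction(pairs: List[Tuple[str, str]], laws: List[SoundLaw]) -> float:
    """Fraction of pairs whose reflex the cascade reproduces exactly."""
    hits = sum(apply_cascade(proto, laws) == reflex for proto, reflex in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    print(apply_cascade("kepi", SOUND_LAWS))
    print(pass_fraction(EXAMPLES, SOUND_LAWS))
```

Swapping the first two laws changes the result for `kepi`, because `e > a` then removes the front vowel that would have triggered `k > tʃ`. This is why the abstract stresses that the rewrite functions are ordered.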

cascade, llm, mortensen, (14 more...)

arXiv.org Artificial Intelligence

2406.12725

Country:

Oceania > Niue (0.05)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > New York > New York County > New York City (0.04)
(8 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Ranking Entities along Conceptual Space Dimensions with LLMs: An Analysis of Fine-Tuning Strategies

Kumar, Nitesh, Chatterjee, Usashi, Schockaert, Steven

arXiv.org Artificial Intelligence, Jun-5-2024

Conceptual spaces represent entities in terms of their primitive semantic features. Such representations are highly valuable but they are notoriously difficult to learn, especially when it comes to modelling perceptual and subjective features. Distilling conceptual spaces from Large Language Models (LLMs) has recently emerged as a promising strategy, but existing work has been limited to probing pre-trained LLMs using relatively simple zero-shot strategies. We focus in particular on the task of ranking entities according to a given conceptual space dimension. Unfortunately, we cannot directly fine-tune LLMs on this task, because ground truth rankings for conceptual space dimensions are rare. We therefore use more readily available features as training data and analyse whether the ranking capabilities of the resulting models transfer to perceptual and subjective features. We find that this is indeed the case, to some extent, but having at least some perceptual and subjective features in the training data seems essential for achieving the best results.

computational linguistic, dataset, knowledge, (14 more...)

arXiv.org Artificial Intelligence

2402.15337

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Ontario > Toronto (0.04)
Asia > Singapore (0.04)
(60 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Media > Film (0.93)
Leisure & Entertainment (0.93)
Education (0.93)
Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)

A lexicon obtained and validated by a data-driven approach for organic residues valorization in emerging and developing countries

Rakotomalala, Christiane, Paillat, Jean-Marie, Feder, Frédéric, Avadí, Angel, Thuriès, Laurent, Vermeire, Marie-Liesse, Médoc, Jean-Michel, Wassenaar, Tom, Hottelart, Caroline, Kieffer, Lilou, Ndjie, Elisa, Picart, Mathieu, Tchamgoue, Jorel, Tulle, Alvin, Valade, Laurine, Boyer, Annie, Duchamp, Marie-Christine, Roche, Mathieu

arXiv.org Artificial Intelligence, Jun-2-2024

The text mining method presented in this paper was used to annotate terms related to the biological transformation and valorization of organic residues in agriculture in low- and middle-income countries. A specialized lexicon was obtained in several steps: corpus and term extraction, annotation of the extracted terms, and selection of relevant terms.

montpellier, recyclage et risque, valorization, (11 more...)

arXiv.org Artificial Intelligence

2406.00682

Country:

Africa > Saint Helena, Ascension and Tristan da Cunha (0.29)
North America > Central America (0.14)
Asia > North Korea (0.14)
(132 more...)

Genre: Research Report (0.64)

Industry: Food & Agriculture > Agriculture (0.93)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of Vietnamese Large Language Models

Truong, Sang T., Nguyen, Duc Q., Nguyen, Toan, Le, Dong D., Truong, Nhi N., Quan, Tho, Koyejo, Sanmi

arXiv.org Artificial Intelligence, May-26-2024

We employ fine-tuning on the LLaMa-2, Mixtral 8×7B, Gemma, and conduct a comprehensive evaluation of Vietnamese LLMs across various scenarios and settings. Throughout the thorough evaluation process, we observe the following: (i) larger language models exhibit unseen capabilities compared to smaller counterparts; (ii) larger language models tend to manifest more biases, produce uncalibrated results, and are more susceptible to the influence of input prompts; (iii) the quality of training or fine-tuning datasets is the key for unlocking LLM performance. Our key contributions include:

Large language models (LLMs) such as GPT-4 (OpenAI, 2023), BLOOM (Le Scao et al., 2023), LLaMa-2 (Touvron et al., 2023), Mistral (Jiang et al., 2023), Mixtral (Jiang et al., 2024), Gemma (Team et al., 2024) have made significant contributions to the field of natural language processing (NLP). Despite their advancements, a gap remains in their specialization for many languages, including Vietnamese. This paper addresses the development and evaluation of Vietnamese-centric LLMs. Vietnam, with a population surpassing 100 million, ranks as the 16th most populous country globally.

dataset, gemini, gpt-3, (15 more...)

arXiv.org Artificial Intelligence

2403.02715

Country:

Asia > Middle East > Qatar (0.27)
Europe > Norway (0.14)
Asia > Middle East > Kuwait (0.14)
(100 more...)

Genre: Research Report > New Finding (0.92)

Industry:

Government (1.00)
Education (1.00)
Health & Medicine (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Using Model-Theoretic Approaches to Uncover Linguistic Organization

Griffin, Olivia, Sun, Jerry

arXiv.org Artificial Intelligence, May-13-2024

Various scholars have proposed the idea that there are different ways for a form-meaning pairing to be iconic, and that these different types of iconicity may interact with one another (Buchler, 1986; Reiger, 1998; Rozhansky, 2015). As a way of formalizing this idea, Lǐ and Ponsford (2018) identify five features pertaining to the form of fully reduplicated words that are in an iconic relationship with some aspect of a meaning that was found to be marked by total reduplication. Based on these formal features, they propose the following five dimensions of iconicity ('iconicities' in Lǐ and Ponsford (2018)) that can be manifested by reduplication patterns: (1) Balinese pluractional markers: keplug 'explode'; keplug~keplug 'explode repeatedly'; pa-keplug 'X (plural) explode simultaneously' (Arka and Dalrymple, 2017). Notice that the repeated-explosion event is marked by a form that repeats keplug, while the event where all of the explosions happen at once (no repetition) is marked by a form that does not involve any repetition. Viewed through this lens, the Balinese pluractional prefix pa- is not entirely arbitrary, because it highlights the distinction between two types of pluractionality that are marked in Balinese. This is a case of iconicity because a property of the form (repetition or non-repetition) is also a property of the associated meaning. In this paper, we consider pluractional markers in Kaqchikel, Karuk, and Yurok.

complexity, pluractionality, reduplication, (15 more...)

arXiv.org Artificial Intelligence

2405.07597

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > California > Santa Cruz County > Santa Cruz (0.04)
(10 more...)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)

Laissez-Faire Harms: Algorithmic Biases in Generative Language Models

Shieh, Evan, Vassel, Faye-Marie, Sugimoto, Cassidy, Monroe-White, Thema

arXiv.org Artificial Intelligence, Apr-16-2024

The rapid deployment of generative language models (LMs) has raised concerns about social biases affecting the well-being of diverse consumers. The extant literature on generative LMs has primarily examined bias via explicit identity prompting. However, prior research on bias in earlier language-based technology platforms, including search engines, has shown that discrimination can occur even when identity terms are not specified explicitly. Studies of bias in LM responses to open-ended prompts (where identity classifications are left unspecified) are lacking and have not yet been grounded in end-consumer harms. Here, we advance studies of generative LM bias by considering a broader set of natural use cases via open-ended prompting. In this "laissez-faire" setting, we find that synthetically generated texts from five of the most pervasive LMs (ChatGPT3.5, ChatGPT4, Claude2.0, Llama2, and PaLM2) perpetuate harms of omission, subordination, and stereotyping for minoritized individuals with intersectional race, gender, and/or sexual orientation identities (AI/AN, Asian, Black, Latine, MENA, NH/PI, Female, Non-binary, Queer). We find widespread evidence of bias to an extent that such individuals are hundreds to thousands of times more likely to encounter LM-generated outputs that portray their identities in a subordinated manner compared to representative or empowering portrayals. We also document a prevalence of stereotypes (e.g. perpetual foreigner) in LM-generated outputs that are known to trigger psychological harms that disproportionately affect minoritized individuals. These include stereotype threat, which leads to impaired cognitive performance and increased negative self-perception. Our findings highlight the urgent need to protect consumers from discriminatory harms caused by language models and invest in critical AI education programs tailored towards empowering diverse consumers.

chatgpt3, claude2, dataset, (15 more...)

arXiv.org Artificial Intelligence

2404.07475

Country:

North America > Haiti (0.27)
Europe (0.14)
Asia > Timor-Leste (0.14)
(55 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Law > Civil Rights & Constitutional Law (1.00)
Information Technology (1.00)
Health & Medicine > Therapeutic Area (1.00)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
