AITopics | latn 2


latn 2


A Appendix

Neural Information Processing Systems, Feb-17-2026, 07:56:21 GMT

The complete list may be seen in Table 8. Here are a few general notes about these strings: 1. ... Based on their recommendations, we did the following: 1. zh, zh_Latn: ... This resulted in the special filters described below. ... URLs) the corpora were in languages different from the LangID predictions. ... This is mainly mis-rendered PDFs and may have practical applications for denoising, or for decoding such garbled PDFs.

latn, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

Oceania > Tonga (0.04)
North America > United States (0.04)
South America > Peru > Huánuco Department > Huánuco Province > Huánuco (0.04)
(24 more...)

Industry: Law (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Communications > Social Media (0.67)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.46)


A Appendix A.1 LangID Details

Neural Information Processing Systems, Oct-9-2025, 08:30:30 GMT

latn, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

Oceania > Tonga (0.04)
North America > United States (0.04)
South America > Peru > Huánuco Department > Huánuco Province > Huánuco (0.04)
(24 more...)

Industry: Law (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Communications > Social Media (0.67)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.46)


An Expanded Massive Multilingual Dataset for High-Performance Language Technologies

Burchell, Laurie, de Gibert, Ona, Arefyev, Nikolay, Aulamo, Mikko, Bañón, Marta, Chen, Pinzhen, Fedorova, Mariia, Guillou, Liane, Haddow, Barry, Hajič, Jan, Helcl, Jindřich, Henriksson, Erik, Klimaszewski, Mateusz, Komulainen, Ville, Kutuzov, Andrey, Kytöniemi, Joona, Laippala, Veronika, Mæhlum, Petter, Malik, Bhavitvya, Mehryary, Farrokh, Mikhailov, Vladislav, Moghe, Nikita, Myntti, Amanda, O'Brien, Dayyán, Oepen, Stephan, Pal, Proyag, Piha, Jousia, Pyysalo, Sampo, Ramírez-Sánchez, Gema, Samuel, David, Stepachev, Pavel, Tiedemann, Jörg, Variš, Dušan, Vojtěchová, Tereza, Zaragoza-Bernabeu, Jaume

arXiv.org Artificial Intelligence, Mar-14-2025

Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.

artificial intelligence, machine translation, natural language, (18 more...)

arXiv.org Artificial Intelligence

2503.10267

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Ontario > Toronto (0.04)
Europe > Russia (0.04)
(66 more...)

Genre: Research Report > New Finding (0.67)

Industry:

Information Technology (0.67)
Education (0.46)
Media > News (0.46)
Leisure & Entertainment > Games (0.45)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)


OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining

Liu, Yihong, Lin, Peiqin, Wang, Mingyang, Schütze, Hinrich

arXiv.org Artificial Intelligence, Nov-15-2023

Pretraining multilingual language models from scratch requires considerable computational resources and substantial training data. Therefore, a more efficient method is to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining. However, this method usually randomly initializes the embeddings of new subwords and introduces substantially more embedding parameters to the language model, thus weakening the efficiency. To address these issues, we propose a novel framework: One For All (OFA), which wisely initializes the embeddings of unseen subwords from target languages and thus can adapt a PLM to multiple languages efficiently and effectively. OFA takes advantage of external well-aligned multilingual word embeddings and injects the alignment knowledge into the new embeddings. In addition, OFA applies matrix factorization and replaces the cumbersome embeddings with two lower-dimensional matrices, which significantly reduces the number of parameters while not sacrificing the performance. Through extensive experiments, we show models initialized by OFA are efficient and outperform several baselines. OFA not only accelerates the convergence of continued pretraining, which is friendly to a limited computation budget, but also improves the zero-shot crosslingual transfer on a wide range of downstream tasks. We make our code and models publicly available.
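As a rough illustration of why replacing one embedding matrix with two lower-dimensional matrices reduces the parameter count, the sketch below counts parameters before and after factorization. The vocabulary size, hidden dimension, and low-rank dimension are hypothetical values chosen for illustration, not figures from the paper, and the function names are invented for this example.

```python
def full_embedding_params(vocab_size: int, hidden_dim: int) -> int:
    """Parameters in a single vocab_size x hidden_dim embedding matrix."""
    return vocab_size * hidden_dim


def factorized_embedding_params(vocab_size: int, hidden_dim: int, low_dim: int) -> int:
    """Parameters when the embedding is split into two matrices:
    a vocab_size x low_dim matrix and a low_dim x hidden_dim projection."""
    return vocab_size * low_dim + low_dim * hidden_dim


if __name__ == "__main__":
    # Hypothetical sizes, assumed only for this example.
    vocab_size, hidden_dim, low_dim = 400_000, 768, 100

    full = full_embedding_params(vocab_size, hidden_dim)
    factorized = factorized_embedding_params(vocab_size, hidden_dim, low_dim)

    print(f"full embedding matrix: {full:,} parameters")
    print(f"factorized embeddings: {factorized:,} parameters")
    print(f"reduction factor:      {full / factorized:.1f}x")
```

The saving grows with the vocabulary size, because the large vocabulary dimension is multiplied only by the small low-rank dimension rather than by the full hidden dimension.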

latn 2, latn 3, latn 4, (15 more...)

arXiv.org Artificial Intelligence

2311.08849

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Ontario > Toronto (0.04)
North America > United States > New York > New York County > New York City (0.04)
(13 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.87)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.67)


ChatGPT MT: Competitive for High- (but not Low-) Resource Languages

Robinson, Nathaniel R., Ogayo, Perez, Mortensen, David R., Neubig, Graham

arXiv.org Artificial Intelligence, Sep-14-2023

Large language models (LLMs) implicitly learn to perform a range of language tasks, including machine translation (MT). Previous studies explore aspects of LLMs' MT capabilities. However, there exist a wide variety of languages for which recent LLM MT performance has never before been evaluated. Without published experimental evidence on the matter, it is difficult for speakers of the world's diverse languages to know how and whether they can use LLMs for their languages. We present the first experimental evidence for an expansive set of 204 languages, along with MT cost analysis, using the FLORES-200 benchmark. Trends reveal that GPT models approach or exceed traditional MT model performance for some high-resource languages (HRLs) but consistently lag for low-resource languages (LRLs), under-performing traditional MT for 84.1% of languages we covered. Our analysis reveals that a language's resource level is the most important feature in determining ChatGPT's relative ability to translate it, and suggests that ChatGPT is especially disadvantaged for LRLs and African languages.

chatgpt, latn 0, translation, (13 more...)

arXiv.org Artificial Intelligence

2309.07423

Country:

Africa > Niger (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Europe > Denmark > Capital Region > Copenhagen (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


MADLAD-400: A Multilingual And Document-Level Large Audited Dataset

Kudugunta, Sneha, Caswell, Isaac, Zhang, Biao, Garcia, Xavier, Choquette-Choo, Christopher A., Lee, Katherine, Xin, Derrick, Kusupati, Aditya, Stella, Romi, Bapna, Ankur, Firat, Orhan

arXiv.org Artificial Intelligence, Sep-8-2023

We introduce MADLAD-400, a manually audited, general domain 3T token monolingual dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations revealed by self-auditing MADLAD-400, and the role data auditing had in the dataset creation process. We then train and release a 10.7B-parameter multilingual machine translation model on 250 billion tokens covering over 450 languages using publicly available data, and find that it is competitive with models that are significantly larger, and report the results on different domains. In addition, we train an 8B-parameter language model, and assess the results on few-shot translation. We make the baseline models available to the research community.

dataset, latn, madlad-400, (15 more...)

arXiv.org Artificial Intelligence

2309.04662

Country:

Oceania > Tonga (0.04)
North America > United States (0.04)
Asia > Indonesia > Bali (0.04)
(30 more...)

Genre: Research Report (0.63)

Industry: Law (0.67)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Data Science > Data Quality (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
