madlad-400
A Appendix
A.1 LangID Details
The complete list may be seen in Table 8. Here are a few general notes about these strings: … Based on their recommendations, we did the following: … zh, zh_Latn: This resulted in the special filters described below. … URLs) the corpora were in languages different from the LangID predictions. This is mainly mis-rendered PDFs and may have practical applications for denoising, or for decoding such garbled PDFs.
- Oceania > Tonga (0.04)
- North America > United States (0.04)
- South America > Peru > Huánuco Department > Huánuco Province > Huánuco (0.04)
HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models
Oepen, Stephan, Arefev, Nikolay, Aulamo, Mikko, Bañón, Marta, Buljan, Maja, Burchell, Laurie, Charpentier, Lucas, Chen, Pinzhen, Fedorova, Mariya, de Gibert, Ona, Haddow, Barry, Hajič, Jan, Helcl, Jindřich, Kutuzov, Andrey, Laippala, Veronika, Li, Zihao, Luukkonen, Risto, Malik, Bhavitvya, Mikhailov, Vladislav, Myntti, Amanda, O'Brien, Dayyán, Poláková, Lucie, Pyysalo, Sampo, Sánchez, Gema Ramírez, Siewert, Janine, Stepachev, Pavel, Tiedemann, Jörg, Vahtola, Teemu, Variš, Dušan, Vitiugin, Fedor, Vojtěchová, Tea, Zaragoza, Jaume
We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data. These datasets are derived from web crawls from different sources and accompanied by a complete, open-source pipeline for document selection from web archives, text extraction from HTML, language identification for noisy texts, exact and near-deduplication, annotation with, among others, register labels, text quality estimates, and personally identifiable information; and final selection and filtering. We report on data quality probes through contrastive and analytical statistics, through manual inspection of samples for 24 languages, and through end-to-end evaluation of various language model architectures trained on this data. For multilingual LLM evaluation, we provide a comprehensive collection of benchmarks for nine European languages, with special emphasis on natively created tasks, mechanisms to mitigate prompt sensitivity, and refined normalization and aggregation of scores. Additionally, we train and evaluate a family of 57 monolingual encoder-decoder models, as well as a handful of monolingual GPT-like reference models. Besides the monolingual data and models, we also present a very large collection of parallel texts automatically mined from this data, together with a novel parallel corpus synthesized via machine translation.
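The exact and near-deduplication step mentioned in the pipeline above can be illustrated with a minimal sketch. This is not the HPLT pipeline itself (which operates at web scale, typically with MinHash-style signatures); it is a toy illustration assuming content-hash matching for exact duplicates and character n-gram Jaccard similarity for near-duplicates. All function names here are hypothetical.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def exact_dedup(docs):
    """Keep the first document for each normalized content hash."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

def char_ngrams(text: str, n: int = 5):
    """Character n-gram shingles of the normalized text."""
    t = normalize(text)
    return {t[i:i + n] for i in range(max(len(t) - n + 1, 1))}

def near_dedup(docs, threshold: float = 0.8):
    """Greedy near-deduplication: drop a document whose shingle Jaccard
    similarity to any already-kept document reaches the threshold."""
    kept, shingles = [], []
    for doc in docs:
        s = char_ngrams(doc)
        if all(len(s & t) / len(s | t) < threshold for t in shingles):
            kept.append(doc)
            shingles.append(s)
    return kept
```

Production systems avoid the quadratic all-pairs comparison in `near_dedup` by hashing shingle sets into compact signatures (MinHash) and bucketing them with locality-sensitive hashing, but the kept/dropped decision is the same idea.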
- Europe > Austria > Vienna (0.15)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > Canada > Ontario > Toronto (0.04)
MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
Kudugunta, Sneha, Caswell, Isaac, Zhang, Biao, Garcia, Xavier, Choquette-Choo, Christopher A., Lee, Katherine, Xin, Derrick, Kusupati, Aditya, Stella, Romi, Bapna, Ankur, Firat, Orhan
We introduce MADLAD-400, a manually audited, general domain 3T token monolingual dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations revealed by self-auditing MADLAD-400, and the role data auditing had in the dataset creation process. We then train and release a 10.7B-parameter multilingual machine translation model trained on 250 billion tokens covering over 450 languages using publicly available data, and find that it is competitive with models that are significantly larger, and report the results on different domains. In addition, we train an 8B-parameter language model, and assess the results on few-shot translation. We make the baseline models available to the research community.
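The appendix excerpt earlier in this page alludes to LangID-based filtering, where documents whose detected language disagrees with the expected corpus language are dropped or routed for auditing. The sketch below illustrates the general pattern only; it is not MADLAD-400's actual pipeline, and `langid_predict` is a hypothetical stand-in for any language-identification classifier that returns a language code and a confidence score.

```python
from typing import Callable, Iterable, List, Tuple

def filter_by_langid(
    docs: Iterable[str],
    expected_lang: str,
    langid_predict: Callable[[str], Tuple[str, float]],
    min_confidence: float = 0.95,
) -> Tuple[List[str], List[str]]:
    """Keep documents whose predicted language matches `expected_lang`
    with at least `min_confidence`; collect the rest for manual audit.
    `langid_predict` is an assumed external classifier, not part of
    the MADLAD-400 release."""
    kept, rejected = [], []
    for doc in docs:
        lang, confidence = langid_predict(doc)
        if lang == expected_lang and confidence >= min_confidence:
            kept.append(doc)
        else:
            rejected.append(doc)
    return kept, rejected
```

Keeping the rejected documents around, rather than silently discarding them, is what makes the manual-audit loop described in the abstract possible: mis-rendered PDFs and mislabeled corpora show up in the reject stream.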
- Oceania > Tonga (0.04)
- North America > United States (0.04)
- Asia > Indonesia > Bali (0.04)
- Information Technology > Communications (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Data Science > Data Quality (0.92)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)