AITopics | language model pretraining

Technology: Information Technology > Artificial Intelligence (0.80)

Neural Information Processing SystemsDec-24-2025, 21:06:27 GMT

COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining

We present a self-supervised learning framework, COCO-LM, that pretrains Language Models by COrrecting and COntrasting corrupted text sequences. Following ELECTRA-style pretraining, COCO-LM employs an auxiliary language model to corrupt text sequences, upon which it constructs two new tasks for pretraining the main model. The first token-level task, Corrective Language Modeling, is to detect and correct tokens replaced by the auxiliary model, in order to better capture token-level semantics. The second sequence-level task, Sequence Contrastive Learning, is to align text sequences originated from the same source input while ensuring uniformity in the representation space. Experiments on GLUE and SQuAD demonstrate that COCO-LM not only outperforms recent state-of-the-art pretrained models in accuracy, but also improves pretraining efficiency. It achieves the MNLI accuracy of ELECTRA with 50% of its pretraining GPU hours. With the same pretraining steps of standard base/large-sized models, COCO-LM outperforms the previous best models by 1+ GLUE average points.

coco-lm, correcting and contrasting text sequence, language model pretraining, (4 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.78)

Interrante-Grant, Alexander, Varela-Rosa, Carla, Narayan, Suhaas, Connelly, Chris, Reuther, Albert

Scaling Performance of Large Language Model Pretraining

arXiv.org Artificial IntelligenceOct-10-2025

Training these models is an extremely computationally expensive task; frontier Artificial Intelligence (AI) research companies are investing billions of dollars into supercomputing infrastructure to train progressively larger models on increasingly massive datasets. Unfortunately, very little information about the scaling performance and training considerations of these large training pipelines is released publicly. Working with very large datasets and models can be complex and practical recommendations are scarce in the public literature for tuning training performance when scaling up large language models. In this paper, we aim to demystify the large language model pretraining pipeline somewhat - in particular with respect to distributed training, managing large datasets across hundreds of nodes, and scaling up data parallelism with an emphasis on fully leveraging available GPU compute capacity. Index T erms--large language models, distributed training, data parallelism.

large language model, machine learning, natural language, (19 more...)

2509.05258

Country:

North America > United States > Massachusetts > Middlesex County > Lexington (0.05)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.05)

Genre: Research Report (0.66)

Industry:

Government > Regional Government (0.50)
Government > Military (0.31)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.75)

Xie, Wanyun, Tonin, Francesco, Cevher, Volkan

Chameleon: A Flexible Data-mixing Framework for Language Model Pretraining and Finetuning

arXiv.org Artificial IntelligenceJun-2-2025

Training data mixtures greatly impact the generalization performance of large language models. Existing domain reweighting methods often rely on costly weight computations and require retraining when new data is introduced. To this end, we introduce a flexible and efficient data mixing framework, Chameleon, that employs leverage scores to quantify domain importance within a learned embedding space. We first construct a domain affinity matrix over domain embeddings. The induced leverage scores determine a mixture that upweights domains sharing common representations in embedding space. This formulation allows direct transfer to new data by computing the new domain embeddings. In experiments, we demonstrate improvements over three key scenarios: (i) our computed weights improve performance on pretraining domains with a fraction of the compute of existing methods; (ii) Chameleon can adapt to data changes without proxy retraining, boosting few-shot reasoning accuracies when transferred to new data; (iii) our method enables efficient domain reweighting in finetuning, consistently improving test perplexity on all finetuning domains over uniform mixture. Our code is available at https://github.com/LIONS-EPFL/Chameleon.

large language model, machine learning, natural language, (16 more...)

2505.24844

Country:

North America > United States (0.28)
North America > Canada > Ontario > Toronto (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report > New Finding (0.92)

Industry:

Education (0.46)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)

Neural Information Processing SystemsMay-27-2025, 12:43:41 GMT

DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly affect language model (LM) performance. In this paper, we propose Domain Reweighting with Minimax Optimization (DoReMi), which first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights (mixture proportions) without knowledge of downstream tasks. We then resample a dataset with these domain weights and train a larger, full-sized model. In our experiments, we use DoReMi on a 280M-parameter proxy model to set the domain weights for training an 8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves perplexity across all domains, even when it downweights a domain.

artificial intelligence, domain weight, natural language, (9 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language (0.65)

arXiv.org Artificial IntelligenceFeb-16-2025

AdaGC: Improving Training Stability for Large Language Model Pretraining

Wang, Guoxia, Li, Shuai, Chen, Congliang, Zeng, Jinle, Yang, Jiabin, Sun, Tao, Ma, Yanjun, Yu, Dianhai, Shen, Li

Large Language Models (LLMs) face increasing loss spikes during scaling, undermining training stability and final performance. While gradient clipping mitigates this issue, traditional global approaches poorly handle parameter-specific gradient variations and decaying gradient norms. We propose **AdaGC**, an adaptive gradient clipping framework that automatically adjusts local thresholds per parameter through exponential moving average of gradient norms. Theoretical analysis proves AdaGC's convergence under non-convex conditions. Extensive experiments demonstrate significant improvements: On Llama-2 7B/13B, AdaGC completely eliminates loss spikes while reducing WikiText perplexity by 3.5% (+0.14pp LAMBADA accuracy) for 7B and achieving 0.65% lower training loss with 1.47% reduced validation perplexity for 13B compared to global clipping. For CLIP ViT-Base, AdaGC converges 25% faster than StableAdamW with full spike elimination. The method shows universal effectiveness across architectures (Llama-2 7B/13B) and modalities (CLIP), with successful integration into diverse optimizers like AdamW and Lion. Source code will be released on GitHub.

adagc, large language model, machine learning, (17 more...)

2502.11034

Country:

Asia > China > Guangdong Province > Shenzhen (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing SystemsJan-20-2025, 00:26:27 GMT

DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly affect language model (LM) performance. In this paper, we propose Domain Reweighting with Minimax Optimization (DoReMi), which first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights (mixture proportions) without knowledge of downstream tasks. We then resample a dataset with these domain weights and train a larger, full-sized model. In our experiments, we use DoReMi on a 280M-parameter proxy model to set the domain weights for training an 8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves perplexity across all domains, even when it downweights a domain.

domain weight, doremi, optimizing data mixture speed, (7 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language (0.65)

Neural Information Processing SystemsJan-19-2025, 01:04:45 GMT

COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining

We present a self-supervised learning framework, COCO-LM, that pretrains Language Models by COrrecting and COntrasting corrupted text sequences. Following ELECTRA-style pretraining, COCO-LM employs an auxiliary language model to corrupt text sequences, upon which it constructs two new tasks for pretraining the main model. The first token-level task, Corrective Language Modeling, is to detect and correct tokens replaced by the auxiliary model, in order to better capture token-level semantics. The second sequence-level task, Sequence Contrastive Learning, is to align text sequences originated from the same source input while ensuring uniformity in the representation space. Experiments on GLUE and SQuAD demonstrate that COCO-LM not only outperforms recent state-of-the-art pretrained models in accuracy, but also improves pretraining efficiency.

coco-lm, correcting and contrasting text sequence, language model pretraining, (1 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

Bădoiu, Vlad-Andrei, Dumitru, Mihai-Valentin, Gherghescu, Alexandru M., Agache, Alexandru, Raiciu, Costin

FuLG: 150B Romanian Corpus for Language Model Pretraining

arXiv.org Artificial IntelligenceJul-18-2024

Research in the field of language models is rapidly evolving, with many open models being released to the public. Openly available pretraining corpora usually focus on only a handful of languages, with many others either missing completely or extremely underrepresented. In this report, we introduce FuLG, a hundred-fifty-billion-token Romanian corpus extracted from CommonCrawl. We present our methodology for filtering FuLG and compare it via ablation studies against existing Romanian corpora.

arxiv preprint arxiv, dataset, fulg, (10 more...)

2407.13657

Country:

Europe > Romania > București - Ilfov Development Region > Municipality of Bucharest > Bucharest (0.07)
Europe > Finland (0.05)
Europe > Czechia > Moravian-Silesian Region > Ostrava (0.04)
(2 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.47)

arXiv.org Artificial IntelligenceJul-11-2024

Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining

Ge, Ce, Ma, Zhijian, Chen, Daoyuan, Li, Yaliang, Ding, Bolin

Large language models exhibit exceptional generalization capabilities, primarily attributed to the utilization of diversely sourced data. However, conventional practices in integrating this diverse data heavily rely on heuristic schemes, lacking theoretical guidance. This research tackles these limitations by investigating strategies based on low-cost proxies for data mixtures, with the aim of streamlining data curation to enhance training efficiency. Specifically, we propose a unified scaling law, termed $\textbf{BiMix}$, which accurately models the bivariate scaling behaviors of both data quantity and mixing proportions. We conduct systematic experiments and provide empirical evidence for the predictive power and fundamental principles of $\textbf{BiMix}$. Notably, our findings reveal that entropy-driven training-free data mixtures can achieve comparable or even better performance than more resource-intensive methods. We hope that our quantitative insights can shed light on further judicious research and development in cost-effective language modeling.

bivariate scaling law, data mixing, language model pretraining

2405.14908

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)