From Collapse to Improvement: Statistical Perspectives on the Evolutionary Dynamics of Iterative Training on Contaminated Sources
Bakshi, Soham, Chakraborty, Sunrit
The problem of model collapse has presented new challenges in iterative training of generative models, where such training with synthetic data leads to an overall degradation of performance. This paper looks at the problem from a statistical viewpoint, illustrating that one can actually hope for improvement when models are trained on data contaminated with synthetic samples, as long as there is some amount of fresh information from the true target distribution. In particular, we consider iterative training on samples sourced from a mixture of the true target and synthetic distributions. We analyze the entire iterative evolution in a next-token prediction language model, capturing how the interplay between the mixture weights and the sample size controls the overall long-term performance. With a non-trivial mixture weight on the true distribution, even if it decays over time, simply training the model in a contamination-agnostic manner with appropriate sample sizes can avoid collapse and, under certain conditions, even recover the true target distribution. Simulation studies support our findings and show that such behavior extends to other classes of models.
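A minimal simulation sketch of the kind of iterative loop this abstract describes, not the authors' setup: at each generation, the training data mixes fresh samples from the true distribution (weight lam_t) with synthetic samples drawn from the previously fitted model, and the model is refit in a contamination-agnostic way. The vocabulary size, sample sizes, and decaying mixture schedule below are illustrative assumptions.

```python
# Sketch: iterative refitting of a categorical "next-token" distribution on a
# contaminated mixture of fresh true samples and the previous model's samples.
import numpy as np

rng = np.random.default_rng(0)
V = 50                                # vocabulary size (assumed)
p_true = rng.dirichlet(np.ones(V))    # true token distribution (assumed)

def fit(samples, V):
    """Maximum-likelihood (empirical-frequency) estimate of a categorical model."""
    counts = np.bincount(samples, minlength=V)
    return counts / counts.sum()

p_model = fit(rng.choice(V, size=10_000, p=p_true), V)   # generation-0 model

for t in range(1, 31):
    n_t = 10_000                      # per-generation sample size (assumed)
    lam_t = 0.5 / np.sqrt(t)          # decaying weight on fresh true data (assumed)
    n_true = rng.binomial(n_t, lam_t)
    data = np.concatenate([
        rng.choice(V, size=n_true, p=p_true),          # fresh samples from the target
        rng.choice(V, size=n_t - n_true, p=p_model),   # synthetic samples from the last model
    ])
    p_model = fit(data, V)            # contamination-agnostic refit
    tv = 0.5 * np.abs(p_model - p_true).sum()
    print(f"generation {t:2d}: total-variation distance to truth = {tv:.4f}")
```

Tracking the total-variation distance across generations makes it easy to see whether a given mixture schedule and sample size lead to collapse, stabilization, or recovery of the target distribution.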
Agnostic Learning with Multiple Objectives
Most machine learning tasks are inherently multi-objective: the learner has to come up with a model that performs well across a number of base objectives $\mathcal{L}_{1}, \ldots, \mathcal{L}_{p}$, as opposed to a single one. Since optimizing with respect to multiple objectives at the same time is often computationally expensive, the base objectives are often combined into an ensemble $\sum_{k=1}^{p}\lambda_{k}\mathcal{L}_{k}$, thereby reducing the problem to scalar optimization. The mixture weights $\lambda_{k}$ are set to uniform or to some other fixed distribution, based on the learner's preferences. We argue that learning with a fixed distribution on the mixture weights runs the risk of overfitting to some individual objectives and significantly harming others, despite performing well on the ensemble as a whole. Moreover, in reality, the true preferences of a learner across multiple objectives are often unknown or hard to express as a specific distribution. Instead, we propose a new framework of \emph{Agnostic Learning with Multiple Objectives} (ALMO), in which a model is optimized for \emph{any} weights in the mixture of base objectives. We present data-dependent Rademacher complexity guarantees for learning in the ALMO framework, which are used to guide a scalable optimization algorithm and the corresponding regularization.
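A minimal sketch of the weight-agnostic idea, not the paper's algorithm: instead of fixing the mixture weights, the model parameters take gradient steps on the $\lambda$-weighted loss while the weights themselves take multiplicative-weights steps toward the currently worst-off objectives, approximating a worst-case (minimax) solution over the simplex. The two quadratic base objectives, step sizes, and iteration count are illustrative assumptions.

```python
# Sketch: alternating descent on the model and mirror-ascent on the mixture weights.
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200
X = rng.normal(size=(n, d))
y1 = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)   # target for objective 1 (assumed)
y2 = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)   # target for objective 2 (assumed)

def losses_and_grads(w):
    """Per-objective mean-squared errors and their gradients with respect to w."""
    r1, r2 = X @ w - y1, X @ w - y2
    L = np.array([np.mean(r1 ** 2), np.mean(r2 ** 2)])
    G = np.stack([2 * X.T @ r1 / n, 2 * X.T @ r2 / n])
    return L, G

w = np.zeros(d)
lam = np.ones(2) / 2                  # mixture weights on the probability simplex
eta_w, eta_lam = 0.05, 0.5            # step sizes (assumed)
for _ in range(500):
    L, G = losses_and_grads(w)
    w -= eta_w * (lam @ G)            # descend on the lambda-weighted ensemble loss
    lam *= np.exp(eta_lam * L)        # ascend: up-weight the worse-off objectives
    lam /= lam.sum()
print("worst-case objective value:", losses_and_grads(w)[0].max())
```

The contrast with a fixed-weight ensemble is the adaptive $\lambda$ update: objectives that lag behind receive more weight, so the returned model is not allowed to sacrifice one base objective to excel on another.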
Nemotron-CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
Diao, Shizhe, Yang, Yu, Fu, Yonggan, Dong, Xin, Su, Dan, Kliegl, Markus, Chen, Zijia, Belcak, Peter, Suhara, Yoshi, Yin, Hongxu, Patwary, Mostofa, Lin, Yingyan, Kautz, Jan, Molchanov, Pavlo
Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (Nemotron-CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, Nemotron-CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. When continuously trained on 400B tokens with this mixture, our 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce Nemotron-ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and Nemotron-ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal data mixture. Our data is available at: https://research.nvidia.com/labs/lpr/climb/
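A minimal sketch of the cluster-then-search loop described above, with assumptions throughout and not NVIDIA's implementation: documents are embedded and clustered, candidate cluster-mixture weights are proposed, a small batch is scored by training a proxy model (here a placeholder scoring function), and a simple predictor fitted on the scored mixtures ranks which candidates to evaluate in the next round.

```python
# Sketch: clustering-based iterative search over data-mixture weights.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5_000, 64))     # stand-in document embeddings (assumed)
k = 8                                         # number of clusters (assumed)
clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
print("documents per cluster:", np.bincount(clusters))   # pools the mixture samples from

def proxy_score(mixture):
    """Placeholder for 'train a small proxy model on data sampled with this mixture
    and report a benchmark score'; here an arbitrary synthetic response (assumed)."""
    target = np.linspace(1, 2, k); target /= target.sum()
    return -np.abs(mixture - target).sum() + 0.01 * rng.normal()

evaluated, scores = [], []
for round_ in range(4):                       # iterative bootstrapping rounds (assumed)
    candidates = rng.dirichlet(np.ones(k), size=200)
    if evaluated:                             # rank candidates with the fitted predictor
        predictor = RandomForestRegressor(random_state=0).fit(np.array(evaluated), scores)
        candidates = candidates[np.argsort(-predictor.predict(candidates))[:10]]
    else:
        candidates = candidates[:10]
    for m in candidates:                      # score a small batch with the proxy
        evaluated.append(m)
        scores.append(proxy_score(m))

best = np.array(evaluated)[int(np.argmax(scores))]
print("best cluster mixture found:", np.round(best, 3))
```

In a real pipeline the placeholder score would come from actually pre-training a small proxy model on data drawn from the candidate mixture, and the predictor would steer the search toward mixtures expected to improve downstream performance.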