AITopics | doremi

DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

Xie, Sang Michael

Neural Information Processing Systems, Feb-17-2026, 12:01:26 GMT

We then resample a dataset with these domain weights and train a larger, full-sized model.

domain weight, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > Sweden > Uppsala County > Uppsala (0.04)
Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
(2 more...)

Industry: Energy (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

Neural Information Processing Systems, Dec-26-2025, 23:09:29 GMT

The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly affect language model (LM) performance. In this paper, we propose Domain Reweighting with Minimax Optimization (DoReMi), which first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights (mixture proportions) without knowledge of downstream tasks. We then resample a dataset with these domain weights and train a larger, full-sized model. In our experiments, we use DoReMi on a 280M-parameter proxy model to set the domain weights for training an 8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves perplexity across all domains, even when it downweights a domain. DoReMi improves average few-shot downstream accuracy by 6.5 percentage points over a baseline model trained using The Pile's default domain weights and reaches the baseline accuracy with 2.6x fewer training steps. On the GLaM dataset, DoReMi, which has no knowledge of downstream tasks, even matches the performance of using domain weights tuned on downstream tasks.
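The abstract does not spell out how Group DRO turns per-domain losses into domain weights. As a rough illustration only, the sketch below assumes an exponentiated-gradient step on each domain's excess loss relative to a reference model, followed by renormalization and a small uniform smoothing term. The function name, the reference losses, and the `step_size` and `smoothing` values are all hypothetical choices for this example, not details taken from the entry above.

```python
import math


def update_domain_weights(weights, proxy_losses, reference_losses,
                          step_size=1.0, smoothing=1e-3):
    """One illustrative exponentiated-gradient step on domain weights.

    Assumption: a domain is upweighted in proportion to how much worse the
    proxy model is than a reference model on it (excess loss, clipped at 0).
    """
    excess = [max(p - r, 0.0) for p, r in zip(proxy_losses, reference_losses)]
    # Multiplicative update: domains with larger excess loss grow faster.
    scaled = [w * math.exp(step_size * e) for w, e in zip(weights, excess)]
    total = sum(scaled)
    normalized = [s / total for s in scaled]
    # Mix with the uniform distribution so no domain weight collapses to zero.
    uniform = 1.0 / len(weights)
    return [(1.0 - smoothing) * n + smoothing * uniform for n in normalized]


# Three hypothetical domains starting from uniform weights.
weights = [1 / 3, 1 / 3, 1 / 3]
new_weights = update_domain_weights(
    weights,
    proxy_losses=[3.2, 2.9, 4.0],
    reference_losses=[3.0, 3.0, 3.5],
)
```

In this toy run the third domain has the largest excess loss and so ends up with the largest weight, while the second domain (where the proxy already beats the reference) is downweighted.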

domain weight, doremi, optimizing data mixture speed, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.80)

DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

Xie, Sang Michael

Neural Information Processing Systems, Oct-9-2025, 09:21:19 GMT

We then resample a dataset with these domain weights and train a larger, full-sized model.

domain weight, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > Sweden > Uppsala County > Uppsala (0.04)
Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
(2 more...)

Industry: Energy (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

DIDS: Domain Impact-aware Data Sampling for Large Language Model Training

Shi, Weijie, Zhang, Jipeng, Wu, Yaguang, Fang, Jingzhi, Zhang, Ruiyuan, Xu, Jiajie, Zhu, Jia, Chen, Hao, Zhao, Yao, Han, Sirui, Zhou, Xiaofang

arXiv.org Artificial Intelligence, Aug-25-2025

Large language models (LLMs) are commonly trained on multi-domain datasets, where domain sampling strategies significantly impact model performance due to varying domain importance across downstream tasks. Existing approaches for optimizing domain-level sampling strategies struggle with maintaining intra-domain consistency and accurately measuring domain impact. In this paper, we present Domain Impact-aware Data Sampling (DIDS). To ensure intra-domain consistency, a gradient clustering algorithm is proposed to group training data based on their learning effects, where a proxy language model and dimensionality reduction are employed to reduce computational overhead. To accurately measure domain impact, we develop a Fisher Information Matrix (FIM) guided metric that quantifies how domain-specific parameter updates affect the model's output distributions on downstream tasks, with theoretical guarantees. Furthermore, to determine optimal sampling ratios, DIDS combines both the FIM-guided domain impact assessment and loss learning trajectories that indicate domain-specific potential, while accounting for diminishing marginal returns. Extensive experiments demonstrate that DIDS achieves 3.4% higher average performance while maintaining comparable training efficiency. The code is available at https://github.com/shiweijiezero/DIDS.
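The abstract does not define the FIM-guided metric, so it is not reproduced here. As background only, the sketch below checks numerically the standard relation that such metrics typically build on: for a small parameter change `delta`, the KL divergence between the old and new output distributions is approximately `0.5 * delta^T F delta`, where `F` is the Fisher Information Matrix. The example uses a categorical distribution parameterized by its logits, for which `F = diag(p) - p p^T`; all function names and numeric values are illustrative assumptions.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with matching support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def fisher_quadratic_form(logits, delta):
    """0.5 * delta^T F delta with F = diag(p) - p p^T (Fisher w.r.t. logits).

    For this F the quadratic form equals the variance of delta under p.
    """
    p = softmax(logits)
    mean = sum(pi * di for pi, di in zip(p, delta))
    second_moment = sum(pi * di * di for pi, di in zip(p, delta))
    return 0.5 * (second_moment - mean * mean)


# Hypothetical logits and a small parameter update.
logits = [0.5, -1.0, 2.0]
delta = [0.01, -0.02, 0.015]

exact = kl_divergence(softmax(logits),
                      softmax([l + d for l, d in zip(logits, delta)]))
approx = fisher_quadratic_form(logits, delta)
```

For small updates the two quantities agree closely, which is why a Fisher-based quadratic form can serve as a cheap proxy for how much a parameter update shifts a model's output distribution.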

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2504.13227

Country: Asia (0.46)

Genre: Research Report > New Finding (0.67)

Industry: Education > Curriculum > Subject-Specific Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

Neural Information Processing Systems, May-27-2025, 12:43:41 GMT

The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly affect language model (LM) performance. In this paper, we propose Domain Reweighting with Minimax Optimization (DoReMi), which first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights (mixture proportions) without knowledge of downstream tasks. We then resample a dataset with these domain weights and train a larger, full-sized model. In our experiments, we use DoReMi on a 280M-parameter proxy model to set the domain weights for training an 8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves perplexity across all domains, even when it downweights a domain.

artificial intelligence, domain weight, natural language, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language (0.65)

DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

Neural Information Processing Systems, Jan-20-2025, 00:26:27 GMT

The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly affect language model (LM) performance. In this paper, we propose Domain Reweighting with Minimax Optimization (DoReMi), which first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights (mixture proportions) without knowledge of downstream tasks. We then resample a dataset with these domain weights and train a larger, full-sized model. In our experiments, we use DoReMi on a 280M-parameter proxy model to set the domain weights for training an 8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves perplexity across all domains, even when it downweights a domain.

domain weight, doremi, optimizing data mixture speed, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language (0.65)

DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

Xie, Sang Michael, Pham, Hieu, Dong, Xuanyi, Du, Nan, Liu, Hanxiao, Lu, Yifeng, Liang, Percy, Le, Quoc V., Ma, Tengyu, Yu, Adams Wei

arXiv.org Artificial Intelligence, Nov-20-2023

The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly affect language model (LM) performance. In this paper, we propose Domain Reweighting with Minimax Optimization (DoReMi), which first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights (mixture proportions) without knowledge of downstream tasks. We then resample a dataset with these domain weights and train a larger, full-sized model. In our experiments, we use DoReMi on a 280M-parameter proxy model to set the domain weights for training an 8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves perplexity across all domains, even when it downweights a domain. DoReMi improves average few-shot downstream accuracy by 6.5 percentage points over a baseline model trained using The Pile's default domain weights and reaches the baseline accuracy with 2.6x fewer training steps. On the GLaM dataset, DoReMi, which has no knowledge of downstream tasks, even matches the performance of using domain weights tuned on downstream tasks.
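The resampling step mentioned in the abstract can be pictured as drawing each training document by first picking a domain according to the domain weights and then picking a document within that domain. The sketch below is a minimal illustration under that assumption; the function name, the toy corpus, and the weight values are hypothetical, and a real pipeline would stream and shard data rather than sample from in-memory lists.

```python
import random


def resample_by_domain(domain_to_docs, domain_weights, num_samples, seed=0):
    """Draw (domain, document) pairs with domains chosen by weight."""
    rng = random.Random(seed)
    domains = list(domain_to_docs)
    probs = [domain_weights[d] for d in domains]
    sample = []
    for _ in range(num_samples):
        # Pick a domain in proportion to its weight, then a document uniformly.
        domain = rng.choices(domains, weights=probs, k=1)[0]
        sample.append((domain, rng.choice(domain_to_docs[domain])))
    return sample


# Toy corpus with three hypothetical domains.
corpus = {
    "wikipedia": ["wiki doc 1", "wiki doc 2"],
    "books": ["book doc 1"],
    "web": ["web doc 1", "web doc 2", "web doc 3"],
}
weights = {"wikipedia": 0.7, "books": 0.3, "web": 0.0}
resampled = resample_by_domain(corpus, weights, num_samples=1000)
```

With a weight of zero the "web" domain never appears in the resampled data, and the other two domains appear in roughly a 70/30 split.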

domain weight, doremi, proxy model, (14 more...)

arXiv.org Artificial Intelligence

2305.10429

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > Sweden > Uppsala County > Uppsala (0.04)
Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
(2 more...)

Genre: Research Report (0.84)

Industry: Energy (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment

Guo, Yanjiang, Wang, Yen-Jen, Zha, Lihan, Jiang, Zheyuan, Chen, Jianyu

arXiv.org Artificial Intelligence, Sep-30-2023

Large language models (LLMs) encode a vast amount of semantic knowledge and possess remarkable understanding and reasoning capabilities. Previous work has explored how to ground LLMs in robotic tasks to generate feasible and executable textual plans. However, low-level execution in the physical world may deviate from the high-level textual plan due to environmental perturbations or imperfect controller design. In this paper, we propose DoReMi, a novel language model grounding framework that enables immediate Detection and Recovery from Misalignments between plan and execution. Specifically, we leverage LLMs to play a dual role, aiding not only in high-level planning but also generating constraints that can indicate misalignment during execution. Then vision language models (VLMs) are utilized to detect constraint violations continuously. Our pipeline can monitor the low-level execution and enable timely recovery if certain plan-execution misalignment occurs. Experiments on various complex tasks including robot arms and humanoid robots demonstrate that our method can lead to higher task success rates and shorter task completion times. Videos of DoReMi are available at https://sites.google.com/view/doremi-paper.

constraint, green block, red block, (14 more...)

arXiv.org Artificial Intelligence

2307.00329

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.94)
