AITopics

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Neural Information Processing Systems Feb-17-2026, 04:22:52 GMT

a4628e9fbd3002a554923642f74d5d6b-Paper-Conference.pdf

large language model, machine learning, natural language

Country: Asia > Thailand > Bangkok > Bangkok (0.04)

Genre: Research Report > Experimental Study (1.00)

Industry: Law (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.92)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Neural Information Processing Systems Oct-10-2025, 12:08:42 GMT

a4628e9fbd3002a554923642f74d5d6b-Paper-Conference.pdf

d-cpt law, mixture ratio, validation loss

Country: Asia > Thailand > Bangkok > Bangkok (0.04)

Genre: Research Report > Experimental Study (1.00)

Industry: Law (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.92)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

arXiv.org Artificial Intelligence Aug-26-2025

TiKMiX: Take Data Influence into Dynamic Mixture for Language Model Pre-training

Wang, Yifan, Liu, Binbin, Liu, Fengze, Guo, Yuanfan, Deng, Jiyao, Wu, Xuecheng, Zhou, Weidong, Zhou, Xiaohuan, Wang, Taifeng

The data mixture used in the pre-training of a language model is a cornerstone of its final performance. However, a static mixing strategy is suboptimal, as the model's learning preferences for various data domains shift dynamically throughout training. Crucially, observing these evolving preferences in a computationally efficient manner remains a significant challenge. To address this, we propose TiKMiX, a method that dynamically adjusts the data mixture according to the model's evolving preferences. TiKMiX introduces Group Influence, an efficient metric for evaluating the impact of data domains on the model. This metric enables the formulation of the data mixing problem as a search for an optimal, influence-maximizing distribution. We solve this via two approaches: TiKMiX-D for direct optimization, and TiKMiX-M, which uses a regression model to predict a superior mixture. We trained models with different numbers of parameters, on up to 1 trillion tokens. TiKMiX-D exceeds the performance of state-of-the-art methods like REGMIX while using just 20% of the computational resources. TiKMiX-M leads to an average performance gain of 2% across 9 downstream benchmarks. Our experiments reveal that a model's data preferences evolve with training progress and scale, and we demonstrate that dynamically adjusting the data mixture based on Group Influence, a direct measure of these preferences, significantly improves performance by mitigating the underdigestion of data seen with static ratios.
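As a rough illustration of treating the mixture as an influence-maximizing distribution, the sketch below reweights a current mixture using per-domain influence scores. It is not the TiKMiX algorithm: the KL-regularized objective, the `temperature` parameter, the function name `reweight_mixture`, and the example scores are all assumptions made for illustration, and the computation of Group Influence itself is not shown.

```python
import math


def reweight_mixture(current, influence, temperature=1.0):
    """Reweight a data mixture toward domains with higher influence.

    Assumed objective (not taken from the paper): maximize
        sum_i w_i * influence_i - temperature * KL(w || current)
    over the probability simplex. Its closed-form solution is
        w_i proportional to current_i * exp(influence_i / temperature).
    """
    if len(current) != len(influence):
        raise ValueError("current and influence must have the same length")
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(influence)
    unnormalized = [
        weight * math.exp((score - top) / temperature)
        for weight, score in zip(current, influence)
    ]
    total = sum(unnormalized)
    return [value / total for value in unnormalized]


# Hypothetical three-domain mixture and made-up influence scores.
current_mixture = [0.5, 0.3, 0.2]
influence_scores = [0.0, 1.0, 0.0]
new_mixture = reweight_mixture(current_mixture, influence_scores, temperature=1.0)
```

A larger `temperature` keeps the new mixture close to the current one, while a smaller value concentrates weight on the highest-scoring domains. Repeating such an update at several points during training is one way to make a mixture dynamic rather than static.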

arxiv preprint arxiv, machine learning, natural language

2508.17677

Genre: Research Report > Promising Solution (0.34)

Industry: Education (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)

arXiv.org Artificial Intelligence Jun-13-2025

Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training

Zhang, Mozhi, Tissue, Howe, Wang, Lu, Qiu, Xipeng

We introduce Domain2Vec, a novel approach that decomposes any dataset into a linear combination of several meta-domains, a new concept designed to capture the key underlying features of datasets. Domain2Vec maintains a vocabulary of meta-domains and uses a classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary. These domain vectors enable the identification of the optimal data mixture for language model (LM) pretraining in a training-free manner under the Distribution Alignment Assumption (DA²), which suggests that when the data distributions of the training set and the validation set are better aligned, a lower validation loss is achieved. Moreover, Domain2Vec can be seamlessly integrated into previous works to model the relationship between domain vectors and LM performance, greatly enhancing the efficiency and scalability of previous methods. Extensive experiments demonstrate that Domain2Vec helps find the data mixture that enhances downstream task performance with minimal computational overhead. Specifically, Domain2Vec achieves the same validation loss on Pile-CC using only 51.5% of the computation required when training on the original mixture of The Pile dataset. Under equivalent compute budget, Domain2Vec improves downstream performance by an average of 2.83%.
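The distribution-alignment idea can be sketched in a few lines: given a domain vector per training dataset and a domain vector for the validation set, search for mixture weights whose weighted combination lies closest to the validation vector. This is only an illustration under stated assumptions. The Euclidean distance, the coarse grid search, and the names `mix_vectors`, `simplex_grid`, and `best_aligned_mixture` are choices made here and are not taken from the paper, and the example vectors are invented.

```python
from itertools import product


def mix_vectors(weights, domain_vectors):
    """Weighted combination of per-dataset domain vectors."""
    dim = len(domain_vectors[0])
    return [
        sum(w * vec[k] for w, vec in zip(weights, domain_vectors))
        for k in range(dim)
    ]


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def simplex_grid(n_parts, steps):
    """Yield all weight vectors with entries in {0, 1/steps, ..., 1} summing to 1."""
    for combo in product(range(steps + 1), repeat=n_parts - 1):
        used = sum(combo)
        if used <= steps:
            yield [c / steps for c in combo] + [(steps - used) / steps]


def best_aligned_mixture(domain_vectors, target_vector, steps=20):
    """Grid-search the mixture whose combined domain vector is closest to the target."""
    return min(
        simplex_grid(len(domain_vectors), steps),
        key=lambda w: l2_distance(mix_vectors(w, domain_vectors), target_vector),
    )


# Invented example: two training datasets described over two meta-domains.
train_vectors = [[1.0, 0.0], [0.0, 1.0]]
validation_vector = [0.25, 0.75]
weights = best_aligned_mixture(train_vectors, validation_vector, steps=20)
```

No model is trained at any point in this search, which is the sense in which such a procedure is training-free. The grid grows quickly with the number of datasets, so a gradient-based or convex solver would be the more practical choice beyond a handful of sources.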

data mixture, large language model, machine learning

2506.10952

Country: North America > United States (0.93)

Genre: Research Report > New Finding (0.67)

Industry: Energy (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

Neural Information Processing Systems May-27-2025, 11:27:22 GMT

D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models

d-cpt law, domain-specific continual pre-training scaling law, law

Industry: Law (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

arXiv.org Artificial Intelligence Jan-2-2025

Baichuan4-Finance Technical Report

Zhang, Hanyu, Qiu, Boyu, Feng, Yuhao, Li, Shuqi, Ma, Qian, Zhang, Xiyuan, Ju, Qiang, Yan, Dong, Xie, Jian

Large language models (LLMs) have demonstrated strong capabilities in language understanding, generation, and reasoning, yet their potential in finance remains underexplored due to the complexity and specialization of financial knowledge. In this work, we report the development of the Baichuan4-Finance series, including a comprehensive suite of foundational Baichuan4-Finance-Base models and an aligned language model, Baichuan4-Finance, which are built upon the Baichuan4-Turbo base model and tailored for the finance domain. First, we have dedicated significant effort to building a detailed pipeline for improving data quality. Moreover, in the continual pre-training phase, we propose a novel domain self-constraint training strategy, which enables Baichuan4-Finance-Base to acquire financial knowledge without losing general capabilities. After Supervised Fine-tuning and Reinforcement Learning from Human Feedback and AI Feedback, the chat model Baichuan4-Finance is able to tackle various financial certification questions and real-world scenario applications. We evaluate Baichuan4-Finance on many widely used general datasets and two holistic financial benchmarks. The evaluation results show that Baichuan4-Finance-Base surpasses almost all competitive baselines on financial tasks by significant margins without sacrificing performance on general LLM benchmarks. At the same time, Baichuan4-Finance demonstrates even more impressive performance on financial application scenarios, showcasing its potential to foster community innovation in the financial LLM field.

arxiv preprint arxiv, baichuan4-finance-base, dataset

2412.1527

Country: Asia > China (0.05)

Genre: Research Report > New Finding (0.48)

Industry:

Banking & Finance (1.00)
Information Technology > Software (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial Intelligence Sep-10-2024

A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio

Xi, Ningyuan, Wu, Yetao, Fan, Kun, Chen, Teng, Gu, Qingqing, Yu, Peng, Qu, Jinxian, Liu, Chenxi, Jiang, Zhonglin, Chen, Yong, Ji, Luo

Large Language Models (LLMs) often need to be Continually Pre-Trained (CPT) to acquire unfamiliar language skills or adapt to new domains. The huge training cost of CPT calls for a cautious choice of key hyper-parameters, such as the mixture ratio of the extra language or domain corpus. However, there is no systematic study that bridges the gap between the optimal mixture ratio and the actual model performance, or the gap between experimental scaling laws and actual deployment at the full model size. In this paper, we perform CPT on Llama-3 8B and 70B to enhance its Chinese ability. We study the optimal correlation between the Additional Language Mixture Ratio (ALMR) and the Learning Rate (LR) on the 8B size, which directly indicates the optimal experimental setup. Through a careful choice of hyper-parameters and subsequent fine-tuning, the model's capability is improved not only on Chinese-related benchmarks but also in specific domains including math, coding, and emotional intelligence. We deploy the final 70B version of the LLM in a real-life chat system, which obtains satisfying performance.

arxiv, benchmark, preprint

2409.06624

Country:

Europe > Portugal > Lisbon > Lisbon (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)
Asia > Middle East > Jordan (0.04)
Asia > China > Zhejiang Province > Ningbo (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial Intelligence Jul-24-2024

CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models

Gu, Jiawei, Yang, Zacc, Ding, Chuanghao, Zhao, Rui, Tan, Fei

Large Language Models (LLMs) excel in diverse tasks but often underperform in specialized fields due to limited domain-specific or proprietary corpora. Continual pre-training (CPT) enhances LLM capabilities by imbuing new domain-specific or proprietary knowledge while replaying a general corpus to prevent catastrophic forgetting. The data mixture ratio of general corpus and domain-specific corpus, however, has been chosen heuristically, leading to sub-optimal training efficiency in practice. In this context, we attempt to revisit the scaling behavior of LLMs under CPT and discover a power-law relationship between loss, mixture ratio, and training token scale. We formalize the trade-off between general and domain-specific capabilities, leading to a well-defined Critical Mixture Ratio (CMR) of general and domain data. By striking this balance, CMR maintains the model's general ability and achieves the desired domain transfer, ensuring the highest utilization of available resources. Therefore, if we value the balance between efficiency and effectiveness, CMR can be considered the optimal mixture ratio. Through extensive experiments, we ascertain the predictability of CMR, propose the CMR scaling law, and substantiate its generalization. These findings offer practical guidelines for optimizing LLM training in specialized domains, ensuring both general and domain-specific performance while efficiently managing training resources.
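To make the notion of a power-law relationship concrete, the sketch below fits loss = a * tokens^(-b) by least squares in log-log space. It is a deliberately simplified stand-in and not the CMR scaling law: the abstract does not give the functional form that ties loss to mixture ratio, so the two-parameter form, the absence of an irreducible-loss term, the function names, and the synthetic data points are all assumptions.

```python
import math


def fit_power_law(tokens, losses):
    """Fit loss = a * tokens ** (-b) with ordinary least squares on log-log data.

    Taking logs gives log(loss) = log(a) - b * log(tokens), a straight line.
    Returns the pair (a, b).
    """
    if len(tokens) != len(losses) or len(tokens) < 2:
        raise ValueError("need at least two (tokens, loss) pairs of equal length")

    xs = [math.log(t) for t in tokens]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Closed-form slope and intercept of simple linear regression.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_loss(a, b, tokens):
    """Evaluate the fitted power law at a given token count."""
    return a * tokens ** (-b)


# Synthetic points generated from a known power law, for illustration only.
token_counts = [1e8, 1e9, 1e10, 1e11]
observed = [5.0 * t ** (-0.1) for t in token_counts]
a_hat, b_hat = fit_power_law(token_counts, observed)
```

In practice one such curve could be fitted per candidate mixture ratio, and the fitted curves compared at the target token budget. That comparison procedure is likewise an assumption here, not a description of how the paper determines the CMR.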

arxiv preprint arxiv, cmr, mixture ratio

2407.17467

Country:

Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Asia > Middle East > Jordan (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Law (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Artificial Intelligence Jul-17-2024

Balancing Immediate Revenue and Future Off-Policy Evaluation in Coupon Allocation

Nishimura, Naoki, Kobayashi, Ken, Nakata, Kazuhide

Coupon allocation drives customer purchases and boosts revenue. However, it presents a fundamental trade-off between exploiting the current optimal policy to maximize immediate revenue and exploring alternative policies to collect data for future policy improvement via off-policy evaluation (OPE). While online A/B testing can validate new policies, it risks compromising short-term revenue. Conversely, relying solely on an exploitative policy hinders the ability to reliably estimate and enhance future policies. To balance this trade-off, we propose a novel approach that combines a model-based revenue maximization policy and a randomized exploration policy for data collection. Our framework enables flexibly adjusting the mixture ratio between these two policies to optimize the balance between short-term revenue and future policy improvement. We formulate the problem of determining the optimal mixture ratio between a model-based revenue maximization policy and a randomized exploration policy for data collection. We empirically verified the effectiveness of the proposed mixed policy using both synthetic and real-world data. Our main contributions are: (1) Demonstrating a mixed policy combining deterministic and probabilistic policies, flexibly adjusting the data collection vs. revenue trade-off. (2) Formulating the optimal mixture ratio problem as multi-objective optimization, enabling quantitative evaluation of this trade-off. By optimizing the mixture ratio, businesses can maximize revenue while ensuring reliable future OPE and policy improvement. This framework is applicable in any context where the exploration-exploitation trade-off is relevant.
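A mixed policy of this kind can be sketched as follows: with probability `explore_ratio` a coupon is drawn uniformly at random, and otherwise the coupon with the highest predicted revenue is chosen. The sketch also returns the probability of the chosen action, since logged propensities are what off-policy estimators such as inverse propensity scoring rely on. The uniform exploration policy, the function names, and the example revenue predictions are assumptions made for illustration and are not taken from the paper.

```python
import random


def mixed_policy_probs(predicted_revenue, explore_ratio):
    """Action probabilities of a greedy policy mixed with uniform exploration.

    p(a) = (1 - explore_ratio) * 1[a is greedy] + explore_ratio / K
    """
    if not 0.0 <= explore_ratio <= 1.0:
        raise ValueError("explore_ratio must lie in [0, 1]")
    k = len(predicted_revenue)
    greedy = max(range(k), key=lambda a: predicted_revenue[a])
    probs = [explore_ratio / k] * k
    probs[greedy] += 1.0 - explore_ratio
    return probs


def sample_coupon(predicted_revenue, explore_ratio, rng=random):
    """Sample one coupon and return (action, propensity) for later OPE logging."""
    probs = mixed_policy_probs(predicted_revenue, explore_ratio)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, probs[action]


# Hypothetical predicted revenue for three coupon options.
revenue_estimates = [1.0, 3.0, 2.0]
probabilities = mixed_policy_probs(revenue_estimates, explore_ratio=0.3)
chosen, propensity = sample_coupon(revenue_estimates, 0.3, rng=random.Random(0))
```

Setting `explore_ratio` to 0 recovers the purely exploitative policy, under which non-greedy coupons have zero propensity and their value cannot be estimated from logged data. Raising the ratio gives up some immediate revenue in exchange for better coverage of the alternatives.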

artificial intelligence, data collection policy, machine learning

2407.11039

Country: Asia > Japan (0.29)

Genre: Research Report (1.00)

Industry: Energy > Oil & Gas > Upstream (0.49)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)