AITopics | Zhang, Xuemiao

Collaborating Authors

Zhang, Xuemiao

FRAME: Boosting LLMs with A Four-Quadrant Multi-Stage Pretraining Strategy

Zhang, Xuemiao, Duan, Feiyu, Xu, Liangyu, Zhou, Yongwei, Wang, Sirui, Weng, Rongxiang, Wang, Jingang, Cai, Xunliang

arXiv.org Artificial Intelligence, Feb-18-2025

Large language models (LLMs) have significantly advanced human language understanding and generation, with pretraining data quality and organization being crucial to their performance. Multi-stage pretraining is a promising approach, but existing methods often lack quantitative criteria for data partitioning and instead rely on intuitive heuristics. In this paper, we propose the novel Four-quadRAnt Multi-stage prEtraining strategy (FRAME), guided by the established principle of organizing the pretraining process into four stages to achieve significant loss reductions four times. This principle is grounded in two key findings: first, training on high-perplexity (PPL) data followed by low-PPL data causes the loss to drop significantly twice and improves performance; second, training on low PPL difference (PD) data followed by high-PD data has the same effect. By partitioning data into four quadrants and strategically organizing them, FRAME achieves a 16.8% average improvement over the random baseline across MMLU and CMMLU for the 3B model, effectively boosting LLM performance.
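The quadrant partition described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the PPL and PD thresholds are chosen, so the median split, the function name `assign_quadrants`, and the `ppl` and `pd` dictionary keys are all assumptions. The order in which the four quadrants are scheduled across stages is left to the caller, since the abstract does not spell it out.

```python
from statistics import median


def assign_quadrants(samples):
    """Split samples into four quadrants by PPL and PD.

    `samples` is a list of dicts with precomputed "ppl" and "pd" values.
    Thresholds are the medians of each metric (an assumption made for
    this sketch; the source does not specify the cut points).
    """
    ppl_cut = median(s["ppl"] for s in samples)
    pd_cut = median(s["pd"] for s in samples)

    quadrants = {
        ("high_ppl", "low_pd"): [],
        ("high_ppl", "high_pd"): [],
        ("low_ppl", "low_pd"): [],
        ("low_ppl", "high_pd"): [],
    }
    for s in samples:
        ppl_side = "high_ppl" if s["ppl"] > ppl_cut else "low_ppl"
        pd_side = "high_pd" if s["pd"] > pd_cut else "low_pd"
        quadrants[(ppl_side, pd_side)].append(s)
    return quadrants
```

Each quadrant can then be assigned to one pretraining stage; a caller would concatenate the four lists in whatever stage order the training plan calls for.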

large language model, natural language, ppl, (18 more...)

2502.05551

Country:

Africa (0.68)
Asia > India (0.67)

Genre: Research Report > New Finding (0.67)

Industry:

Government (1.00)
Law (0.92)
Energy (0.84)
Health & Medicine > Therapeutic Area (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Preference Curriculum: LLMs Should Always Be Pretrained on Their Preferred Data

Zhang, Xuemiao, Xu, Liangyu, Duan, Feiyu, Zhou, Yongwei, Wang, Sirui, Weng, Rongxiang, Wang, Jingang, Cai, Xunliang

arXiv.org Artificial Intelligence, Feb-17-2025

Large language models (LLMs) generally utilize a consistent data distribution throughout the pretraining process. However, as the model's capability improves, it is intuitive that its data preferences dynamically change, indicating the need for pretraining with different data at various training stages. To achieve this, we propose the Perplexity Difference (PD) based Preference Curriculum learning (PDPC) framework, which always perceives and uses the data preferred by LLMs to train and boost them. First, we introduce the PD metric to quantify the difference in how challenging a sample is for weak versus strong models. Samples with high PD are more challenging for weak models to learn and are better suited to the later stages of pretraining. Second, we propose the preference function to approximate and predict the data preference of the LLM at any training step, so that the dataset can be arranged offline and training can proceed without interruption. Experimental results on 1.3B and 3B models demonstrate that PDPC significantly surpasses baselines. Notably, the 3B model trained on 1T tokens achieves an average accuracy increase of over 8.1% across MMLU and CMMLU.
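A rough sketch of a PD-style score may help make the idea concrete. The abstract does not give the formula, so the version below is an assumption: PD is taken as the relative gap between a weak model's perplexity and a strong model's perplexity on the same sample. The function names and the per-token log-probability inputs are hypothetical, and the plain ascending sort is a simplification that stands in for the preference function described above.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_difference(weak_logprobs, strong_logprobs):
    """Assumed PD form: (PPL_weak - PPL_strong) / PPL_weak.

    A larger value means the sample is much harder for the weak
    model than for the strong one.
    """
    ppl_weak = perplexity(weak_logprobs)
    ppl_strong = perplexity(strong_logprobs)
    return (ppl_weak - ppl_strong) / ppl_weak


def order_by_pd(samples):
    """Sort samples so low-PD data comes first and high-PD data last.

    Each sample is a dict with a precomputed "pd" value (hypothetical key).
    """
    return sorted(samples, key=lambda s: s["pd"])
```

With this form, a sample that both models find equally hard scores 0, and the score approaches 1 as the strong model's perplexity becomes small relative to the weak model's.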

artificial intelligence, large language model, natural language, (17 more...)

2501.13126

Country: North America > United States (0.28)

Genre: Research Report (0.82)

Industry:

Health & Medicine (0.68)
Energy (0.47)
Leisure & Entertainment > Sports (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

FIRE: Flexible Integration of Data Quality Ratings for Effective Pre-Training

Xu, Liangyu, Zhang, Xuemiao, Duan, Feiyu, Wang, Sirui, Wang, Jingang, Cai, Xunliang

arXiv.org Artificial Intelligence, Feb-17-2025

Selecting high-quality data can significantly improve the pretraining efficiency of large language models (LLMs). Existing methods generally rely on heuristic techniques and single quality signals, limiting their ability to evaluate data quality comprehensively. In this work, we propose FIRE, a flexible and scalable framework for integrating multiple data quality raters, which allows for a comprehensive assessment of data quality across various dimensions. FIRE aligns multiple quality signals into a unified space and integrates diverse data quality raters to provide a comprehensive quality signal for each data point. Further, we introduce a progressive data selection scheme based on FIRE that iteratively refines the selection of high-quality data points. Experiments on the SlimPajama dataset reveal that FIRE outperforms other data selection methods and significantly enhances the pretrained model across a wide range of downstream tasks, with a 2.9% average performance improvement over Random and a reduction of more than half in the FLOPs needed to reach a given performance level.
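To illustrate what aligning several raters into one space and merging them could look like, here is a small sketch. It is not FIRE's actual alignment or integration method, which the abstract does not detail. As a stand-in, each rater's raw scores are rank-normalized to the range [0, 1] and then combined with a weighted average. The function names, the rater names, and the weights are all hypothetical.

```python
def rank_normalize(scores):
    """Map raw scores to [0, 1] by rank so raters on different scales
    become comparable. Ties are broken by input order."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    aligned = [0.0] * n
    for rank, i in enumerate(order):
        aligned[i] = rank / (n - 1) if n > 1 else 0.0
    return aligned


def combine_raters(rater_scores, weights=None):
    """Weighted average of rank-normalized scores from several raters.

    `rater_scores` maps a rater name to a list of raw scores, one per
    data point, all lists the same length. `weights` maps rater names
    to non-negative weights and defaults to equal weighting.
    """
    names = list(rater_scores)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    aligned = {name: rank_normalize(rater_scores[name]) for name in names}
    n_points = len(aligned[names[0]])
    return [
        sum(weights[name] * aligned[name][i] for name in names) / total
        for i in range(n_points)
    ]
```

Data points could then be kept or dropped by thresholding the combined score. Rank normalization is used here only because it needs no knowledge of each rater's scale.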

large language model, machine learning, natural language, (17 more...)

2502.00761

Country: North America > United States (0.92)

Genre: Research Report > New Finding (0.67)

Industry: Energy > Oil & Gas (0.93)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)
