Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance

Jun-21-2026, 04:25:57 GMT–Neural Information Processing Systems

Recent work on large language models (LLMs) has increasingly focused on post-training and alignment with datasets curated to enhance instruction following, world knowledge, and specialized skills. However, most post-training datasets used in leading open- and closed-source LLMs remain inaccessible to the public, with limited information about their construction process. This lack of transparency has motivated the recent development of open-source post-training corpora. While training on these open alternatives can yield performance comparable to that of leading models, systematic comparisons remain challenging due to the significant computational cost of conducting them rigorously at scale, and are therefore largely absent. As a result, it remains unclear how specific samples, task types, or curation strategies influence downstream performance when assessing data quality.

large language model, machine learning, natural language

Country:
- Europe (0.45)
- Asia (0.27)

Genre:
- Overview (0.92)
- Research Report
  - New Finding (1.00)
  - Experimental Study (1.00)

Industry:
- Education (0.67)
- Information Technology (0.45)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.93)
