Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning

Jun-19-2026, 04:16:42 GMT–Neural Information Processing Systems

Effective generalization in language models depends critically on the diversity of their training data. Yet existing diversity metrics often fall short of this goal, relying on surface-level heuristics that are decoupled from model behavior. This motivates us to ask: What kind of diversity in training data actually drives generalization in language models, and how can we measure and amplify it? Through large-scale empirical analyses spanning over 300 training runs, carefully controlled for data scale and quality, we show that data diversity can be a strong predictor of generalization in LLM reasoning, as measured by average model performance on unseen out-of-distribution benchmarks. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. Despite using a small off-the-shelf proxy model for gradients, G-Vendi consistently outperforms alternative measures, achieving strong correlation (Spearman's ρ ≈ 0.9) with out-of-distribution (OOD) performance on both natural language inference (NLI) and math reasoning tasks.
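The abstract describes G-Vendi only as "the entropy of model-induced gradients" and does not give a formula. As a rough illustration, the sketch below assumes a Vendi-Score-style definition: the exponentiated Shannon entropy of the eigenvalues of a normalized similarity matrix built from per-example feature vectors. The function name `vendi_style_score` is hypothetical, and the input rows stand in for gradients from a proxy model; this is not the authors' implementation.

```python
import numpy as np


def vendi_style_score(features: np.ndarray) -> float:
    """Exponentiated Shannon entropy of the eigenvalues of a normalized Gram matrix.

    `features` has shape (n_examples, dim). Each row is assumed (hypothetically)
    to be a per-example gradient vector taken from a small proxy model.
    """
    # L2-normalize each row so the Gram matrix has ones on its diagonal.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)

    # Dividing by n gives the Gram matrix unit trace, so its eigenvalues
    # are non-negative and sum to one, like a probability distribution.
    n = unit.shape[0]
    gram = unit @ unit.T / n

    # eigvalsh is appropriate because the Gram matrix is symmetric.
    eigvals = np.linalg.eigvalsh(gram)
    eigvals = eigvals[eigvals > 1e-12]  # drop numerically zero eigenvalues

    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


# Four identical directions behave like one effective example,
# while four mutually orthogonal directions behave like four.
identical = np.tile(np.array([[1.0, 2.0, 3.0]]), (4, 1))
orthogonal = np.eye(4) * 5.0
print(vendi_style_score(identical), vendi_style_score(orthogonal))
```

Under this assumed definition the score can be read as an effective number of distinct directions in the dataset: it ranges from 1, when every example points the same way, up to the number of examples, when all are mutually orthogonal.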

large language model, machine learning, natural language

Country:
- North America > United States (1.00)
- Europe (1.00)

Genre:
- Research Report > Experimental Study (1.00)

Industry:
- Education > Educational Setting (0.92)
- Media (0.67)
- Leisure & Entertainment > Sports
  - Horse Racing (0.93)
  - Tennis (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (0.93)
  - Machine Learning > Neural Networks
    - Deep Learning (0.68)
