Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning
Neural Information Processing Systems
Effective generalization in language models depends critically on the diversity of their training data. Yet existing diversity metrics often fall short of this goal, relying on surface-level heuristics that are decoupled from model behavior. This motivates us to ask: what kind of diversity in training data actually drives generalization in language models, and how can we measure and amplify it? Through large-scale empirical analyses spanning over 300 training runs, carefully controlled for data scale and quality, we show that data diversity can be a strong predictor of generalization in LLM reasoning, as measured by average model performance on unseen out-of-distribution benchmarks. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. Despite using a small off-the-shelf proxy model for gradients, G-Vendi consistently outperforms alternative measures, achieving strong correlation (Spearman's ρ ≈ 0.9) with out-of-distribution (OOD) performance on both natural language inference (NLI) and math reasoning tasks.
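The abstract says only that G-Vendi measures diversity via the entropy of model-induced gradients. As a rough illustration of what such a score could look like, the sketch below assumes a Vendi-Score-style formulation: the exponentiated Shannon entropy of the eigenvalues of a normalized gradient similarity matrix. The function name, the L2 normalization step, and the toy arrays standing in for proxy-model gradients are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def vendi_style_gradient_diversity(grads: np.ndarray) -> float:
    """Exponentiated entropy of the eigenvalues of a gradient similarity matrix.

    Hypothetical helper, not the paper's implementation. `grads` has shape
    (n_examples, dim); each row stands in for one per-example gradient taken
    from a proxy model. Rows are assumed to be non-zero.
    """
    # L2-normalize each gradient so the similarity matrix has a unit diagonal.
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    unit = grads / np.clip(norms, 1e-12, None)

    # Cosine-similarity matrix scaled by 1/n, so its eigenvalues sum to 1.
    n = unit.shape[0]
    kernel = unit @ unit.T / n

    # Shannon entropy of the eigenvalue distribution (near-zero values dropped).
    eigvals = np.linalg.eigvalsh(kernel)
    eigvals = eigvals[eigvals > 1e-12]
    entropy = -np.sum(eigvals * np.log(eigvals))

    # The exponentiated entropy reads as an "effective number" of distinct directions.
    return float(np.exp(entropy))


# Four mutually orthogonal gradients: score close to 4.
diverse = vendi_style_gradient_diversity(np.eye(4))

# Four copies of the same gradient: score close to 1.
redundant = vendi_style_gradient_diversity(np.tile(np.array([1.0, 2.0, 3.0]), (4, 1)))
```

Under these assumptions the score ranges from 1 (all gradients point in the same direction) to the number of examples (all gradients are mutually orthogonal), which is what makes it readable as an effective count of distinct training signals.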
Jun-19-2026, 04:16:42 GMT
- Country:
- North America > United States (1.00)
- Europe (1.00)
- Genre:
- Research Report > Experimental Study (1.00)
- Industry:
- Education > Educational Setting (0.92)
- Media (0.67)
- Leisure & Entertainment > Sports
- Horse Racing (0.93)
- Tennis (0.67)
- Technology: