Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning
–Neural Information Processing Systems
Data diversity is crucial for training a strong language model. Yet existing diversity metrics often diverge from this goal, measuring variations in heuristic features--like n-grams or embeddings--that are detached from how the model actually performs on a target task. This motivates us to ask: *Can we redefine data diversity--beyond measuring variations in heuristic features--in a way that better predicts model generalization?* Through large-scale empirical analyses spanning over 300 training runs, carefully controlled for data scale and quality, we show that data diversity can be a strong predictor of generalization in LLM reasoning, as measured by average model performance on unseen out-of-distribution (OOD) benchmarks. We introduce **G-Vendi**, a metric that quantifies diversity via the entropy of model-induced loss gradients. G-Vendi scales to million-sample datasets yet consistently outperforms heuristic alternatives, achieving strong correlation ($\text{Spearman's } \rho \approx 0.9$) with OOD performance across both natural language inference (NLI) and math reasoning tasks.
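To make the idea of "entropy of loss gradients" concrete, here is a minimal sketch of a Vendi-style score computed over per-sample gradient vectors. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify how gradients are extracted, projected, or normalized, so the function name `vendi_score_from_gradients`, the row-wise L2 normalization, and the linear (cosine) kernel are all assumptions. The score is the exponential of the Shannon entropy of the eigenvalues of the trace-normalized Gram matrix, which can be read as an effective number of distinct gradient directions.

```python
import numpy as np


def vendi_score_from_gradients(grads: np.ndarray) -> float:
    """Vendi-style diversity of per-sample gradients (hypothetical helper).

    grads: array of shape (n_samples, dim), one gradient vector per sample.
    Returns exp(Shannon entropy) of the eigenvalues of the normalized
    cosine-similarity kernel, a value between 1 and n_samples.
    """
    # Assumption: L2-normalize each gradient so the kernel has unit diagonal.
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    unit = grads / np.clip(norms, 1e-12, None)

    # Dividing the Gram matrix by n gives trace 1, so its eigenvalues
    # are non-negative and sum to 1, like a probability distribution.
    n = unit.shape[0]
    kernel = unit @ unit.T / n

    # eigvalsh is appropriate because the kernel is symmetric.
    eigvals = np.linalg.eigvalsh(kernel)
    eigvals = eigvals[eigvals > 1e-12]  # drop numerical zeros before the log

    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))
```

Under these assumptions, a dataset whose samples all induce the same gradient direction scores about 1, while n mutually orthogonal gradients score about n. The n-by-n Gram matrix used here would not scale to the million-sample datasets the abstract mentions; a larger-scale variant would need a different formulation, which this sketch does not attempt.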
Jun-13-2026, 01:42:16 GMT