Statsformer: Validated Ensemble Learning with LLM-Derived Semantic Priors
Zhang, Erica, Sagan, Naomi, Tse, Danny, Zhang, Fangzhao, Pilanci, Mert, Blanchet, Jose
We introduce Statsformer, a principled framework for integrating large language model (LLM)-derived knowledge into supervised statistical learning. Existing approaches are limited in adaptability and scope: they either inject LLM guidance as an unvalidated heuristic, which is sensitive to LLM hallucination, or embed semantic information within a single fixed learner. Statsformer overcomes both limitations through a guardrailed ensemble architecture. We embed LLM-derived feature priors within an ensemble of linear and nonlinear learners, adaptively calibrating their influence via cross-validation. This design yields a flexible system with an oracle-style guarantee that it performs no worse than any convex combination of its in-library base learners, up to statistical error. Empirically, informative priors yield consistent performance improvements, while uninformative or misspecified LLM guidance is automatically downweighted, mitigating the impact of hallucinations across a diverse range of prediction tasks.
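The guardrail idea in the abstract, combining base learners with weights chosen on held-out predictions, can be illustrated with a small sketch. This is not the authors' implementation: the function names (`fit_simplex_weights`, `combine`, `mse`), the toy out-of-fold predictions, and the exponentiated-gradient optimizer are all assumptions made for illustration. It only shows how convex weights fitted on out-of-fold predictions can shift mass away from a learner whose prior was misleading.

```python
import math

def mse(pred, y):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)

def combine(oof_preds, w):
    """Convex combination of per-learner prediction lists."""
    n = len(oof_preds[0])
    return [sum(w[k] * oof_preds[k][i] for k in range(len(w))) for i in range(n)]

def fit_simplex_weights(oof_preds, y, steps=2000, lr=0.1):
    """Minimize squared error over the probability simplex.

    Uses exponentiated-gradient updates (an assumed optimizer, chosen
    because it keeps weights nonnegative and summing to one).
    """
    k_learners, n = len(oof_preds), len(y)
    w = [1.0 / k_learners] * k_learners
    for _ in range(steps):
        resid = [b - t for b, t in zip(combine(oof_preds, w), y)]
        grad = [
            2.0 / n * sum(r * oof_preds[k][i] for i, r in enumerate(resid))
            for k in range(k_learners)
        ]
        g_min = min(grad)  # shift for numerical stability; cancels on renormalizing
        w = [wk * math.exp(-lr * (g - g_min)) for wk, g in zip(w, grad)]
        total = sum(w)
        w = [wk / total for wk in w]
    return w

# Hypothetical out-of-fold predictions from three base learners.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
oof_preds = [
    [1.1, 1.9, 3.2, 3.8, 5.1, 6.0],  # learner using an informative prior
    [2.0, 2.5, 3.0, 3.5, 4.0, 4.5],  # learner using no prior
    [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],  # learner using a misleading prior
]

weights = fit_simplex_weights(oof_preds, y)
ensemble_mse = mse(combine(oof_preds, weights), y)
best_single_mse = min(mse(p, y) for p in oof_preds)
```

In this toy setup the fitted weights concentrate on the first learner and the misleading learner receives almost no weight, so the ensemble error stays close to that of the best single learner. A real pipeline would produce the out-of-fold predictions by actual cross-validation over fitted models.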
Jan-30-2026
- Country:
  - Europe
    - Italy > Apulia > Bari (0.04)
    - United Kingdom > England > Cambridgeshire > Cambridge (0.04)
  - North America
    - Canada > British Columbia (0.04)
    - United States
      - California
        - Alameda County > Berkeley (0.04)
        - Santa Clara County > Palo Alto (0.04)
      - Florida > Palm Beach County > Boca Raton (0.04)
- Genre:
  - Research Report > New Finding (0.46)
- Industry:
- Technology: