Statsformer: Validated Ensemble Learning with LLM-Derived Semantic Priors

Zhang, Erica; Sagan, Naomi; Tse, Danny; Zhang, Fangzhao; Pilanci, Mert; Blanchet, Jose

Jan-30-2026, arXiv.org Machine Learning

We introduce Statsformer, a principled framework for integrating large language model (LLM)-derived knowledge into supervised statistical learning. Existing approaches are limited in adaptability and scope: they either inject LLM guidance as an unvalidated heuristic, which is sensitive to LLM hallucination, or embed semantic information within a single fixed learner. Statsformer overcomes both limitations through a guardrailed ensemble architecture. We embed LLM-derived feature priors within an ensemble of linear and nonlinear learners, adaptively calibrating their influence via cross-validation. This design yields a flexible system with an oracle-style guarantee that it performs no worse than any convex combination of its in-library base learners, up to statistical error. Empirically, informative priors yield consistent performance improvements, while uninformative or misspecified LLM guidance is automatically downweighted, mitigating the impact of hallucinations across a diverse range of prediction tasks.
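The guardrail idea in the abstract, calibrating the influence of a prior-guided learner by cross-validation so that it cannot do worse than its prior-free alternative on held-out data, can be illustrated with a minimal sketch. This is not the Statsformer procedure itself: the function names, the two-learner setup, the grid search, and the toy numbers are all assumptions made for illustration. The sketch takes out-of-fold predictions from one prior-guided and one prior-free learner and picks the convex weight that minimizes held-out squared error.

```python
from typing import Sequence


def mse(y: Sequence[float], pred: Sequence[float]) -> float:
    """Mean squared error between targets and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)


def select_convex_weight(
    y: Sequence[float],
    oof_prior: Sequence[float],
    oof_plain: Sequence[float],
    steps: int = 20,
) -> float:
    """Return the weight lam in [0, 1] on the prior-guided learner that
    minimizes the squared error of lam * prior + (1 - lam) * plain.

    The inputs are assumed to be out-of-fold predictions, so the loss
    being minimized is a cross-validated estimate, not training error.
    """
    best_lam, best_loss = 0.0, float("inf")
    for i in range(steps + 1):
        lam = i / steps
        blend = [lam * p + (1 - lam) * q for p, q in zip(oof_prior, oof_plain)]
        loss = mse(y, blend)
        if loss < best_loss:
            best_lam, best_loss = lam, loss
    return best_lam


# Toy held-out targets and predictions (hypothetical numbers).
y = [1.0, 2.0, 3.0, 4.0]
plain = [1.5, 1.5, 3.5, 3.5]              # prior-free learner
helpful_prior = [1.0, 2.0, 3.0, 4.0]      # prior-guided learner, accurate prior
misleading_prior = [4.0, 3.0, 2.0, 1.0]   # prior-guided learner, bad prior

lam_good = select_convex_weight(y, helpful_prior, plain)
lam_bad = select_convex_weight(y, misleading_prior, plain)
print(lam_good, lam_bad)
```

In this toy setting the accurate prior-guided learner receives all of the weight, while the misled one receives none, so the blend falls back to the prior-free learner. A fuller version would combine more than two learners and produce the out-of-fold predictions with an actual K-fold split.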

large language model, machine learning, statsformer

Country:
- Europe
  - Italy > Apulia
    - Bari (0.04)
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.04)
- North America
  - Canada > British Columbia (0.04)
  - United States
    - California
      - Alameda County > Berkeley (0.04)
      - Santa Clara County > Palo Alto (0.04)
    - Florida > Palm Beach County
      - Boca Raton (0.04)

Genre:
- Research Report > New Finding (0.46)

Industry:
- Health & Medicine
  - Pharmaceuticals & Biotechnology (1.00)
  - Therapeutic Area > Oncology (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
  - Natural Language > Large Language Model (1.00)