Not Yet AlphaFold for the Mind: Evaluating Centaur as a Synthetic Participant

Sabrina Namazova, Alessandra Brondetta, Younes Strittmatter, Matthew Nassar, Sebastian Musslick

arXiv.org Artificial Intelligence 

Simulators have revolutionized scientific practice across the natural sciences. By generating data that reliably approximate real-world phenomena, they enable scientists to accelerate hypothesis testing and optimize experimental designs [1, 2]. This is perhaps best illustrated by AlphaFold, a Nobel Prize-winning simulator in chemistry that predicts protein structures from amino acid sequences, enabling rapid prototyping of molecular interactions, drug targets, and protein functions [1]. In the behavioral sciences, a reliable participant simulator--a system capable of producing human-like behavior across cognitive tasks--would represent a similarly transformative advance [3]. Recently, Binz et al. introduced Centaur, a large language model (LLM) fine-tuned on human data from 160 experiments, proposing its use not only as a model of cognition but also as a participant simulator for "in silico prototyping of experimental studies" [4], e.g., to advance automated cognitive science [3, 5]. Although Centaur demonstrates strong predictive accuracy, its generative behavior--a critical criterion for a participant simulator--systematically diverges from human data. This suggests that, while Centaur is a significant step toward predicting human behavior, it does not yet meet the standards of a reliable participant simulator or an accurate model of cognition.

A core criterion for any behavioral simulator is its ability to generate the behavioral patterns observed in experiments.