DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking

arXiv.org Machine Learning

Frontier LLMs now perform strongly across a wide range of physics evaluations, but it is hard to disentangle genuine reasoning from recall of established science. We introduce DiscoverPhysics, an interactive benchmark that asks an LLM agent to discover the laws of motion of a simulated world whose physics deliberately deviates from our own. We construct 22 worlds governed by, among others, screened and fractional-power gravity, multi-species couplings, hidden dark-matter-like particles, non-coordinate-free physics, and time-varying interactions. Each world is generated on demand by an N-body simulator, for which the agent proposes several rounds of experiments, observes raw trajectory data, and ultimately submits both a natural-language explanation of the world's physics and a Python implementation of the inferred law. Because solving a world requires the agent to design informative experiments and revise its hypotheses, the benchmark probes long-horizon reasoning over an experimental history. We evaluate submissions along two complementary axes: trajectory MSE on held-out particles and an LLM-judged explanation score following an expert-written rubric assessing conceptual understanding of each world. Across eleven frontier models, we find that the strongest agents pass only half of the worlds and consistently fail on those where latent structure must be uncovered. Open-source models lag substantially behind commercial models, both in their ability to design informative experiments and in extracting conclusions from the data. We further find that good predictive accuracy does not guarantee high explanation quality and that conceptual understanding depends on hypothesis refinement through well-chosen experiments.
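To make the trajectory-MSE axis concrete, here is a minimal sketch of how a hypothesized law could be scored against a world with screened gravity. Everything in it is an assumption for illustration: the abstract does not give the simulator, the world definitions, or the exact metric, so the Yukawa-style force, the leapfrog integrator, and the names `screened_accel`, `simulate`, and `trajectory_mse` are hypothetical.

```python
import math

def screened_accel(positions, masses, G=1.0, screening_length=2.0, softening=1e-3):
    """2D accelerations from an assumed Yukawa potential -G*m*exp(-r/L)/r."""
    accels = []
    for i, (xi, yi) in enumerate(positions):
        ax = ay = 0.0
        for j, (xj, yj) in enumerate(positions):
            if i == j:
                continue
            dx, dy = xj - xi, yj - yi
            r = math.sqrt(dx * dx + dy * dy + softening ** 2)
            # Attractive magnitude: G*m_j*(1/r^2 + 1/(r*L))*exp(-r/L)
            mag = G * masses[j] * (1.0 / r ** 2 + 1.0 / (r * screening_length))
            mag *= math.exp(-r / screening_length)
            ax += mag * dx / r
            ay += mag * dy / r
        accels.append((ax, ay))
    return accels

def simulate(positions, velocities, masses, accel_fn, dt=0.01, steps=100):
    """Leapfrog (kick-drift-kick) integration; returns a list of position snapshots."""
    pos = [tuple(p) for p in positions]
    vel = [tuple(v) for v in velocities]
    acc = accel_fn(pos, masses)
    trajectory = [pos]
    for _ in range(steps):
        vel = [(vx + 0.5 * dt * ax, vy + 0.5 * dt * ay) for (vx, vy), (ax, ay) in zip(vel, acc)]
        pos = [(x + dt * vx, y + dt * vy) for (x, y), (vx, vy) in zip(pos, vel)]
        acc = accel_fn(pos, masses)
        vel = [(vx + 0.5 * dt * ax, vy + 0.5 * dt * ay) for (vx, vy), (ax, ay) in zip(vel, acc)]
        trajectory.append(pos)
    return trajectory

def trajectory_mse(predicted, reference):
    """Mean squared error over every coordinate of every particle at every step."""
    errors = [
        (p - r) ** 2
        for step_p, step_r in zip(predicted, reference)
        for point_p, point_r in zip(step_p, step_r)
        for p, r in zip(point_p, point_r)
    ]
    return sum(errors) / len(errors)

# Hypothetical world: screened gravity. Candidate hypothesis: plain Newtonian
# gravity, recovered here by letting the screening length go to infinity.
start_pos = [(0.0, 0.0), (1.0, 0.0)]
start_vel = [(0.0, 0.0), (0.0, 1.0)]
masses = [1.0, 1e-3]

def world_law(p, m):
    return screened_accel(p, m, screening_length=2.0)

def newton_law(p, m):
    return screened_accel(p, m, screening_length=math.inf)

true_traj = simulate(start_pos, start_vel, masses, world_law)
newton_traj = simulate(start_pos, start_vel, masses, newton_law)
print("MSE of Newtonian hypothesis:", trajectory_mse(newton_traj, true_traj))
```

A hypothesis that matches the world's law reproduces the reference trajectory and scores zero, while the Newtonian guess accumulates error as the orbit evolves. As the abstract notes, a low error of this kind does not by itself show that the explanation is right.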


Pay Attention to MLPs

Neural Information Processing Systems

Transformers [1] have become one of the most important architectural innovations in deep learning and have enabled many breakthroughs over the past few years. Here we propose a simple network architecture, gMLP, based on MLPs with gating, and show that it can perform as well as Transformers in key language and vision applications. Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy. For BERT, our model achieves parity with Transformers on pretraining perplexity and is better on some downstream NLP tasks. On finetuning tasks where gMLP performs worse, making the gMLP model substantially larger can close the gap with Transformers. In general, our experiments show that gMLP can scale as well as Transformers over increased data and compute.



specifications

Neural Information Processing Systems

This section contains additional details on the object specifications. As mentioned in Section 3, we rely on the PB language to define the structure for each object type that we would like to handle with our model. Our framework supports all basic constructions of the language, including nested messages and oneof clauses. For example, in Listing 1b, we can see that a generic Object can be either an entity or a constraint. We also use oneof for objects that may appear in several mutually exclusive configurations (e.g., CircleArcEntity represents both arcs and closed circles, and for the latter it does not make sense to specify end points). We handle such constructions by injecting an additional token with the discrete value set to the index of the active field.
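The oneof handling described above can be sketched roughly as follows. This is an illustrative assumption and not the authors' implementation: the `OneOf` wrapper, the path-style token names, and the field names `arc_params` and `circle_params` are hypothetical, and real messages would come from the PB definitions rather than plain dictionaries.

```python
from dataclasses import dataclass

@dataclass
class OneOf:
    options: tuple   # ordered field names declared in the oneof (hypothetical)
    active: str      # name of the field that is actually set
    value: object    # payload of the active field

def to_tokens(value, path=""):
    """Flatten a nested object into (field path, discrete value) tokens."""
    if isinstance(value, OneOf):
        # Extra token whose value is the index of the active field.
        yield (path + "/oneof", value.options.index(value.active))
        yield from to_tokens(value.value, f"{path}/{value.active}")
    elif isinstance(value, dict):
        # Nested message: recurse into each field in declaration order.
        for name, child in value.items():
            yield from to_tokens(child, f"{path}/{name}")
    else:
        # Leaf field holding a discrete value.
        yield (path, value)

# A closed circle: the circle configuration is active, so no end points appear.
circle = OneOf(
    options=("arc_params", "circle_params"),
    active="circle_params",
    value={"center_x": 3, "center_y": 4, "radius": 2},
)
obj = OneOf(options=("entity", "constraint"), active="entity", value={"circle_arc": circle})

tokens = list(to_tokens(obj))
for token in tokens:
    print(token)
```

Emitting the index before the payload lets a sequence model know which set of fields to expect next, which is one plausible reason for placing the extra token first.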







Learning to Solve SMT Formulas

Neural Information Processing Systems

We phrase the challenge of solving SMT formulas as a tree search problem where at each step a transformation is applied to the input formula until the formula is solved.
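The tree-search framing can be illustrated with a toy sketch. It is a stand-in under stated assumptions and not the paper's method: formulas are nested tuples over Boolean connectives instead of real SMT terms, the two rewrite rules are hypothetical, and plain breadth-first search replaces whatever search procedure the authors use.

```python
from collections import deque

def rewrite(formula, rule):
    """Apply `rule` bottom-up to every sub-term of a nested-tuple formula."""
    if isinstance(formula, tuple):
        formula = tuple(rewrite(arg, rule) for arg in formula)
    return rule(formula)

def double_negation(f):
    # ("not", ("not", x)) becomes x
    if isinstance(f, tuple) and f[0] == "not" and isinstance(f[1], tuple) and f[1][0] == "not":
        return f[1][1]
    return f

def contradiction(f):
    # ("and", a, ("not", a)) becomes False
    if isinstance(f, tuple) and f[0] == "and" and len(f) == 3:
        a, b = f[1], f[2]
        if b == ("not", a) or a == ("not", b):
            return False
    return f

def search(formula, transformations, is_solved, max_depth=5):
    """Breadth-first search over sequences of transformations."""
    queue = deque([(formula, [])])
    seen = {formula}
    while queue:
        current, path = queue.popleft()
        if is_solved(current):
            return current, path
        if len(path) >= max_depth:
            continue
        for name, rule in transformations:
            nxt = rewrite(current, rule)
            if nxt not in seen:  # skip states that were already reached
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None

# x AND NOT NOT NOT x: needs double-negation removal before the contradiction rule fires.
formula = ("and", "x", ("not", ("not", ("not", "x"))))
rules = [("double_negation", double_negation), ("contradiction", contradiction)]
result = search(formula, rules, is_solved=lambda f: isinstance(f, bool))
print(result)
```

Each node of the tree is a formula and each edge is one transformation, so a solution is a path from the input formula to a solved one. In this toy case the contradiction rule cannot fire on the input directly, which is why the search has to discover the two-step sequence.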