 frontier




The UK government is backing AI that can run its own lab experiments

MIT Technology Review

A competition calling for research projects involving so-called AI scientists shows just how fast this technology is moving. A number of startups and universities that are building "AI scientists" to design and run experiments in the lab, including robot biologists and chemists, have just won extra funding from the UK government agency that funds moonshot R&D. The competition, set up by ARIA (the Advanced Research and Invention Agency), gives a clear sense of how fast this technology is moving: The agency received 245 proposals from research teams that are already building tools capable of automating increasing amounts of lab work. ARIA defines an AI scientist as a system that can run an entire scientific workflow, coming up with hypotheses, designing and running experiments to test those hypotheses, and then analyzing the results. In many cases, the system may then feed those results back into itself and run the loop again and again. Human scientists become overseers, coming up with the initial research questions and then letting the AI scientist get on with the grunt work.


Exploring the Edges of Latent State Clusters for Goal-Conditioned Reinforcement Learning

Neural Information Processing Systems

Exploring unknown environments efficiently is a fundamental challenge in unsupervised goal-conditioned reinforcement learning. While selecting exploratory goals at the frontier of previously explored states is an effective strategy, the policy during training may still have limited capability of reaching rare goals on the frontier, resulting in reduced exploratory behavior. We propose Cluster Edge Exploration (CE$^2$), a new goal-directed exploration algorithm that when choosing goals in sparsely explored areas of the state space gives priority to goal states that remain accessible to the agent. The key idea is clustering to group states that are easily reachable from one another by the current policy under training in a latent space, and traversing to states holding significant exploration potential on the boundary of these clusters before doing exploratory behavior. In challenging robotics environments including navigating a maze with a multi-legged ant robot, manipulating objects with a robot arm on a cluttered tabletop, and rotating objects in the palm of an anthropomorphic robotic hand, CE$^2$ demonstrates superior efficiency in exploration compared to baseline methods and ablations.
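
The goal-selection idea above can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes latent states and cluster labels are already available, and it uses distance from the cluster centroid as a hypothetical proxy for "being on the edge" of a cluster. The names `centroid` and `edge_goals` are invented for illustration.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def edge_goals(latents, labels, per_cluster=1):
    """Pick, for each cluster, the members farthest from its centroid.

    Assumption: "far from the centroid" stands in for "on the cluster
    boundary"; the paper's actual criterion may differ.
    """
    clusters = {}
    for z, k in zip(latents, labels):
        clusters.setdefault(k, []).append(z)
    goals = {}
    for k, members in clusters.items():
        c = centroid(members)
        ranked = sorted(members, key=lambda z: math.dist(z, c), reverse=True)
        goals[k] = ranked[:per_cluster]
    return goals

# Toy latent states: two clusters, each with one outlying member.
latents = [(0, 0), (1, 0), (0, 1), (5, 5), (10, 10), (11, 10), (10, 13)]
labels = [0, 0, 0, 0, 1, 1, 1]
print(edge_goals(latents, labels))
```

In a full agent, the selected states would serve as goals for the goal-conditioned policy, with exploratory actions starting once a goal is reached.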


BEAVER: An Efficient Deterministic LLM Verifier

Suresh, Tarun, Wadhwa, Nalin, Banerjee, Debangshu, Singh, Gagandeep

arXiv.org Artificial Intelligence

As large language models (LLMs) transition from research prototypes to production systems, practitioners often need reliable methods to verify that model outputs satisfy required constraints. While sampling-based estimates provide an intuition of model behavior, they offer no sound guarantees. We present BEAVER, the first practical framework for computing deterministic, sound probability bounds on LLM constraint satisfaction. Given any prefix-closed semantic constraint, BEAVER systematically explores the generation space using novel token trie and frontier data structures, maintaining provably sound bounds at every iteration. We formalize the verification problem, prove soundness of our approach, and evaluate BEAVER on correctness verification, privacy verification, and secure code generation tasks across multiple state-of-the-art LLMs. BEAVER achieves 6 to 8 times tighter probability bounds and identifies 3 to 4 times more high-risk instances compared to baseline methods under identical computational budgets, enabling precise characterization and risk assessment that loose bounds or empirical evaluation cannot provide.
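
To make the frontier idea concrete, here is a minimal sketch of bounding a satisfaction probability by best-first expansion of prefixes. It is an illustration under stated assumptions, not BEAVER's implementation: `toy_model` is a hypothetical stand-in for an LLM's next-token distribution, `no_b` is a hypothetical prefix-closed constraint, and the empty prefix is assumed to satisfy the constraint. The lower bound is the mass of completed sequences that satisfy the constraint; the upper bound adds the mass still sitting on the frontier.

```python
import heapq
from itertools import count

EOS = "<eos>"

def toy_model(prefix):
    """Hypothetical next-token distribution; forces EOS after three tokens."""
    if len(prefix) >= 3:
        return {EOS: 1.0}
    return {"a": 0.5, "b": 0.3, EOS: 0.2}

def no_b(prefix):
    """Prefix-closed constraint: once a "b" appears, every extension fails."""
    return "b" not in prefix

def probability_bounds(model, satisfies, max_expansions):
    """Return (lower, upper) bounds on P(completed output satisfies constraint)."""
    lower = 0.0
    tie = count()  # tie-breaker so the heap never compares prefixes
    frontier = [(-1.0, next(tie), ())]  # max-heap on mass via negation
    for _ in range(max_expansions):
        if not frontier:
            break
        neg_mass, _order, prefix = heapq.heappop(frontier)
        for token, p in model(prefix).items():
            mass = -neg_mass * p
            if token == EOS:
                lower += mass  # prefix was on the frontier, so it satisfies
            elif satisfies(prefix + (token,)):
                heapq.heappush(frontier, (-mass, next(tie), prefix + (token,)))
            # violating prefixes are dropped: no extension can recover
    upper = lower + sum(-m for m, _, _ in frontier)
    return lower, upper

for budget in (0, 1, 10):
    print(budget, probability_bounds(toy_model, no_b, budget))
```

Because violating prefixes are pruned and unexplored mass is counted only in the upper bound, the interval stays valid at every budget and shrinks as more prefixes are expanded.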


WavefrontDiffusion: Dynamic Decoding Schedule for Improved Reasoning

Yang, Haojin, Hu, Rui, Sun, Zequn, Zhou, Rui, Cai, Yujun, Wang, Yiwei

arXiv.org Artificial Intelligence

Diffusion Language Models (DLMs) have shown strong potential for text generation and are becoming a competitive alternative to autoregressive models. The denoising strategy plays an important role in determining the quality of their outputs. Mainstream denoising strategies include Standard Diffusion and BlockDiffusion. Standard Diffusion performs global denoising without restricting the update range, often finalizing incomplete context and causing premature end-of-sequence predictions. BlockDiffusion updates fixed-size blocks in a preset order, but its rigid structure can break apart coherent semantic units and disrupt reasoning. We present WavefrontDiffusion, a dynamic decoding approach that expands a wavefront of active tokens outward from finalized positions. This adaptive process follows the natural flow of semantic structure while keeping computational cost equal to block-based methods. Across four benchmarks in reasoning and code generation, WavefrontDiffusion achieves state-of-the-art performance while producing outputs with higher semantic fidelity, showing the value of adaptive scheduling for more coherent and efficient generation.

Recent advances in large language models (LLMs) have achieved remarkable progress in complex reasoning and structured generation tasks such as mathematical problem solving and code synthesis (OpenAI et al., 2025; DeepSeek-AI et al., 2025). Autoregressive (AR) models remain the dominant paradigm for these tasks due to their stepwise logical consistency (Deletang et al., 2024). However, their strictly sequential nature introduces latency and limits flexibility, which can be problematic in settings that demand both accuracy and responsiveness, such as interactive assistants or real-time code generation. These limitations have motivated the exploration of alternative decoding paradigms that can balance quality, efficiency, and adaptability (Leviathan et al., 2023). Diffusion Language Models (DLMs) have recently emerged as a promising alternative by framing text generation as an iterative denoising process (Gong et al., 2025; Song et al., 2025).
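
The wavefront idea from the abstract can be sketched as a scheduling loop. This is a simplified illustration, not the paper's method: the `radius` and `per_step` parameters and the fixed `confidence` list are assumptions made here, whereas a real DLM would recompute token confidences after every denoising step.

```python
def wavefront(finalized, length, radius=1):
    """Unfinalized positions within `radius` of any finalized position."""
    return sorted(
        i for i in range(length)
        if i not in finalized
        and any(abs(i - j) <= radius for j in finalized)
    )

def decode_order(length, seeds, confidence, per_step=1, radius=1):
    """Return the positions finalized at each step.

    Assumption: `confidence[i]` is a fixed per-position score; it stands in
    for the model's per-step confidence, which is not modeled here.
    """
    if not seeds:
        raise ValueError("need at least one finalized seed position")
    finalized = set(seeds)
    schedule = []
    while len(finalized) < length:
        active = wavefront(finalized, length, radius)
        chosen = sorted(active, key=lambda i: -confidence[i])[:per_step]
        schedule.append(sorted(chosen))
        finalized.update(chosen)
    return schedule

# Position 0 is treated as already finalized (for example, the prompt boundary).
confidence = [0.9, 0.2, 0.8, 0.4, 0.7, 0.1]
print(decode_order(6, seeds=[0], confidence=confidence, per_step=1, radius=2))
# -> [[2], [4], [3], [1], [5]]
```

Unlike a fixed block order, the set of candidate positions here moves with whatever has been finalized so far, which is the contrast with BlockDiffusion that the abstract draws.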


Young children's anthropomorphism of an AI chatbot: Brain activation and the role of parent co-presence

Kim, Pilyoung, Chin, Jenna H., Xie, Yun, Brady, Nolan, Yeh, Tom, Yang, Sujin

arXiv.org Artificial Intelligence

Artificial Intelligence (AI) chatbots powered by a large language model (LLM) are entering young children's learning and play, yet little is known about how young children construe these agents or how such construals relate to engagement. We examined anthropomorphism of a social AI chatbot during collaborative storytelling and asked how children's attributions related to their behavior and prefrontal activation. Children at ages 5-6 (N = 23) completed three storytelling sessions: interacting with (1) an AI chatbot only, (2) a parent only, and (3) the AI and a parent together. After the sessions, children completed an interview assessing anthropomorphism toward both the AI chatbot and the parent. Behavioral engagement was indexed by the conversational turn count (CTC) ratio, and concurrent fNIRS measured oxygenated hemoglobin in bilateral vmPFC and dmPFC regions. Children reported higher anthropomorphism for parents than for the AI chatbot overall, although AI ratings were relatively high for perceptive abilities and epistemic states. Anthropomorphism was not associated with CTC. In the right dmPFC, higher perceptive scores were associated with greater activation during the AI-only condition and with lower activation during the AI+Parent condition. Exploratory analyses indicated that higher dmPFC activation during the AI-only condition correlated with higher end-of-session "scared" mood ratings. Findings suggest that stronger perceptive anthropomorphism can be associated with greater brain activation related to interpreting the AI's mental states, whereas parent co-presence may help some children interpret and regulate novel AI interactions. These results may have design implications for encouraging parent-AI co-use in early childhood.


The Download: AI and coding, and Waymo's aggressive driverless cars

MIT Technology Review

Plus: the FDA's newly appointed head drug regulator is out

AI has already transformed how code is written, but a new wave of autonomous systems promises to make the process even smoother and less error-prone. Amazon Web Services has just revealed three new "frontier" AI agents, its term for a more sophisticated class of autonomous agents capable of working for days at a time without human intervention. They remember previous sessions and continuously learn from a company's codebase. One of them, called Kiro, is designed to work independently without the need for a human to constantly point it in the right direction. Another, AWS Security Agent, scans a project for common vulnerabilities: an interesting development given that many AI-enabled coding assistants can end up introducing errors. To learn more about the exciting direction AI-enhanced coding is heading in, check out our team's reporting: Are we ready for what could happen next?


Data Curation Through the Lens of Spectral Dynamics: Static Limits, Dynamic Acceleration, and Practical Oracles

Zhang, Yizhou, Du, Lun

arXiv.org Artificial Intelligence

Large-scale neural models are increasingly trained with data pruning, synthetic data generation, cross-model distillation, reinforcement learning from human feedback (RLHF), and difficulty-based sampling. While several of these data-centric strategies reliably improve training efficiency and downstream performance, others fail to provide meaningful gains -- most notably self-generated synthetic data, which often increases dataset volume without enhancing model capability. We formalize data curation as reweighting the sampling distribution and map its effect onto the eigenstructure of the data-induced operator. Our first main result shows that static pruning induces a bounded operator and therefore cannot change the spectral tail exponent; it provides at most finite-region improvements and cannot alter asymptotic neural scaling. Our second result analyzes time-dependent data curation, showing that an ideal oracle capable of tracking spectral residuals and continuously re-normalizing the tail can provably accelerate learning -- although practical systems can only approximate this behavior.


A Rosetta Stone for AI Benchmarks

Ho, Anson, Denain, Jean-Stanislas, Atanasov, David, Albanie, Samuel, Shah, Rohin

arXiv.org Artificial Intelligence

Most AI benchmarks saturate within years or even months after they are introduced, making it hard to study long-run trends in AI capabilities. To address this challenge, we build a statistical framework that stitches benchmarks together, putting model capabilities and benchmark difficulties on a single numerical scale. This acts as a "Rosetta Stone", allowing us to compare models across a wide range of abilities and time, even if they are not evaluated on the same benchmarks. Moreover, this works without assuming how capabilities evolve across time or with training compute. We demonstrate three applications of this framework. First, we use it to measure the speed of AI progress over time, and to forecast future AI capabilities. Second, we estimate the rate of improvements in algorithmic efficiency, finding estimates that are higher than, but broadly consistent with, prior work. Finally, we find that our approach can be used to detect rapid accelerations in AI progress.
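
As a rough illustration of what "stitching" benchmarks onto one scale can mean, the sketch below assumes a simple logistic link, score = sigmoid(capability - difficulty). That link, the function name `stitch`, and the toy data are assumptions made here, not the paper's stated model. With noise-free scores the unknowns can be recovered by walking the model-benchmark graph outward from one anchor benchmark; with real, noisy data one would instead fit all parameters jointly (for example, by least squares or maximum likelihood).

```python
import math
from collections import deque

def logit(p):
    """Inverse of the logistic function, for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def stitch(scores, anchor):
    """Place models and benchmarks on one scale.

    scores: {(model, benchmark): accuracy strictly between 0 and 1}
    anchor: benchmark whose difficulty is pinned to 0 to fix the origin.
    Assumes logit(score) = capability - difficulty holds exactly.
    """
    capability, difficulty = {}, {anchor: 0.0}
    queue = deque([("bench", anchor)])
    while queue:
        kind, name = queue.popleft()
        for (m, b), s in scores.items():
            if kind == "bench" and b == name and m not in capability:
                capability[m] = logit(s) + difficulty[b]
                queue.append(("model", m))
            elif kind == "model" and m == name and b not in difficulty:
                difficulty[b] = capability[m] - logit(s)
                queue.append(("bench", b))
    return capability, difficulty

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# "old" and "new" never share a benchmark; "mid" links the two.
scores = {
    ("old", "easy"): sigmoid(0 - 0),
    ("mid", "easy"): sigmoid(2 - 0),
    ("mid", "hard"): sigmoid(2 - 3),
    ("new", "hard"): sigmoid(4 - 3),
}
print(stitch(scores, anchor="easy"))
```

The point of the toy data is that the oldest and newest models become comparable even though no benchmark was run on both, as long as some chain of shared evaluations connects them.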