HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models
Liu, Emmy, Gangal, Varun, Yu, Michael, Tao, Zhuofu, Singh, Karan, Kumar, Sachin, Feng, Steven Y.
Hallucination remains a central failure mode of large language models, but existing benchmarks operationalize it inconsistently across tasks such as summarization, question answering, retrieval-augmented generation, and agentic interaction. This fragmentation makes it unclear whether a mitigation that works in one setting actually reduces hallucinations across contexts. Current hallucination benchmarks either require human annotation and fixed references that may eventually be memorized, or rely on naturalistic observations often recorded in settings that are difficult to reproduce or test systematically. To enable further research on the root causes of hallucination, we introduce HALLUWORLD, an extensible benchmark framework grounded in an explicit reference-world formulation: a model hallucinates when it produces an observable claim that is false with respect to this reference world. Building on this view, we construct a family of synthetic and semi-synthetic benchmark environments in which the reference world is fully specified, the model's observable view is controlled, and hallucination labels can be generated automatically by construction. HALLUWORLD spans multiple settings that are classically representative for AI, i.e., gridworlds, chess, and realistic terminal tasks. This enables controlled variation of key factors such as world complexity, observability, temporal change, and source-conflict policy, allowing us to disentangle hallucinations into more fine-grained error categories. We evaluate frontier and open-weight language models across these settings and find consistent patterns across domains: perceptual hallucination on directly observed information is near-solved for frontier models, while multi-step state tracking and causal forward simulation are still difficult for frontier models, and are not generally solved by extended thinking.
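The reference-world definition in this abstract can be illustrated with a toy gridworld labeler. This is only a minimal sketch under stated assumptions, not the HALLUWORLD implementation: the `Claim` type, the `label_claim` function, and the three label names are hypothetical and chosen for illustration. The sketch assumes the reference world maps each object to a single cell and that the model's observable view is a set of visible cells.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple

Cell = Tuple[int, int]  # (row, col) position in a toy grid


@dataclass(frozen=True)
class Claim:
    """Hypothetical observable claim: 'object `obj` is at `cell`'."""
    obj: str
    cell: Cell


def label_claim(world: Dict[str, Cell], visible: Set[Cell], claim: Claim) -> str:
    """Label a claim against a fully specified reference world.

    The label names are illustrative assumptions, not the paper's taxonomy.
    """
    truth = world.get(claim.obj)  # None if the object does not exist
    if truth == claim.cell:
        return "correct"
    # The claim is false with respect to the reference world. Distinguish
    # whether the true location was inside the model's observable view.
    if truth in visible:
        return "hallucination_observed"
    return "hallucination_unobserved"


world = {"key": (0, 1), "door": (2, 2)}
visible = {(0, 0), (0, 1)}
claims = [Claim("key", (0, 1)), Claim("key", (1, 1)), Claim("door", (0, 0))]
labels = [label_claim(world, visible, c) for c in claims]
```

Because the world and the visible set are both explicit, labels follow mechanically from the data, which is the sense in which such labels can be generated "by construction".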
Robot Talk Episode 156 – Rugged robots for dangerous missions, with Gavin Kenneally
Gavin Kenneally is the Co-Founder and CEO of Ghost Robotics, a company that has gained a reputation for pushing the boundaries of legged robotics technology. In his current role, Gavin spearheads a team of highly skilled engineers and researchers who share his passion for creating advanced robotics systems. Previously, he was Head of Product at Ghost Robotics, responsible for the mechanical design of the company's flagship product: the Vision 60 Q-UGV. Gavin has a PhD in Mechanical Engineering from the University of Pennsylvania and has authored six academic papers. Robot Talk is a weekly podcast that explores the exciting world of robotics, artificial intelligence and autonomous machines.
Developing active and flexible microrobots
Leiden researchers Professor Daniela Kraft and Mengshi Wei have created microscopic robots that move without sensors, software, or external control. Instead, their behaviour emerges entirely from their shape and the way they interact with their environment. This class of robots opens up entirely new possibilities for biomedical applications. Inspiration to build these robots came from nature. Kraft: "Animals like worms and snakes constantly adapt their shape as they move, which helps them to navigate their environments. Macroscopic robots similarly use flexibility for their function. However, until now, microrobots were either small and rigid, or large and flexible. We wondered if we could realize small and flexible microrobots in our lab."
Is Big Brother watching you shop? – podcast
From supermarkets to corner shops, live facial recognition could be coming to retailers near you. Live facial recognition is being hailed as a powerful new frontier in the fight against crime, not only by police but by private companies too. Retailers from supermarkets to corner shops hope it will help them fight back against shoplifting. But the technology doesn't always get it right. With more police forces wanting to take up the technology, what could the consequences be?
Florida students boo graduation speaker who called AI 'next Industrial Revolution'
Real estate executive got an unexpected earful when she spoke of 'living in a time of profound change'. Though college graduations usually consist of a speaker giving advice to students, one recent ceremony featured students giving the speaker their opinions - loudly. The University of Central Florida's 2026 graduating class booed as a real estate development executive spoke about how "the rise of artificial intelligence is the next Industrial Revolution" and about "living in a time of profound change". The crowd of students was so loud that Gloria Caulfield paused, turned away from the podium and threw her hands up in the air. As the crowd calmed down, Caulfield proceeded. "Only a few years ago, AI was not a factor in our lives."
Design, Cups, and Blankets. A Free-Energy-Principle-Based Approach to Product Design
Classical design theory treats the type of an object as a given: the designer decides in advance that this will be a cup, then optimizes its parameters. This paper argues that object type is not a presupposition but an inference, something that can be determined from physical data and functional requirements jointly. We call this problem requirement-steered interface type inference and show that it is inexpressible within existing design frameworks. This paper makes two contributions that are jointly necessary and individually incomplete. The first is the problem itself, which classical design cannot pose because it presupposes the very thing our problem seeks to determine. The second is C-DMBD, a constrained extension of the Dynamic Markov Blanket Detection algorithm, which makes requirement-steered inference computationally tractable. Drawing on the free-energy principle and active inference, established frameworks in theoretical neuroscience and Bayesian mechanics, we model a product's surface as a Markov blanket: the minimal boundary through which all causal exchange between object and environment must pass. Different blanket structures correspond to different object types; different parameterizations of the same structure correspond to different functional modes of the same type. This paper is a proof of concept and a theoretical proposal. It reframes design as inference rather than optimization, and as a relation between generative models rather than a specification of parameters.
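The Markov blanket idea in this abstract is usually stated as a conditional-independence condition. As a hedged sketch in notation assumed here rather than taken from the paper, let \mu be the object's internal states, \eta the environment's external states, and b the blanket (surface) states:

```latex
% Assumed notation: \mu = internal states, \eta = external states, b = blanket states.
% Conditioned on the blanket, internal and external states are independent:
\[
  p(\mu, \eta \mid b) = p(\mu \mid b)\, p(\eta \mid b)
\]
% Equivalently, once b is known, \eta carries no further information about \mu:
\[
  p(\mu \mid \eta, b) = p(\mu \mid b)
\]
```

Read this way, "all causal exchange must pass through the blanket" means that any influence between \mu and \eta is mediated by b.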
EgoEnv: Human-centric environment representations from egocentric video
First-person video highlights a camera-wearer's activities in the context of their persistent environment. However, current video understanding approaches reason over visual features from short video clips that are detached from the underlying physical space and capture only what is immediately visible. To facilitate human-centric environment understanding, we present an approach that links egocentric video and the environment by learning representations that are predictive of the camera-wearer's (potentially unseen) local surroundings. We train such models using videos from agents in simulated 3D environments where the environment is fully observable, and test them on human-captured real-world videos from unseen environments. On two human-centric video tasks, we show that models equipped with our environment-aware features consistently outperform their counterparts with traditional clip features. Moreover, despite being trained exclusively on simulated videos, our approach successfully handles real-world videos from HouseTours and Ego4D, and achieves state-of-the-art results on the Ego4D NLQ challenge.