Privacy Reasoning in Ambiguous Contexts

Neural Information Processing Systems

We study the ability of language models to reason about appropriate information disclosure--a central aspect of the evolving field of agentic privacy. Whereas previous works have focused on evaluating a model's ability to align with human decisions, we examine the role of ambiguity and missing context on model performance when making information-sharing decisions. We identify context ambiguity as a crucial barrier for high performance in privacy assessments. By designing Camber, a framework for context disambiguation, we show that model-generated decision rationales can reveal ambiguities and that systematically disambiguating context based on these rationales leads to significant accuracy improvements (up to 13.3% in precision and up to 22.3% in recall) as well as reductions in prompt sensitivity. Overall, our results indicate that approaches for context disambiguation are a promising way forward to enhance agentic privacy reasoning.


ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering

Neural Information Processing Systems

How well do AI systems perform in algorithm engineering for hard optimization problems in domains such as package-delivery routing, crew scheduling, factory production planning, and power-grid balancing? We introduce ALE-Bench, a new benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench presents optimization problems that are computationally hard and admit no known exact solution. Unlike short-duration, pass/fail coding benchmarks, ALE-Bench encourages iterative solution refinement over long time horizons. Our software framework supports interactive agent architectures that leverage test-run feedback and visualizations. Our evaluation of frontier LLMs revealed that while they demonstrate high performance on specific problems, a notable gap remains compared to humans in terms of consistency across problems and long-horizon problem-solving capabilities. This highlights the need for this benchmark to foster future AI advancements.



c42c8d51556fabb4b57fc86d3d3d0d09-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems

QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks? Large language models (LLMs) are increasingly being applied to reasoning tasks such as math, logic, and planning/coding. These applications typically assume all necessary information is provided, but real-world scenarios often violate this assumption: users may omit crucial details in math problems, and robots might operate in environments with partial observability. In such cases, LLMs need the ability to proactively ...


AutoToM: Scaling Model-based Mental Inference via Automated Agent Modeling

Neural Information Processing Systems

Theory of Mind (ToM), the ability to understand people's minds based on their behavior, is key to developing socially intelligent agents. Current approaches to ToM reasoning either rely on prompting Large Language Models (LLMs), which are prone to systematic errors, or use handcrafted, rigid agent models for model-based inference, which are more robust but fail to generalize across domains. In this work, we introduce AutoToM, an automated agent modeling method for scalable, robust, and interpretable mental inference. Given a ToM problem, AutoToM first proposes an initial agent model and then performs automated Bayesian inverse planning based on this model, leveraging an LLM backend.
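To make the idea of Bayesian inverse planning concrete, here is a minimal toy sketch in Python. It is not AutoToM's implementation: the goal names, the hand-written utilities, the softmax-rational action model, and the `beta` rationality parameter are all illustrative assumptions. In particular, the hard-coded numbers stand in for whatever an LLM backend would supply, which the abstract does not detail.

```python
import math

def softmax_policy(utilities, beta=2.0):
    """Action probabilities proportional to exp(beta * utility).

    Assumes a noisily rational agent; beta is a hypothetical rationality knob.
    """
    weights = {a: math.exp(beta * u) for a, u in utilities.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def infer_goal(prior, utilities_by_goal, observed_actions, beta=2.0):
    """Posterior over goals: P(goal | actions) is proportional to
    P(goal) times the product of P(action | goal) over observed actions."""
    unnormalized = {}
    for goal, p_goal in prior.items():
        policy = softmax_policy(utilities_by_goal[goal], beta)
        likelihood = 1.0
        for action in observed_actions:
            likelihood *= policy[action]
        unnormalized[goal] = p_goal * likelihood
    evidence = sum(unnormalized.values())
    return {g: v / evidence for g, v in unnormalized.items()}

# Hypothetical two-goal example: which drink does the agent want?
prior = {"wants_coffee": 0.5, "wants_tea": 0.5}
utilities_by_goal = {
    "wants_coffee": {"walk_to_cafe": 1.0, "walk_to_teahouse": 0.0},
    "wants_tea": {"walk_to_cafe": 0.0, "walk_to_teahouse": 1.0},
}
posterior = infer_goal(prior, utilities_by_goal, ["walk_to_cafe"])
```

Observing the agent walk to the cafe shifts the posterior toward `wants_coffee`. Inverse planning inverts a forward model of how goals produce actions, so the quality of the inference depends on the quality of that agent model.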


impacts

Neural Information Processing Systems

The primary goal of PACBench is to catalyze the development of more capable, reliable, and physically grounded VLMs and their fine-tuned variants, often called VLAs, for real-world robotic applications. Because VLA fine-tuning typically relies on low-level trajectory data rather than higher-level reasoning, probing the underlying VLM's understanding of object Properties, action Affordances, and physical Constraints (PAC) gives us a grounded lens into the capabilities that downstream robotic policies will inherit. By diagnosing PAC weaknesses in the base model with a fine-grained diagnostic tool, researchers and developers can distinguish whether a VLA's performance stems from genuine physical common sense or simply memorized motion patterns, and thus guide targeted improvements in model architectures, training methodologies, and dataset curation. In doing so, PACBench helps ensure that robotic systems become more predictable, less prone to errors from a lack of physical understanding, better able to perform a wide range of useful tasks, and better equipped for safe, effective collaboration in complex, everyday environments. The open release of our benchmark and its diverse data sources (including web-scale images, real-world humanoid captures, and simulated scenarios) is intended to foster broad community engagement and accelerate progress in this crucial area of AI. While any advancement in AI capabilities warrants ongoing consideration of its societal implications, our work focuses on enhancing the fundamental understanding and robustness of AI systems, which we see as a positive step towards more responsible AI development.


9ecafb09de180aaad7b7205be7eb24a4-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems

Vision-Language Models (VLMs) are increasingly pivotal for generalist robot manipulation, enabling tasks such as physical reasoning, policy generation, and failure detection. However, their proficiency in these high-level applications often assumes a deep understanding of low-level physical prerequisites, a capability that is largely unverified. To perform actions reliably, robots must comprehend intrinsic object properties (e.g., material, weight), action affordances (e.g., graspable, stackable), and physical constraints (e.g., stability, reachability, or an object's state like being closed). Despite their ubiquitous use in manipulation, we argue that off-the-shelf VLMs may lack this granular, physically grounded understanding, as these specific prerequisites are often overlooked during training. Addressing this critical gap, we introduce PACBench, a comprehensive benchmark designed to systematically evaluate VLMs on their understanding of these core Properties, Affordances, and Constraints (PAC) from a task executability perspective. PACBench features a diverse dataset with more than 30,000 annotations, comprising 673 real-world images (115 object classes, 15 property types, 1-3 affordances defined per object class), 100 real-world humanoid view scenarios, and 120 unique simulated constraint scenarios across four tasks. Our evaluations reveal significant gaps in the ability of VLMs to grasp fundamental physical concepts, underscoring their current limitations for reliable robot manipulation and pointing to key areas that require targeted research. PACBench also serves as a standardized benchmark for rigorously evaluating the physical reasoning capabilities of VLMs, guiding the development of more robust and physically grounded models for robot manipulation.


NOVA: A Benchmark for Rare Anomaly Localization and Clinical Reasoning in Brain MRI

Neural Information Processing Systems

In many real-world applications, deployed models encounter inputs that differ from the data seen during training. Open-world recognition ensures that such systems remain robust as ever-emerging, previously unknown categories appear and must be addressed without retraining. Foundation and vision-language models are pretrained on large and diverse datasets with the expectation of broad generalization across domains, including medical imaging. However, benchmarking these models on test sets with only a few common outlier types silently collapses the evaluation back to a closed-set problem, masking failures on rare or truly novel conditions encountered in clinical use. We therefore present NOVA, a challenging, real-life evaluation-only benchmark of 900 brain MRI scans that span 281 rare pathologies and heterogeneous acquisition protocols. Each case includes rich clinical narratives and double-blinded expert bounding-box annotations. Together, these enable joint assessment of anomaly localisation, visual captioning, and diagnostic reasoning. Because NOVA is never used for training, it serves as an extreme stress-test of out-of-distribution generalisation: models must bridge a distribution gap both in sample appearance and in semantic space.


LEXICON: a Benchmark for Planning under Temporal Constraints in Natural Language

Neural Information Processing Systems

Owing to their reasoning capabilities, large language models (LLMs) have been evaluated on planning tasks described in natural language. However, LLMs have largely been tested on planning domains without constraints. In order to deploy them in real-world settings where adherence to constraints, in particular safety constraints, is critical, we need to evaluate their performance on constrained planning tasks. We introduce LEXICON--a natural language-based (LEXI) constrained (CON) planning benchmark, consisting of a suite of environments, that can be used to evaluate the planning capabilities of LLMs in a principled fashion. The core idea behind LEXICON is to take existing planning environments and impose temporal constraints on the states.
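As a rough illustration of what imposing temporal constraints on states can mean, the Python sketch below checks a plan's state trajectory against three constraint forms. The constraint types (`always`, `sometime`, `sometime_before`), the state fields, and the example trajectory are assumptions made for illustration; the abstract does not specify which constraint forms or environments LEXICON actually uses.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Predicate = Callable[[State], bool]

def always(trajectory: List[State], pred: Predicate) -> bool:
    """The predicate must hold in every state visited by the plan."""
    return all(pred(s) for s in trajectory)

def sometime(trajectory: List[State], pred: Predicate) -> bool:
    """The predicate must hold in at least one visited state."""
    return any(pred(s) for s in trajectory)

def sometime_before(trajectory: List[State], first: Predicate, then: Predicate) -> bool:
    """Whenever `then` holds, `first` must have held in a strictly earlier state."""
    seen_first = False
    for state in trajectory:
        if then(state) and not seen_first:
            return False
        if first(state):
            seen_first = True
    return True

# Hypothetical trajectory produced by executing a candidate plan.
trajectory = [
    {"has_key": False, "door_open": False, "battery": 90},
    {"has_key": True, "door_open": False, "battery": 80},
    {"has_key": True, "door_open": True, "battery": 70},
]

battery_safe = always(trajectory, lambda s: s["battery"] > 20)
door_opened = sometime(trajectory, lambda s: s["door_open"])
key_before_door = sometime_before(
    trajectory,
    first=lambda s: s["has_key"],
    then=lambda s: s["door_open"],
)
```

A plan that reaches its goal can still be rejected under such checks, for example if it opens the door before the key is ever held. A constrained-planning evaluation exposes this distinction where an unconstrained one does not.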