Goto

Collaborating Authors

 Agents


Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents

arXiv.org Artificial Intelligence

Multimodal Large Language Models (MLLMs) show promising results as decision-making engines for embodied agents operating in complex, physical environments. However, existing benchmarks often prioritize high-level planning or spatial reasoning, leaving the fine-grained action intelligence required for embodied physical interaction un-derexplored. T o address this gap, we introduce CFG-Bench, a new benchmark designed to systematically evaluate this crucial capability. CFG-Bench consists of 1,368 curated videos paired with 19,562 three-modalities question-answer pairs targeting four cognitive abilities: 1) Physical Interaction, 2) T emporal-Causal Relation, 3) Intentional Understanding, and 4) Evaluative Judgment. T ogether, these dimensions provide a systematic framework for assessing a model's ability to translate visual observations into actionable knowledge, moving beyond mere surface-level recognition. Our comprehensive evaluation on CFG-Bench reveals that leading MLLMs struggle to produce detailed instructions for physical interactions and exhibit profound limitations in the higher-order reasoning of intention and evaluation. Moreover, supervised fine-tuning (SFT) on our data demonstrates that teaching an MLLMs to articulate fine-grained actions directly translates to significant performance gains on established embodied benchmarks. Our analysis highlights these limitations and offers insights for developing more capable and grounded embodied agents.


Multi-Agent Code Verification via Information Theory

arXiv.org Artificial Intelligence

LLMs generate buggy code: 29.6% of SWE-bench solved patches fail, 62% of BaxBench solutions have vulnerabilities, and existing tools only catch 65% of bugs with 35% false positives. We built CodeX-Verify, a multi-agent system that uses four specialized agents to detect different types of bugs. We prove mathematically that combining agents with different detection patterns finds more bugs than any single agent when the agents look for different problems, using submodularity of mutual information under conditional independence. Measuring agent correlation of rho = 0.05 to 0.25 confirms they detect different bugs. Testing on 99 code samples with verified labels shows our system catches 76.1% of bugs, matching the best existing method (Meta Prompt Testing: 75%) while running faster and without test execution. We tested all 15 agent combinations and found that using multiple agents improves accuracy by 39.7 percentage points (from 32.8% to 72.4%) compared to single agents, with diminishing returns of +14.9pp, +13.5pp, and +11.2pp for agents 2, 3, and 4, validating our theoretical model. The best two-agent combination (Correctness + Performance) reaches 79.3% accuracy. Testing on 300 real patches from Claude Sonnet 4.5 runs in under 200ms per sample, making this practical for production use.


AudAgent: Automated Auditing of Privacy Policy Compliance in AI Agents

arXiv.org Artificial Intelligence

AI agents can autonomously perform tasks and, often without explicit user consent, collect or disclose users' sensitive local data, which raises serious privacy concerns. Although AI agents' privacy policies describe their intended data practices, there remains limited transparency and accountability about whether runtime behavior matches those policies. To close this gap, we introduce AudAgent, a visual tool that continuously monitors AI agents' data practices in real time and guards compliance with stated privacy policies. AudAgent consists of four components for automated privacy auditing of AI agents. (i) Policy formalization: a novel cross-LLM voting mechanism to guarantee confidence of the parsed privacy policy model. (ii) Runtime annotation: a lightweight Presidio-based analyzer detects sensitive data and annotates data practices based on the AI agent's context and the privacy policy model. (iii) Compliance auditing: ontology graphs and automata-based checking connect the privacy policy model with runtime annotations, enabling on-the-fly compliance checking. (iv) User interface: an infrastructure-independent implementation visualizes the real-time execution trace of AI agents along with potential privacy policy violations, providing user-friendly transparency and accountability. We evaluate AudAgent with AI agents built using mainstream frameworks, demonstrating its effectiveness in detecting and visualizing privacy policy violations in real time. Using AudAgent, we also find that most privacy policies omit explicit safeguards for highly sensitive data such as SSNs, whose misuse violates legal requirements, and that many agents do not refuse handling such data via third-party tools, including those controlled by Claude, Gemini, and DeepSeek. AudAgent proactively blocks operations on such data, overriding the agents' original privacy policy and behavior.


Let the Trial Begin: A Mock-Court Approach to Vulnerability Detection using LLM-Based Agents

arXiv.org Artificial Intelligence

Detecting vulnerabilities in source code remains a critical yet challenging task, especially when benign and vulnerable functions share significant similarities. In this work, we introduce VulTrial, a courtroom-inspired multi-agent framework designed to identify vulnerable code and to provide explanations. It employs four role-specific agents, which are security researcher, code author, moderator, and review board. Using GPT-4o as the base LLM, VulTrial almost doubles the efficacy of prior best-performing baselines. Additionally, we show that role-specific instruction tuning with small quantities of data significantly further boosts VulTrial's efficacy. Our extensive experiments demonstrate the efficacy of VulTrial across different LLMs, including an open-source, in-house-deployable model (LLaMA-3.1-8B), as well as the high quality of its generated explanations and its ability to uncover multiple confirmed zero-day vulnerabilities in the wild.


What Happens When Your Coworkers Are AI Agents

WIRED

In this episode of, we talk to writer Evan Ratliff about how he created a small startup made entirely of AI employees--and what his findings reveal about the reality of an agentic future. This year, AI agents have been at the forefront of tech companies' ambitions. OpenAI's Sam Altman has often talked about a possible billion-dollar company being spun up with just one human and an army of AI agents. And so last summer, journalist Evan Ratliff decided to try to become that unicorn himself--by creating HarumoAI, a small startup that's made up of AI employees and executives. Hosts Michael Calore and Lauren Goode sit down with Evan to discuss how it's going, and the current promises and realities of AI agents. Write to us at uncannyvalley@wired.com . You can always listen to this week's podcast through the audio player on this page, but if you want to subscribe for free to get every episode, here's how: If you're on an iPhone or iPad, open the app called Podcasts, or just tap this link . Hey, Lauren, how are you doing? It was so fantastic that I had a hard time coming back, honestly. And I saw a lot of really beautiful art. Not a bad place to go for vacation, I have to say. I've heard this before, I confirmed it. And after seeing so much incredible art and just people doing stuff with their hands and tangible goods, I was like, I don't want to go back to the world of AI. I didn't want to go back to sitting in a coffee shop and hearing everyone pitching their AI startups and driving on the 101 and seeing the billboards. I was just like, What? No, keep me in the land of Burrata and Caravaggio. Well, Lauren, I'm sorry to tell you that you came back on the show just in time to talk about AI agents. It's something that we've talked about a lot this year and our listeners have heard about it a lot, and we're not sick of talking about it.


Teaching robot policies without new demonstrations: interview with Jiahui Zhang and Jesse Zhang

Robohub

In their paper ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations, which was presented at CoRL 2025, and introduce a framework for learning robot manipulation tasks solely from language instructions without per-task demonstrations. We asked Jiahui Zhang and Jesse Zhang to tell us more. What is the topic of the research in your paper, and what problem were you aiming to solve? Our research addresses the problem of enabling robot manipulation policies to solve novel, language-conditioned tasks without collecting new demonstrations for each task. We begin with a small set of demonstrations in the deployment environment, train a language-conditioned reward model on them, and then use that learned reward function to fine-tune the policy on unseen tasks, with no additional demonstrations required.


E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing

arXiv.org Machine Learning

Agentic AI systems execute a sequence of actions, such as reasoning steps or tool calls, in response to a user prompt. To evaluate the success of their trajectories, researchers have developed verifiers, such as LLM judges and process-reward models, to score the quality of each action in an agent's trajectory. Although these heuristic scores can be informative, there are no guarantees of correctness when used to decide whether an agent will yield a successful output. Here, we introduce e-valuator, a method to convert any black-box verifier score into a decision rule with provable control of false alarm rates. We frame the problem of distinguishing successful trajectories (that is, a sequence of actions that will lead to a correct response to the user's prompt) and unsuccessful trajectories as a sequential hypothesis testing problem. E-valuator builds on tools from e-processes to develop a sequential hypothesis test that remains statistically valid at every step of an agent's trajectory, enabling online monitoring of agents over arbitrarily long sequences of actions. Empirically, we demonstrate that e-valuator provides greater statistical power and better false alarm rate control than other strategies across six datasets and three agents. We additionally show that e-valuator can be used for to quickly terminate problematic trajectories and save tokens. Together, e-valuator provides a lightweight, model-agnostic framework that converts verifier heuristics into decisions rules with statistical guarantees, enabling the deployment of more reliable agentic systems.


Local Dominance in Mixed-Strength Populations -- Fast Maximal Independent Set

arXiv.org Artificial Intelligence

In many natural and engineered systems, agents interact through local contests that determine which individuals become dominant within their neighborhoods. These interactions are shaped by inherent differences in strength, and they often lead to stable dominance patterns that emerge surprisingly quickly relative to the size of the population. This motivates the search for simple mathematical models that capture both heterogeneous agent strength and rapid convergence to stable local dominance. A widely studied abstraction of local dominance is the Maximal Independent Set (MIS) problem. In the Luby MIS protocol that provably converges quickly to an MIS, each agent repeatedly generates a strength value chosen uniformly and becomes locally dominant if its value is smaller than those of its neighbors. This provides a theoretical explanation for fast dominance convergence in populations of equal-strength agents and naturally raises the question of whether fast convergence also holds in the more realistic setting where agents are inherently mixed-strength. To investigate this question, we introduce the mixed-strength agents model, in which each agent draws its strength from its own distribution. We prove that the extension of the Luby MIS protocol where each agent repeatedly generates a strength value from its own distribution still exhibits fast dominance convergence, providing formal confirmation of the rapid convergence observed in many mixed-strength natural processes. We also show that heterogeneity can significantly change the dynamics of the process. In contrast to the equal-strength setting, a constant fraction of edges need not be eliminated per round. We construct a population and strength profile in which progress per round is asymptotically smaller, illustrating how inherent strength asymmetry produces qualitatively different global behavior.


DIQ-H: Evaluating Hallucination Persistence in VLMs Under Temporal Visual Degradation

arXiv.org Artificial Intelligence

Vision-Language Models (VLMs) deployed in safety-critical applications such as autonomous driving must handle continuous visual streams under imperfect conditions. However, existing benchmarks focus on static, high-quality images and ignore temporal degradation and error propagation, which are critical failure modes where transient visual corruption induces hallucinations that persist across subsequent frames. We introduce DIQ-H, the first benchmark for evaluating VLM robustness under dynamic visual degradation in temporal sequences. DIQ-H applies physics-based corruptions including motion blur, sensor noise, and compression artifacts, and measures hallucination persistence, error recovery, and temporal consistency through multi-turn question-answering tasks. To enable scalable annotation, we propose Uncertainty-Guided Iterative Refinement (UIR), which generates reliable pseudo-ground-truth using lightweight VLMs with uncertainty filtering, achieving a 15.3 percent accuracy improvement. Experiments on 16 state-of-the-art VLMs reveal substantial robustness gaps: even advanced models such as GPT-4o achieve only a 78.5 percent recovery rate, while open-source models struggle with temporal consistency at less than 60 percent. DIQ-H provides a comprehensive platform for evaluating VLM reliability in real-world deployments.


Benchmark for Planning and Control with Large Language Model Agents: Blocksworld with Model Context Protocol

arXiv.org Artificial Intelligence

Industrial automation increasingly requires flexible control strategies that can adapt to changing tasks and environments. Agents based on Large Language Models (LLMs) offer potential for such adaptive planning and execution but lack standardized benchmarks for systematic comparison. We introduce a benchmark with an executable simulation environment representing the Blocksworld problem providing five complexity categories. By integrating the Model Context Protocol (MCP) as a standardized tool interface, diverse agent architectures can be connected to and evaluated against the benchmark without implementation-specific modifications. A single-agent implementation demonstrates the benchmark's applicability, establishing quantitative metrics for comparison of LLM-based planning and execution approaches.