Multi-agent Undercover Gaming: Hallucination Removal via Counterfactual Test for Multimodal Reasoning

Liang, Dayong, Wei, Xiao-Yong, Zheng, Changmeng

arXiv.org Artificial Intelligence

Hallucination continues to pose a major obstacle in the reasoning capabilities of large language models (LLMs). Although the Multi-Agent Debate (MAD) paradigm offers a promising solution by promoting consensus among multiple agents to enhance reliability, it relies on the unrealistic assumption that all debaters are rational and reflective, which is a condition that may not hold when agents themselves are prone to hallucinations. To address this gap, we introduce the Multi-agent Undercover Gaming (MUG) protocol, inspired by social deduction games like "Who is Undercover?". MUG reframes MAD as a process of detecting "undercover" agents (those suffering from hallucinations) by employing multimodal counterfactual tests. Specifically, we modify reference images to introduce counterfactual evidence and observe whether agents can accurately identify these changes, providing ground-truth for identifying hallucinating agents and enabling robust, crowd-powered multimodal reasoning. MUG advances MAD protocols along three key dimensions: (1) enabling factual verification beyond statistical consensus through counterfactual testing; (2) introducing cross-evidence reasoning via dynamically modified evidence sources instead of relying on static inputs; and (3) fostering active reasoning, where agents engage in probing discussions rather than passively answering questions. Collectively, these innovations offer a more reliable and effective framework for multimodal reasoning in LLMs. The source code can be accessed at https://github.com/YongLD/MUG.git.
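The counterfactual test at the heart of MUG can be illustrated with a minimal sketch. This is our own illustrative pseudologic (function names and the string-matching check are assumptions, not the authors' released API): an agent whose answer does not change when the reference image is edited, or which fails to mention the introduced change, is flagged as a likely "undercover" (hallucinating) agent.

```python
def counterfactual_test(agents, original_answers, edited_answers, edit_fact):
    """Flag agents whose answers do not track a known edit to the evidence.

    agents: list of agent ids
    original_answers / edited_answers: dict mapping agent id -> answer string
    edit_fact: the ground-truth change introduced into the reference image
    """
    undercover = []
    for a in agents:
        noticed_change = edited_answers[a] != original_answers[a]
        mentions_edit = edit_fact in edited_answers[a]
        if not (noticed_change and mentions_edit):
            undercover.append(a)  # failed the counterfactual probe
    return undercover

flagged = counterfactual_test(
    ["agent_1", "agent_2"],
    {"agent_1": "a red mug", "agent_2": "a red mug"},
    {"agent_1": "a blue mug", "agent_2": "a red mug"},  # image edited red -> blue
    "blue",
)
```

Because the edit is injected by the protocol itself, the test yields a ground-truth signal, unlike consensus voting, which can be swayed by a majority of hallucinating agents.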


Appendix A.1 Additional Method Justification

Neural Information Processing Systems

This problem has been studied in stochastic optimal control, particularly REPS [Peters et al., 2010]. In our experiments, we use soft actor-critic [Haarnoja et al., 2018] as our base RL algorithm. The policy and critic networks are MLPs with 2 fully-connected hidden layers of size 256. Following [Sharma et al., 2021b], we use a biased TD update. For all experiments using prior data collected through RL, the agent was initialized at test time with the pretrained policy and critic. The details for this environment are in [Sharma et al., 2021b].
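The network shape described above (an MLP with 2 fully-connected hidden layers of size 256) can be sketched in a few lines. This is a minimal pure-Python illustration of that architecture, not the authors' code; the observation/action dimensions and initialization scheme are placeholder assumptions.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    # Uniform fan-in initialization (an assumption; any standard init works here).
    bound = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(x, layers):
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]  # ReLU on the two hidden layers only
    return x

rng = random.Random(0)
obs_dim, act_dim, hidden = 8, 2, 256  # obs/act dims are illustrative
critic = [init_layer(obs_dim + act_dim, hidden, rng),
          init_layer(hidden, hidden, rng),
          init_layer(hidden, 1, rng)]  # scalar Q-value head
q_value = forward([0.1] * (obs_dim + act_dim), critic)
```

The policy network follows the same two-hidden-layer pattern, differing only in its output head.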


Our Favorite Travel and Outdoor Gear Is on Sale at Huckberry

WIRED

Huckberry's eclectic curation of travel clothing, coffee gear, and backpacks is on sale right now. Huckberry, purveyor of finely curated clothing and gear for the sort of person equally at home in the woods and the city, is having one of the company's rare site-wide sales this week--or pretty close to site-wide. We've tested and love quite a bit of Huckberry's stuff, especially the Proof 72-hour merino T-shirt. If you buy nothing else this year, buy that. Check out the other deals, which we've rounded up below.



Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents

Gao, Heyang, Sun, Zexu, Min, Erxue, Cai, Hengyi, Wang, Shuaiqiang, Yin, Dawei, Chen, Xu

arXiv.org Artificial Intelligence

Large Language Models (LLMs) as autonomous agents are increasingly tasked with solving complex, long-horizon problems. Aligning these agents via preference-based offline methods like Direct Preference Optimization (DPO) is a promising direction, yet it faces a critical granularity mismatch. Trajectory-level DPO provides a signal that is too coarse for precise credit assignment, while step-level DPO is often too myopic to capture the value of multi-step behaviors. To resolve this challenge, we introduce Hierarchical Preference Learning (HPL), a hierarchical framework that optimizes LLM agents by leveraging preference signals at multiple, synergistic granularities. While HPL incorporates trajectory- and step-level DPO for global and local policy stability, its core innovation lies in group-level preference optimization guided by a dual-layer curriculum. Our approach first decomposes expert trajectories into semantically coherent action groups and then generates contrasting suboptimal groups to enable preference learning at a fine-grained, sub-task level. Then, instead of treating all preference pairs equally, HPL introduces a curriculum scheduler that organizes the learning process from simple to complex. This curriculum is structured along two axes: the group length, representing sub-task complexity, and the sample difficulty, defined by the reward gap between preferred and dispreferred action groups. Experiments on three challenging agent benchmarks show that HPL outperforms existing state-of-the-art methods. Our analyses demonstrate that the hierarchical DPO loss effectively integrates preference signals across multiple granularities, while the dual-layer curriculum is crucial for enabling the agent to solve a wide range of tasks, from simple behaviors to complex multi-step sequences.
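The dual-layer curriculum described in the abstract can be sketched as a simple ordering rule. This is our own illustrative reading (field names and the tie-break direction are assumptions, not the paper's code): preference pairs are scheduled from simple to complex by group length, and, within a length, from easy to hard, where a larger reward gap between the preferred and dispreferred group means an easier sample.

```python
def curriculum_order(pairs):
    """Order group-level preference pairs from simple/easy to complex/hard.

    pairs: list of dicts with keys
      "group_len"  -- number of actions in the group (sub-task complexity)
      "reward_gap" -- reward difference, preferred minus dispreferred group
    """
    # Shorter groups first; within a length, larger gaps (easier pairs) first.
    return sorted(pairs, key=lambda p: (p["group_len"], -p["reward_gap"]))

pairs = [
    {"id": "a", "group_len": 4, "reward_gap": 0.9},
    {"id": "b", "group_len": 2, "reward_gap": 0.1},
    {"id": "c", "group_len": 2, "reward_gap": 0.8},
]
order = [p["id"] for p in curriculum_order(pairs)]
```

Under this rule the agent sees short, clearly separated sub-task pairs before long, ambiguous ones, which is the intuition behind the two curriculum axes.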


A Task Details

Neural Information Processing Systems

Table 5: All task variations except shape used in VLMbench.

Table 6: All object models used in VLMbench.

Object type      | Number of classes | Classes
Basic model      | 3 | cube (1), triangular prism (1), cylinder (1)
Special model    | 9 | star (1), moon (1), cross (1), flower (1), letter 't' (1), pencil (1), basket (1), box container (1), shape sorter (1)
Planar model     | 6 | rectangle (1), circle (1), triangle (1), star (1), cross (1), flower (1)
Functional model | 2 | mug (6), sponge (1)
Articulated model | 2 | door with one rotatable handle (2), cabinet with three vertical drawers (3)

In VLMbench, we show eight task categories, including "Pick & Place objects", "Stack objects", "Drop ...". When building an instance-level task with one variation, the other variations will also randomly change (for example, in the demonstrations of "Pick & Place objects"). In the dataset, we have five types of objects, shown in Table 6. Visualizations can be found on the project website. The object can be placed anywhere with any orientation inside the container. When the detector is triggered, the task is considered a success.

Instruction templates. High-level instructions: "Pick up [target object description] and place it into [target container description]." Low-level instructions: "Move to the top of [target object description]"; "Move the object into [target container description]".

Variations and scene settings. All objects randomly change colors, sizes, and positions in each demonstration.
Color: There are two same-shape objects and two same-shape containers in the scene initialization. All colors are randomly sampled from the color library. The object description is "[color] object"; the container description is "[color] container."
Size: There are two same-shape objects and two same-shape containers in the scene initialization. One object and one container are randomly magnified while the others are randomly shrunk.
Relative position: There are two same-shape objects and two same-shape containers in the scene initialization. The object description is "[front/rear/left/right] object"; the container description is "[front/rear/left/right] container."

The number of objects varies from two to the length of the object library. High-level instructions: "Stack [below object description] and [above object description]". Low-level instructions: "Move to the top of [above object description]"; "Move the object on [below object description]"; "Release the object".

Object models: In the seen settings, five object models are used: star, triangular prism, cylinder, cube, moon.
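The bracketed instruction templates above are filled with scene-specific descriptions at generation time. A minimal sketch of that substitution step, with illustrative color values (the function name and example strings are our own, not the benchmark's code):

```python
def pick_place_instruction(obj_desc, container_desc, high_level=True):
    """Instantiate the Pick & Place templates with concrete descriptions."""
    if high_level:
        return f"Pick up {obj_desc} and place it into {container_desc}."
    # Low-level instructions are issued as an ordered sequence of steps.
    return [f"Move to the top of {obj_desc}.",
            f"Move the object into {container_desc}."]

instr = pick_place_instruction("the red object", "the green container")
```

The same pattern applies to the color, size, and relative-position variations, which differ only in the description string substituted into the brackets.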


ImaginationPolicy: Towards Generalizable, Precise and Reliable End-to-End Policy for Robotic Manipulation

Lu, Dekun, Gao, Wei, Jia, Kui

arXiv.org Artificial Intelligence

End-to-end robot manipulation policies offer significant potential for enabling embodied agents to understand and interact with the world. Unlike traditional modular pipelines, end-to-end learning mitigates key limitations such as information loss between modules and feature misalignment caused by isolated optimization targets. Despite these advantages, existing end-to-end neural networks for robotic manipulation--including those based on large VLM/VLA models--remain insufficiently performant for large-scale practical deployment. In this paper, we take a step towards an end-to-end manipulation policy that is generalizable, accurate and reliable. To achieve this goal, we propose a novel Chain of Moving Oriented Keypoints (CoMOK) formulation for robotic manipulation. Our formulation is used as the action representation of a neural policy, which can be trained in an end-to-end fashion. Such an action representation is general, as it extends the standard end-effector pose action representation and supports a diverse set of manipulation tasks in a unified manner. The oriented keypoint in our method enables natural generalization to objects with different shapes and sizes, while achieving sub-centimeter accuracy. Moreover, our formulation can easily handle multi-stage tasks, multi-modal robot behaviors, and deformable objects. Extensive simulated and hardware experiments demonstrate the effectiveness of our method.
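As we read the abstract, the CoMOK action representation replaces a single end-effector pose target with an ordered chain of oriented keypoints. The sketch below is a hypothetical data-structure reading of that idea (all names and fields are our own assumptions, not the paper's implementation):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class OrientedKeypoint:
    """One sub-goal: a 3D position plus an orientation (unit quaternion w,x,y,z)."""
    position: Tuple[float, float, float]
    orientation: Tuple[float, float, float, float]

@dataclass
class CoMOKAction:
    """A chain of oriented keypoints, executed in order, stage by stage."""
    chain: List[OrientedKeypoint]

    def n_stages(self) -> int:
        return len(self.chain)

# A two-stage grasp-then-lift chain (values are illustrative).
grasp = OrientedKeypoint((0.3, 0.0, 0.1), (1.0, 0.0, 0.0, 0.0))
lift = OrientedKeypoint((0.3, 0.0, 0.3), (1.0, 0.0, 0.0, 0.0))
action = CoMOKAction([grasp, lift])
```

Representing actions as a chain rather than a single pose is what lets the formulation cover multi-stage tasks in one unified output format.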



Point2Act: Efficient 3D Distillation of Multimodal LLMs for Zero-Shot Context-Aware Grasping

Kim, Sang Min, Heo, Hyeongjun, Kim, Junho, Lee, Yonghyeon, Kim, Young Min

arXiv.org Artificial Intelligence

We propose Point2Act, which directly retrieves the 3D action point relevant for a contextually described task, leveraging Multimodal Large Language Models (MLLMs). Foundation models opened the possibility for generalist robots that can perform a zero-shot task following natural language descriptions within an unseen environment. While the semantics obtained from large-scale image and language datasets provide contextual understanding in 2D images, the rich yet nuanced features yield blurry 2D regions and struggle to pin down precise 3D locations for actions. Our proposed 3D relevancy fields bypass the high-dimensional features and instead efficiently imbue lightweight 2D point-level guidance tailored to the task-specific action. The multi-view aggregation effectively compensates for misalignments due to geometric ambiguities, such as occlusion, or semantic uncertainties inherent in the language descriptions. The output region is highly localized, capturing fine-grained 3D spatial context that transfers directly to an explicit position for physical action on the on-the-fly reconstruction of the scene. Our full-stack pipeline, which includes capturing, MLLM querying, 3D reconstruction, and grasp pose extraction, generates spatially grounded responses in under 20 seconds, facilitating practical manipulation tasks. Project page: https://sangminkim-99.github.io/point2act/
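The multi-view aggregation idea can be sketched with a simplified example. This is our own simplification, not the released pipeline: assume each candidate 3D point has already collected a 2D relevancy score from every view in which it is visible; averaging across views suppresses single-view errors caused by occlusion or ambiguous language, and the fused maximum becomes the action point.

```python
def aggregate_relevancy(per_view_scores):
    """Fuse per-view 2D relevancy into a single 3D action point.

    per_view_scores: dict mapping point id -> list of (visible, score)
                     pairs, one per camera view.
    """
    fused = {}
    for pid, observations in per_view_scores.items():
        visible = [score for is_visible, score in observations if is_visible]
        # Average only over views where the point is visible.
        fused[pid] = sum(visible) / len(visible) if visible else 0.0
    return max(fused, key=fused.get)  # action point = highest fused relevancy

best = aggregate_relevancy({
    "handle": [(True, 0.9), (True, 0.8), (False, 0.0)],  # occluded in view 3
    "rim":    [(True, 0.95), (True, 0.2), (True, 0.3)],  # one spurious high view
})
```

A single misleading view (the 0.95 on "rim") is outvoted once scores are averaged across viewpoints, which is the failure mode the paper's aggregation is designed to handle.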