
Bipartite Stochastic Block Models with Tiny Clusters

Stefan Neumann

Neural Information Processing Systems

Discovering clusters in bipartite graphs has been researched in many different settings. However, most of these algorithms are heuristics and do not provide theoretical guarantees for the quality of their results.



Modelling and unsupervised learning of symmetric deformable object categories

James Thewlis, Hakan Bilen, Andrea Vedaldi

Neural Information Processing Systems

Top: input images with the axis of symmetry superimposed (shown in green). In fact, our method builds on [38] and also learns a dense geometric embedding for objects, however, by using a different supervision principle: symmetry.





VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents

Wang, Kangrui, Zhang, Pingyue, Wang, Zihan, Gao, Yaning, Li, Linjie, Wang, Qineng, Chen, Hanyang, Wan, Chi, Lu, Yiping, Yang, Zhengyuan, Wang, Lijuan, Krishna, Ranjay, Wu, Jiajun, Fei-Fei, Li, Choi, Yejin, Li, Manling

arXiv.org Artificial Intelligence

A key challenge in training Vision-Language Model (VLM) agents, compared to Language Model (LLM) agents, lies in the shift from textual states to complex visual observations. This transition introduces partial observability and demands robust world modeling. We ask: Can VLM agents construct internal world models through explicit visual state reasoning? To address this question, we architecturally enforce and reward the agent's reasoning process via reinforcement learning (RL), formulating it as a Partially Observable Markov Decision Process (POMDP). We find that decomposing the agent's reasoning into State Estimation ("what is the current state?") and Transition Modeling ("what comes next?") is critical for success, as demonstrated through five reasoning strategies. Our investigation into how agents represent internal beliefs reveals that the optimal representation is task-dependent: Natural Language excels at capturing semantic relationships in general tasks, while Structured formats are indispensable for precise manipulation and control. Building on these insights, we design a World Modeling Reward that provides dense, turn-level supervision for accurate state prediction, and introduce Bi-Level General Advantage Estimation (Bi-Level GAE) for turn-aware credit assignment. Through this form of visual state reasoning, a 3B-parameter model achieves a score of 0.82 across five diverse agent benchmarks, representing a 3× improvement over its untrained counterpart (0.21) and outperforming proprietary reasoning models such as GPT-5 (0.75), Gemini 2.5 Pro (0.67) and Claude 4.5 (0.62). All experiments are conducted within our VAGEN framework, a scalable system for training and analyzing multi-turn VLM agents in diverse visual environments. Code and data are publicly available at https://vagen-ai.github.io.
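To make the "turn-aware credit assignment" idea concrete, the following is a minimal, hypothetical sketch of a bi-level scheme: standard Generalized Advantage Estimation (GAE) run first across turns, with each turn's advantage then broadcast as a terminal reward for that turn's tokens and GAE run again within the turn. The function names, the broadcast step, and the hyperparameters are illustrative assumptions, not VAGEN's exact formulation.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Standard GAE over one trajectory: A[t] = sum_k (gamma*lam)^k * delta[t+k]."""
    advantages = [0.0] * len(rewards)
    next_value = 0.0   # bootstrap value after the final step (episode ends)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

def bi_level_gae(turn_rewards, turn_values, token_values_per_turn,
                 gamma=0.99, lam=0.95):
    # High level: credit assignment across turns of the multi-turn episode.
    turn_adv = gae(turn_rewards, turn_values, gamma, lam)
    # Low level: treat each turn's advantage as a terminal reward for that
    # turn's tokens, then run GAE again over the token sequence.
    token_adv = []
    for adv, token_values in zip(turn_adv, token_values_per_turn):
        token_rewards = [0.0] * len(token_values)
        token_rewards[-1] = adv
        token_adv.append(gae(token_rewards, token_values, gamma, lam))
    return token_adv
```

With `gamma = lam = 1` and zero value baselines, the advantage of a final-turn reward propagates undiscounted to every earlier turn and token, which is the undiscounted Monte Carlo special case of GAE.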



HumanoidVerse: A Versatile Humanoid for Vision-Language Guided Multi-Object Rearrangement

Zhang, Haozhuo, Sun, Jingkai, Caprio, Michele, Tang, Jian, Zhang, Shanghang, Zhang, Qiang, Pan, Wei

arXiv.org Artificial Intelligence

We introduce HumanoidVerse, a novel framework for vision-language guided humanoid control that enables a single physically simulated robot to perform long-horizon, multi-object rearrangement tasks across diverse scenes. Unlike prior methods that operate in fixed settings with single-object interactions, our approach supports consecutive manipulation of multiple objects, guided only by natural language instructions and egocentric camera RGB observations. HumanoidVerse is trained via a multi-stage curriculum using a dual-teacher distillation pipeline, enabling fluid transitions between sub-tasks without requiring environment resets. To support this, we construct a large-scale dataset comprising 350 multi-object tasks spanning four room layouts. Extensive experiments in the Isaac Gym simulator demonstrate that our method significantly outperforms prior state-of-the-art in both task success rate and spatial precision, and generalizes well to unseen environments and instructions. Our work represents a key step toward robust, general-purpose humanoid agents capable of executing complex, sequential tasks under real-world sensory constraints. The video visualization results can be found on the project page: https://haozhuo-zhang.github.io/HumanoidVerse-project-page/.