
VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation

Neural Information Processing Systems

Benefiting from language flexibility and compositionality, humans naturally tend to use language to command embodied agents to perform complex tasks such as navigation and object manipulation. In this work, we aim to fill in the last mile of embodied agents: object manipulation by following human guidance, e.g., "move the red mug next to the box while keeping it upright." To this end, we introduce an Automatic Manipulation Solver (AMSolver) system and build a Vision-and-Language Manipulation benchmark (VLMbench) on top of it, containing varied language instructions for categorized robotic manipulation tasks. Specifically, modular rule-based task templates automatically generate robot demonstrations with language instructions, covering diverse object shapes and appearances, action types, and motion constraints. We also develop a keypoint-based model, 6D-CLIPort, that handles multi-view observations and language input and outputs a sequence of 6-degree-of-freedom (DoF) actions. We hope the new simulator and benchmark will facilitate future research on language-guided robotic manipulation.


Counterfactual Vision-and-Language Navigation: Unravelling the Unseen

Neural Information Processing Systems

The task of vision-and-language navigation (VLN) requires an agent to follow text instructions to find its way through simulated household environments. A prominent challenge is to train an agent capable of generalising to new environments at test time, rather than one that simply memorises trajectories and visual details observed during training. We propose a new learning strategy that learns both from observations and from generated counterfactual environments. We describe an effective algorithm to generate counterfactual observations on the fly for VLN, as linear combinations of existing environments. Simultaneously, we encourage the agent's actions to remain stable between original and counterfactual environments through our novel training objective, effectively removing the spurious features that otherwise bias the agent. Our experiments show that this technique provides significant improvements in generalisation on benchmarks for Room-to-Room navigation and Embodied Question Answering.
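The two ingredients described above (counterfactual observations as linear combinations of environment features, plus an action-stability term) can be sketched schematically. The function names and the squared-L2 penalty below are illustrative assumptions, not the paper's exact objective:

```python
import numpy as np

def counterfactual_mix(obs_a, obs_b, alpha):
    """Counterfactual observation as a linear combination of two
    environments' visual features (a schematic of the paper's idea)."""
    return alpha * obs_a + (1.0 - alpha) * obs_b

def action_consistency_penalty(probs_orig, probs_cf):
    """Encourage the policy's action distribution to stay stable between
    the original and counterfactual observations (squared L2 used here as
    a simple stand-in for the paper's training objective)."""
    return float(np.sum((probs_orig - probs_cf) ** 2))

# Toy usage: two 4-dim feature vectors, mixed with alpha = 0.8.
obs_a = np.array([1.0, 0.0, 0.5, 0.2])
obs_b = np.array([0.0, 1.0, 0.5, 0.8])
mixed = counterfactual_mix(obs_a, obs_b, alpha=0.8)
print(mixed)
```

Training would then add the penalty between the policy's action distributions on `obs_a` and `mixed` to the usual navigation loss.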


UNSEEN: Enhancing Dataset Pruning from a Generalization Perspective

Xu, Furui, Wang, Shaobo, Zhang, Jiajun, Sun, Chenghao, Tang, Haixiang, Zhang, Linfeng

arXiv.org Artificial Intelligence

The growing scale of datasets in deep learning has introduced significant computational challenges. Dataset pruning addresses this challenge by constructing a compact but informative coreset from the full dataset that achieves comparable performance. Previous approaches typically establish scoring metrics based on specific criteria to identify representative samples. However, these methods predominantly rely on sample scores obtained from the model's performance during the training (i.e., fitting) phase. As scoring models achieve near-optimal performance on training data, such fitting-centric approaches induce a dense distribution of sample scores within a narrow numerical range. This concentration reduces the distinction between samples and hinders effective selection. To address this challenge, we conduct dataset pruning from the perspective of generalization, i.e., scoring samples with models not exposed to them during training. We propose a plug-and-play framework, UNSEEN, which can be integrated into existing dataset pruning methods. Additionally, conventional score-based methods are single-step and rely on models trained solely on the complete dataset, providing a limited perspective on the importance of samples. To address this limitation, we scale UNSEEN to multi-step scenarios, proposing an incremental selection technique that uses scoring models trained on varying coresets and dynamically optimizes the quality of the coreset. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art (SOTA) methods on CIFAR-10, CIFAR-100, and ImageNet-1K. Notably, on ImageNet-1K, UNSEEN achieves lossless performance while reducing training data by 30%.
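The core idea of scoring each sample with a model that never saw it during training can be sketched as a cross-validation-style loop. The nearest-centroid "model" and distance-based score below are stand-in assumptions for illustration, not the paper's networks or metric:

```python
import numpy as np

def unseen_scores(X, y, n_folds=5, seed=0):
    """Score every sample with a model trained without it (a minimal
    sketch of generalization-based scoring; the nearest-centroid 'model'
    is a hypothetical stand-in for a trained network)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)
    scores = np.empty(len(X))
    for k in range(n_folds):
        held_out = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # "Train" the stand-in model: class centroids from samples
        # outside this fold only.
        centroids = {c: X[train][y[train] == c].mean(axis=0)
                     for c in np.unique(y[train])}
        # Score held-out samples by distance to their own class centroid:
        # larger distance = harder / more informative under an unseen model.
        for i in held_out:
            scores[i] = np.linalg.norm(X[i] - centroids[y[i]])
    return scores
```

Sorting these scores then yields a candidate coreset; the multi-step variant would repeat the loop with models trained on the coresets selected so far.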


Rethinking what Matters: Effective and Robust Multilingual Realignment for Low-Resource Languages

Nguyen, Quang Phuoc, Anugraha, David, Gaschi, Felix, Cheng, Jun Bin, Lee, En-Shiun Annie

arXiv.org Artificial Intelligence

Realignment is a promising strategy to improve cross-lingual transfer in multilingual language models. However, empirical results are mixed and often unreliable, particularly for typologically distant or low-resource languages (LRLs) compared to English. Moreover, word-alignment tools often rely on high-quality parallel data, which can be scarce or noisy for many LRLs. In this work, we conduct an extensive empirical study to investigate whether realignment truly benefits from using all available languages, or whether strategically selected subsets can offer comparable or even improved cross-lingual transfer, and we study the impact on LRLs. Our controlled experiments show that realignment can be particularly effective for LRLs and that using carefully selected, linguistically diverse subsets can match full multilingual alignment, and even outperform it for unseen LRLs. This indicates that effective realignment does not require exhaustive language coverage and can reduce data collection overhead, while remaining both efficient and robust when guided by informed language selection.


UniCoD: Enhancing Robot Policy via Unified Continuous and Discrete Representation Learning

Zhang, Jianke, Hu, Yucheng, Guo, Yanjiang, Chen, Xiaoyu, Liu, Yichen, Chen, Wenna, Lu, Chaochao, Chen, Jianyu

arXiv.org Artificial Intelligence

Building generalist robot policies that can handle diverse tasks in open-ended environments is a central challenge in robotics. To leverage knowledge from large-scale pretraining, prior work on vision-language-action (VLA) models has typically built generalist policies either on top of vision-language understanding models (VLMs) or on top of generative models. However, both semantic understanding from vision-language pretraining and visual dynamics modeling from visual-generation pretraining are crucial for embodied robots. Recent unified models of generation and understanding have demonstrated strong capabilities in both comprehension and generation through large-scale pretraining. We posit that robotic policy learning can likewise benefit from the combined strengths of understanding, planning, and continuous future representation learning. Building on this insight, we introduce UniCoD, which acquires the ability to dynamically model high-dimensional visual features through pretraining on over 1M internet-scale instructional manipulation videos. Subsequently, UniCoD is fine-tuned on data collected from the robot embodiment, enabling the learning of mappings from predictive representations to action tokens. Extensive experiments show our approach consistently outperforms baseline methods by 9% and 12% in simulation environments and real-world out-of-distribution tasks, respectively.


A Task Details

Neural Information Processing Systems

Table 5: All task variations except shape used in VLMbench.

Table 6: All object models used in VLMbench.

  Object type         Number of classes   Classes
  Basic model         3                   cube (1), triangular prism (1), cylinder (1)
  Special model       9                   star (1), moon (1), cross (1), flower (1), letter 't' (1), pencil (1), basket (1), box container (1), shape sorter (1)
  Planar model        6                   rectangle (1), circle (1), triangle (1), star (1), cross (1), flower (1)
  Functional model    2                   mug (6), sponge (1)
  Articulated model   2                   door with one rotatable handle (2), cabinet with three vertical drawers (3)

In VLMbench, there are eight task categories, including "Pick & Place objects", "Stack objects", and "Drop ...". When building an instance-level task with one variation, the other variations also change randomly (for example, in the demonstrations of "Pick & Place objects"). The dataset contains five types of objects, shown in Table 6; visualizations can be found on the project website.

Pick & Place objects: The object can be placed anywhere, with any orientation, inside the container. When the detector is triggered, the task is considered a success.
Instruction templates: High-level instructions: "Pick up [target object description] and place it into [target container description]." Low-level instructions: "Move to the top of [target object description]"; "Move the object into [target container description]"; ...
Variations and scene settings: All objects randomly change color, size, and position in each demonstration.
Color: There are two same-shape objects and two same-shape containers in the scene initialization. All colors are randomly sampled from the color library. The object description is "[color] object"; the container description is "[color] container."
Size: There are two same-shape objects and two same-shape containers in the scene initialization. One object and one container are randomly magnified while the others are randomly shrunk.
Relative position: There are two same-shape objects and two same-shape containers in the scene initialization. The object description is "[front/rear/left/right] object"; the container description is ...

Stack objects: The number of objects varies from two to the length of the object library.
Instruction templates: High-level instructions: "Stack [below object description] and [above object description] ..." Low-level instructions: "Move to the top of [above object description]"; "Move the object on [below object description]"; "Release the ..."
Object models: In the seen settings, five object models are used: star, triangular prism, cylinder, cube, moon.
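The rule-based templates above amount to filling slots with sampled variation attributes. A minimal hypothetical sketch of such slot filling (the color library, template string, and function names are illustrative, not AMSolver's actual API):

```python
import random

# Hypothetical color library and template, illustrating how a rule-based
# template could produce a color-variation instruction.
COLOR_LIBRARY = ["red", "green", "blue", "yellow"]

HIGH_LEVEL_TEMPLATE = (
    "Pick up the {object_desc} and place it into the {container_desc}."
)

def sample_instruction(rng: random.Random) -> str:
    """Sample one instruction by filling the template's description slots
    with "[color] object" / "[color] container" descriptions."""
    object_color, container_color = rng.sample(COLOR_LIBRARY, 2)
    return HIGH_LEVEL_TEMPLATE.format(
        object_desc=f"{object_color} object",
        container_desc=f"{container_color} container",
    )

instr = sample_instruction(random.Random(0))
print(instr)
```

The same pattern extends to size and relative-position variations by swapping in the corresponding descriptions.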



Estimating the Unseen: Improved Estimators for Entropy and other Properties

Neural Information Processing Systems

Recently, [Valiant and Valiant] showed that a class of distributional properties, which includes such practically relevant properties as entropy, the number of distinct elements, and distance metrics between pairs of distributions, can be estimated given a sublinear-sized sample. Specifically, given a sample consisting of independent draws from any distribution over at most n distinct elements, these properties can be estimated accurately using a sample of size O(n / log n). We propose a novel modification of this approach and show: 1) theoretically, our estimator is optimal (to constant factors, over worst-case instances), and 2) in practice, it performs exceptionally well for a variety of estimation tasks, on a variety of natural distributions, for a wide range of parameters. Perhaps unsurprisingly, the key step in this approach is to first use the sample to characterize the unseen portion of the distribution. This goes beyond such tools as the Good-Turing frequency estimation scheme, which estimates the total probability mass of the unobserved portion of the distribution: we seek to estimate the shape of the unobserved portion of the distribution. This approach is robust, general, and theoretically principled; we expect that it may be fruitfully used as a component within larger machine learning and data analysis systems.
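The Good-Turing baseline mentioned above, which estimates only the total unseen mass rather than its shape, is easy to state concretely, as is the naive plug-in entropy it is meant to improve upon. A minimal sketch:

```python
import math
from collections import Counter

def good_turing_unseen_mass(sample):
    """Good-Turing estimate of the total probability mass of unobserved
    elements: (# elements seen exactly once) / (sample size)."""
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)

def plugin_entropy(sample):
    """Naive plug-in (empirical) entropy in nats; it ignores the unseen
    portion of the distribution and so underestimates entropy on small
    samples, which is the gap the paper's estimator targets."""
    n = len(sample)
    counts = Counter(sample)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

For example, in the sample ["a", "a", "b", "c"], two of the four draws are singletons, so the Good-Turing unseen mass is 0.5.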