Experiences from Benchmarking Vision-Language-Action Models for Robotic Manipulation
Zhang, Yihao, Qi, Yuankai, Zheng, Xi
Foundation models applied in robotics, particularly \textbf{Vision--Language--Action (VLA)} models, hold great promise for achieving general-purpose manipulation. Yet, systematic real-world evaluations and cross-model comparisons remain scarce. This paper reports our \textbf{empirical experiences} from benchmarking four representative VLAs -- \textbf{ACT}, \textbf{OpenVLA--OFT}, \textbf{RDT-1B}, and $\boldsymbol{\pi_0}$ -- across four manipulation tasks conducted both in simulation and on the \textbf{ALOHA Mobile} platform. We establish a \textbf{standardized evaluation framework} that measures performance along three key dimensions: (1) \textit{accuracy and efficiency} (success rate and time-to-success), (2) \textit{adaptability} across in-distribution, spatial out-of-distribution, and instance-plus-spatial out-of-distribution settings, and (3) \textit{language instruction-following accuracy}. Through this process, we observe that $\boldsymbol{\pi_0}$ demonstrates superior adaptability in out-of-distribution scenarios, while \textbf{ACT} provides the highest in-distribution stability. Further analysis highlights differences in computational demands, data-scaling behavior, and recurring failure modes such as near-miss grasps, premature releases, and long-horizon state drift. These findings reveal practical trade-offs among VLA model architectures in balancing precision, generalization, and deployment cost, offering actionable insights for selecting and deploying VLAs in real-world robotic manipulation tasks.
- Oceania > Australia > New South Wales > Sydney (0.04)
- Asia > South Korea > Daegu > Daegu (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.34)
Prescribed Performance Control of Deformable Object Manipulation in Spatial Latent Space
Han, Ning, Gong, Gu, Zhang, Bin, Xu, Yuexuan, Yang, Bohan, Liu, Yunhui, Navarro-Alarcon, David
Manipulating three-dimensional (3D) deformable objects presents significant challenges for robotic systems due to their infinite-dimensional state space and complex deformable dynamics. This paper proposes a novel model-free approach for shape control with constraints imposed on key points. Unlike existing methods that rely on feature dimensionality reduction, the proposed controller leverages the coordinates of key points as the feature vector, which are extracted from the deformable object's point cloud using deep learning methods. This approach not only reduces the dimensionality of the feature space but also retains the spatial information of the object. By extracting key points, the manipulation of deformable objects is simplified into a visual servoing problem, where the shape dynamics are described using a deformation Jacobian matrix. To enhance control accuracy, a prescribed performance control method is developed by integrating barrier Lyapunov functions (BLF) to enforce constraints on the key points. The stability of the closed-loop system is rigorously analyzed and verified using the Lyapunov method. Experimental results further demonstrate the effectiveness and robustness of the proposed method.
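The visual-servoing core described above, a key-point error mapped through the deformation Jacobian, can be sketched as a plain pseudo-inverse update. This is a generic sketch under assumed shapes; the paper's BLF-based prescribed-performance constraints and the deep-learned key-point extractor are omitted, and `servo_step` is an invented name:

```python
import numpy as np

def servo_step(x: np.ndarray, x_des: np.ndarray, J: np.ndarray,
               gain: float = 1.0) -> np.ndarray:
    """One servoing update: map the key-point feature error to a robot velocity
    through the pseudo-inverse of the deformation Jacobian.
    Shapes: x, x_des in R^m (stacked key-point coordinates), J in R^{m x n}."""
    err = x - x_des
    J_pinv = np.linalg.pinv(J)   # a damped inverse could be used near singularities
    return -gain * J_pinv @ err  # commanded end-effector velocity in R^n

# toy check: with J = I the update reduces to proportional error feedback
u = servo_step(np.array([1.0, 2.0]), np.zeros(2), np.eye(2), gain=0.5)
```

The same structure holds when J is estimated online, which is what makes the approach model-free.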
- North America > United States (0.04)
- Asia > China > Hong Kong > Kowloon (0.04)
A Task Details
Table 5: All task variations except shape used in VLMbench.

Table 6: All object models used in VLMbench.
- Basic model (3 classes): cube (1), triangular prism (1), cylinder (1)
- Special model (9 classes): star (1), moon (1), cross (1), flower (1), letter 't' (1), pencil (1), basket (1), box container (1), shape sorter (1)
- Planar model (6 classes): rectangle (1), circle (1), triangle (1), star (1), cross (1), flower (1)
- Functional model (2 classes): mug (6), sponge (1)
- Articulated model (2 classes): door with one rotatable handle (2), cabinet with three vertical drawers (3)

In VLMbench, we show eight task categories, including "Pick & Place objects", "Stack objects", and "Drop …". When building an instance-level task with one variation, the other variations also change randomly, for example in the demonstrations of "Pick & Place objects". In the dataset, we have five types of objects, shown in Table 6; visualizations can be found on the project website. The object can be placed anywhere, with any orientation, inside the container. When the detector is triggered, the task is considered a success.

Instruction templates:
- High-level instructions: "Pick up [target object description] and place it into [target container description]."
- Low-level instructions: "Move to the top of [target object description]"; "Move the object into [target container description]".

Variations and scene settings: All objects randomly change colors, sizes, and positions in each demonstration.
- Color: There are two same-shape objects and two same-shape containers in the scene initialization. All colors are randomly sampled from the color library. The object description is "[color] object"; the container description is "[color] container."
- Size: There are two same-shape objects and two same-shape containers in the scene initialization. One object and one container are randomly magnified while the others are randomly shrunk.
- Relative position: There are two same-shape objects and two same-shape containers in the scene initialization. The object description is "[front/rear/left/right] object"; the container description …

For "Stack objects", the number of objects varies from two up to the length of the object library.
- High-level instructions: "Stack [below object description] and [above object …"
- Low-level instructions: "Move to the top of [above object description]"; "Move the object on [below object description]"; "Release the object."

Object models: In the seen settings, five object models are used: star, triangular prism, cylinder, cube, moon.
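The instruction templates above lend themselves to a tiny template filler. This is a hypothetical sketch (function and key names are invented here), showing only how the "[color] object" and "[front/rear/left/right] object" descriptions would slot into the high-level template:

```python
HIGH_LEVEL = "Pick up {obj} and place it into {container}."

def describe(variation: str, attrs: dict) -> tuple[str, str]:
    """Build object/container descriptions for a VLMbench-style variation."""
    if variation == "color":
        return f"the {attrs['obj_color']} object", f"the {attrs['box_color']} container"
    if variation == "relative_position":
        return f"the {attrs['obj_pos']} object", f"the {attrs['box_pos']} container"
    raise ValueError(f"unknown variation: {variation}")

def make_instruction(variation: str, attrs: dict) -> str:
    """Fill the high-level pick-and-place template from scene attributes."""
    obj, container = describe(variation, attrs)
    return HIGH_LEVEL.format(obj=obj, container=container)

text = make_instruction("color", {"obj_color": "red", "box_color": "blue"})
```

The low-level templates would be filled the same way, one string per motion step.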
Sea sponges may have been Earth's first living creatures
The prehistoric invertebrates likely arrived about 541 million years ago. At a certain point in Earth's distant past, the planet's assortment of organic molecules and compounds aligned to create the very first living organisms. The identity of these first living things continues to elude evolutionary biologists. However, a team of MIT geochemists believes that some 541-million-year-old chemical fossils embedded in sediment indicate that some of Earth's earliest creatures were the ancient relatives of today's sea sponges.
- Asia > Middle East > Oman (0.06)
- Asia > India (0.05)
VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback
Bi, Jianxin, Ma, Kevin Yuchen, Hao, Ce, Shou, Mike Zheng, Soh, Harold
Tactile feedback is generally recognized to be crucial for effective interaction with the physical world. However, state-of-the-art Vision-Language-Action (VLA) models lack the ability to interpret and use tactile signals, limiting their effectiveness in contact-rich tasks. Incorporating tactile feedback into these systems is challenging due to the absence of large multi-modal datasets. We present VLA-Touch, an approach that enhances generalist robot policies with tactile sensing \emph{without fine-tuning} the base VLA. Our method introduces two key innovations: (1) a pipeline that leverages a pretrained tactile-language model that provides semantic tactile feedback for high-level task planning, and (2) a diffusion-based controller that refines VLA-generated actions with tactile signals for contact-rich manipulation. Through real-world experiments, we demonstrate that our dual-level integration of tactile feedback improves task planning efficiency while enhancing execution precision. Code is open-sourced at \href{https://github.com/jxbi1010/VLA-Touch}{this URL}.
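The low-level half of the dual-level design, refining VLA-generated actions with tactile signals while leaving the base VLA frozen, can be sketched as a post-processing hook. The `refiner` stand-in below is purely illustrative; the paper uses a diffusion-based controller, which is not reproduced here:

```python
from typing import Callable
import numpy as np

def refine_action(vla_action: np.ndarray, tactile: np.ndarray,
                  refiner: Callable[[np.ndarray, np.ndarray], np.ndarray]) -> np.ndarray:
    """Post-process the base VLA's action with a tactile-conditioned correction.
    The VLA itself is never fine-tuned; only its output is adjusted."""
    delta = refiner(vla_action, tactile)
    return vla_action + delta

# stand-in "controller": tighten the gripper command in proportion to the
# tactile signal magnitude (invented behavior, for illustration only)
toy_refiner = lambda a, t: np.array([0.0, 0.0, 0.0, -0.1 * float(np.linalg.norm(t))])
refined = refine_action(np.zeros(4), np.array([1.0, 0.0]), toy_refiner)
```

The point of the interface is that any tactile-conditioned corrector can be swapped in without touching the base policy's weights.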
- Europe > Netherlands > South Holland > Delft (0.04)
- Asia > Singapore (0.04)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
Bi-LAT: Bilateral Control-Based Imitation Learning via Natural Language and Action Chunking with Transformers
Kobayashi, Takumi, Kobayashi, Masato, Buamanee, Thanpimon, Uranishi, Yuki
We present Bi-LAT, a novel imitation learning framework that unifies bilateral control with natural language processing to achieve precise force modulation in robotic manipulation. Bi-LAT leverages joint position, velocity, and torque data from leader-follower teleoperation while also integrating visual and linguistic cues to dynamically adjust applied force. By encoding human instructions such as "softly grasp the cup" or "strongly twist the sponge" through a multimodal Transformer-based model, Bi-LAT learns to distinguish nuanced force requirements in real-world tasks. We demonstrate Bi-LAT's performance in (1) a unimanual cup-stacking scenario where the robot accurately modulates grasp force based on language commands, and (2) a bimanual sponge-twisting task that requires coordinated force control. Experimental results show that Bi-LAT effectively reproduces the instructed force levels, particularly when incorporating SigLIP among the tested language encoders. Our findings demonstrate the potential of integrating natural language cues into imitation learning, paving the way for more intuitive and adaptive human-robot interaction.
- Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.05)
- North America > United States (0.04)
A Human-in-the-loop Approach to Robot Action Replanning through LLM Common-Sense Reasoning
Merlo, Elena, Lagomarsino, Marta, Ajoudani, Arash
To facilitate the wider adoption of robotics, accessible programming tools are required for non-experts. Observational learning enables intuitive transfer of human skills through hands-on demonstrations, but relying solely on visual input can be inefficient in terms of scalability and failure mitigation, especially when based on a single demonstration. This paper presents a human-in-the-loop method for enhancing a robot execution plan, automatically generated from a single RGB video, with natural language input to a Large Language Model (LLM). By including user-specified goals or critical task aspects and exploiting the LLM's common-sense reasoning, the system adjusts the vision-based plan to prevent potential failures and adapts it based on the received instructions. Experiments demonstrated the framework's intuitiveness and effectiveness in correcting vision-derived errors and adapting plans without requiring additional demonstrations. Moreover, interactive plan refinement and correction of hallucinations improved system robustness.
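The refinement step described above can be sketched as a prompt-and-parse helper wrapped around any text-in/text-out LLM client. Names and the prompt wording are hypothetical; the `fake_llm` stand-in exists only so the sketch runs offline:

```python
from typing import Callable

def refine_plan(plan: list[str], user_note: str,
                llm: Callable[[str], str]) -> list[str]:
    """Human-in-the-loop refinement: serialize the vision-derived plan plus the
    user's natural-language note into one prompt, then parse the model's
    revised plan back out (one step per line)."""
    prompt = (
        "Current robot plan:\n"
        + "\n".join(f"{i + 1}. {s}" for i, s in enumerate(plan))
        + f"\nUser note: {user_note}\n"
        + "Return the corrected plan, one step per line."
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

# stand-in model that inserts a missing precondition (for illustration only)
fake_llm = lambda p: "open the drawer\npick up the cup\nplace cup in drawer"
steps = refine_plan(["pick up the cup", "place cup in drawer"],
                    "the drawer starts closed", fake_llm)
```

In a real system the same wrapper would call an actual LLM endpoint, and the parsed plan would feed back into the robot executor.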
Adaptive Wiping: Adaptive contact-rich manipulation through few-shot imitation learning with Force-Torque feedback and pre-trained object representations
Tsuji, Chikaha, Coronado, Enrique, Osorio, Pablo, Venture, Gentiane
Imitation learning offers a pathway for robots to perform repetitive tasks, allowing humans to focus on more engaging and meaningful activities. However, challenges arise from the need for extensive demonstrations and the disparity between training and real-world environments. This paper focuses on contact-rich tasks like wiping with soft and deformable objects, requiring adaptive force control to handle variations in wiping surface height and the sponge's physical properties. To address these challenges, we propose a novel method that integrates real-time force-torque (FT) feedback with pre-trained object representations. This approach allows robots to dynamically adjust to previously unseen changes in surface heights and sponges' physical properties. In real-world experiments, our method achieved 96% accuracy in applying reference forces, significantly outperforming the previous method that lacked an FT feedback loop, which only achieved 4% accuracy. To evaluate the adaptability of our approach, we conducted experiments under different conditions from the training setup, involving 40 scenarios using 10 sponges with varying physical properties and 4 types of wiping surface heights, demonstrating significant improvements in the robot's adaptability by analyzing force trajectories. The video of our work is available at: https://sites.google.com/view/adaptive-wiping
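The role of the FT feedback loop can be illustrated with a bare proportional force controller: lower the tool when the measured normal force is below the reference, raise it when above. This generic admittance-style sketch (names and gain invented here) is not the paper's learned policy, which couples FT feedback with pre-trained object representations:

```python
def ft_feedback_step(z: float, f_meas: float, f_ref: float,
                     kp: float = 1e-4) -> float:
    """One cycle of a proportional force loop for wiping.
    z is tool height in meters; when more force is needed (f_ref > f_meas),
    the tool moves down, which presses the sponge harder into the surface."""
    return z - kp * (f_ref - f_meas)

z = 0.10
for _ in range(3):  # iterate a few cycles toward the reference force
    z = ft_feedback_step(z, f_meas=2.0, f_ref=5.0)
```

A loop like this is what lets the commanded height track unseen surface heights and sponge stiffnesses instead of replaying fixed trajectories.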
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- North America > United States (0.04)
- Asia > Japan > Honshū > Kansai > Hyogo Prefecture > Kobe (0.04)
VACT: A Video Automatic Causal Testing System and a Benchmark
Yang, Haotong, Zheng, Qingyuan, Gao, Yunjian, Yang, Yongkun, He, Yangbo, Lin, Zhouchen, Zhang, Muhan
With the rapid advancement of text-conditioned Video Generation Models (VGMs), the quality of generated videos has significantly improved, bringing these models closer to functioning as "*world simulators*" and making real-world-level video generation more accessible and cost-effective. However, the generated videos often contain factual inaccuracies and lack understanding of fundamental physical laws. While some previous studies have highlighted this issue in limited domains through manual analysis, a comprehensive solution has not yet been established, primarily due to the absence of a generalized, automated approach for modeling and assessing the causal reasoning of these models across diverse scenarios. To address this gap, we propose VACT: an **automated** framework for modeling, evaluating, and measuring the causal understanding of VGMs in real-world scenarios. By combining causal analysis techniques with a carefully designed large language model assistant, our system can assess the causal behavior of models in various contexts without human annotation, which offers strong generalization and scalability. Additionally, we introduce multi-level causal evaluation metrics to provide a detailed analysis of the causal performance of VGMs. As a demonstration, we use our framework to benchmark several prevailing VGMs, offering insight into their causal reasoning capabilities. Our work lays the foundation for systematically addressing the causal understanding deficiencies in VGMs and contributes to advancing their reliability and real-world applicability.
- Research Report > New Finding (0.45)
- Research Report > Experimental Study (0.45)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)