 assembly



HA-ViD: A Human Assembly Video Dataset for Comprehensive Assembly Knowledge Understanding

Neural Information Processing Systems

Understanding comprehensive assembly knowledge from videos is critical for futuristic ultra-intelligent industry. To enable technological breakthrough, we present HA-ViD - the first human assembly video dataset that features representative industrial assembly scenarios, natural procedural knowledge acquisition process, and consistent human-robot shared annotations. Specifically, HA-ViD captures diverse collaboration patterns of real-world assembly, natural human behaviors and learning progression during assembly, and granulate action annotations to subject, action verb, manipulated object, target object, and tool. We provide 3222 multi-view and multi-modality videos, 1.5M frames, 96K temporal labels and 2M spatial labels. We benchmark four foundational video understanding tasks: action recognition, action segmentation, object detection and multi-object tracking. Importantly, we analyze their performance and the further reasoning steps for comprehending knowledge in assembly progress, process efficiency, task collaboration, skill parameters and human intention. Details of HA-ViD are available at: https://iai-hrc.github.io/ha-vid.
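
The five-part action annotation described above (subject, action verb, manipulated object, target object, tool) can be pictured as one record per temporal label. The following is a minimal sketch only: the field names, the frame-index segments, and the helper function are assumptions made for illustration, not the dataset's actual schema or tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ActionAnnotation:
    """One temporal label. Field names are illustrative, not the dataset's schema."""
    start_frame: int
    end_frame: int  # inclusive
    subject: str
    action_verb: str
    manipulated_object: str
    target_object: Optional[str] = None  # not every action has a target
    tool: Optional[str] = None           # not every action uses a tool

    def duration(self) -> int:
        """Number of frames covered by this label."""
        return self.end_frame - self.start_frame + 1


def total_frames_by_verb(annotations):
    """Sum the labelled frames for each action verb."""
    totals = {}
    for ann in annotations:
        totals[ann.action_verb] = totals.get(ann.action_verb, 0) + ann.duration()
    return totals
```

Keeping target object and tool optional reflects that some actions involve neither; whether the dataset encodes such cases with empty fields or a placeholder value is not stated in the abstract.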


IKEA Manuals at Work: 4D Grounding of Assembly Instructions on Internet Videos

Neural Information Processing Systems

Shape assembly is a ubiquitous task in daily life, integral for constructing complex 3D structures like IKEA furniture. While significant progress has been made in developing autonomous agents for shape assembly, existing datasets have not yet tackled the 4D grounding of assembly instructions in videos, essential for a holistic understanding of assembly in 3D space over time. We introduce IKEA Video Manuals, a dataset that features 3D models of furniture parts, instructional manuals, assembly videos from the Internet, and most importantly, annotations of dense spatio-temporal alignments between these data modalities. To demonstrate the utility of IKEA Video Manuals, we present five applications essential for shape assembly: assembly plan generation, part-conditioned segmentation, part-conditioned pose estimation, video object segmentation, and furniture assembly based on instructional video manuals. For each application, we provide evaluation metrics and baseline methods. Through experiments on our annotated data, we highlight many challenges in grounding assembly instructions in videos to improve shape assembly, including handling occlusions, varying viewpoints, and extended assembly sequences.


Zero-Shot Scene Reconstruction from Single Images with Deep Prior Assembly

Neural Information Processing Systems

Large language and vision models have been leading a revolution in visual computing. By greatly scaling up sizes of data and model parameters, the large models learn deep priors which lead to remarkable performance in various tasks.


IKEA-Manual: Seeing Shape Assembly Step by Step

Neural Information Processing Systems

Human-designed visual manuals are crucial components in shape assembly activities. They provide step-by-step guidance on how we should move and connect different parts in a convenient and physically-realizable way. While there has been an ongoing effort in building agents that perform assembly tasks, the information in human-designed manuals has been largely overlooked. We identify that this is due to 1) a lack of realistic 3D assembly objects that have paired manuals and 2) the difficulty of extracting structured information from purely image-based manuals. Motivated by this observation, we present IKEA-Manual, a dataset consisting of 102 IKEA objects paired with assembly manuals. We provide fine-grained annotations on the IKEA objects and assembly manuals, including decomposed assembly parts, assembly plans, manual segmentation, and 2D-3D correspondence between 3D parts and visual manuals. We illustrate the broad application of our dataset on four tasks related to shape assembly: assembly plan generation, part segmentation, pose estimation, and 3D part assembly.


Jigsaw: Learning to Assemble Multiple Fractured Objects

Neural Information Processing Systems

Automated assembly of 3D fractures is essential in orthopedics, archaeology, and our daily life. This paper presents Jigsaw, a novel framework for assembling physically broken 3D objects from multiple pieces. Our approach leverages hierarchical features of global and local geometry to match and align the fracture surfaces. Our framework consists of four components: (1) front-end point feature extractor with attention layers, (2) surface segmentation to separate fracture and original parts, (3) multi-parts matching to find correspondences among fracture surface points, and (4) robust global alignment to recover the global poses of the pieces. We show how to jointly learn segmentation and matching and seamlessly integrate feature matching and rigidity constraints. We evaluate Jigsaw on the Breaking Bad dataset and achieve superior performance compared to state-of-the-art methods.
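
The last component above, global alignment, recovers a rigid pose from matched points. As a rough illustration of that idea only, the sketch below fits a least-squares rigid transform to matched point pairs in 2D using the closed-form Procrustes solution. It is not Jigsaw's method: the paper works on 3D fracture surfaces with learned correspondences across multiple pieces, and the function names here are hypothetical.

```python
import math


def fit_rigid_2d(source, target):
    """Least-squares rotation angle and translation mapping source onto target.

    source, target: equal-length lists of matched (x, y) points.
    """
    n = len(source)
    sx = sum(p[0] for p in source) / n
    sy = sum(p[1] for p in source) / n
    tx = sum(q[0] for q in target) / n
    ty = sum(q[1] for q in target) / n

    # Accumulate cross and dot products of the centred point pairs.
    cross = 0.0
    dot = 0.0
    for (px, py), (qx, qy) in zip(source, target):
        ax, ay = px - sx, py - sy
        bx, by = qx - tx, qy - ty
        cross += ax * by - ay * bx
        dot += ax * bx + ay * by

    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # The translation moves the rotated source centroid onto the target centroid.
    shift = (tx - (c * sx - s * sy), ty - (s * sx + c * sy))
    return theta, shift


def apply_rigid_2d(points, theta, shift):
    """Rotate points by theta about the origin, then translate by shift."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1]) for x, y in points]
```

With noisy or partly wrong correspondences a plain least-squares fit degrades quickly, which is one reason a robust alignment stage is needed in practice.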


ShapeForce: Low-Cost Soft Robotic Wrist for Contact-Rich Manipulation

Zhu, Jinxuan, Yan, Zihao, Xiao, Yangyu, Guo, Jingxiang, Tie, Chenrui, Cao, Xinyi, Zheng, Yuhang, Shao, Lin

arXiv.org Artificial Intelligence

Contact feedback is essential for contact-rich robotic manipulation, as it allows the robot to detect subtle interaction changes and adjust its actions accordingly. Six-axis force-torque sensors are commonly used to obtain contact feedback, but their high cost and fragility have discouraged many researchers from adopting them in contact-rich tasks. To offer a more cost-efficient and easily accessible source of contact feedback, we present ShapeForce, a low-cost, plug-and-play soft wrist that provides force-like signals for contact-rich robotic manipulation. Inspired by how humans rely on relative force changes in contact rather than precise force magnitudes, ShapeForce converts external force and torque into measurable deformations of its compliant core, which are then estimated via marker-based pose tracking and converted into force-like signals. Our design eliminates the need for calibration or specialized electronics to obtain exact values, and instead focuses on capturing force and torque changes sufficient for enabling contact-rich manipulation. Extensive experiments across diverse contact-rich tasks and manipulation policies demonstrate that ShapeForce delivers performance comparable to six-axis force-torque sensors at an extremely low cost.
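
The idea of turning a tracked deformation into a force-like signal can be sketched as follows. This is an assumption-laden illustration, not the ShapeForce implementation: it assumes a six-value pose (position plus small roll, pitch, yaw angles), a linear mapping with made-up gains `k_lin` and `k_rot`, and a simple threshold on the change between consecutive signals. The abstract does not specify any of these details.

```python
from typing import Sequence, Tuple

# Pose as (x, y, z, roll, pitch, yaw); small angles assumed so that
# plain subtraction of angles is a reasonable approximation.
Pose = Tuple[float, float, float, float, float, float]


def force_like_signal(rest: Pose, current: Pose,
                      k_lin: float = 1.0, k_rot: float = 1.0) -> Tuple[float, ...]:
    """Scale the deformation relative to the rest pose into a six-value signal.

    k_lin and k_rot are hypothetical, uncalibrated gains: the output has no
    physical unit and is only meaningful through how it changes over time.
    """
    linear = [k_lin * (c - r) for c, r in zip(current[:3], rest[:3])]
    angular = [k_rot * (c - r) for c, r in zip(current[3:], rest[3:])]
    return tuple(linear + angular)


def contact_changed(previous: Sequence[float], latest: Sequence[float],
                    threshold: float) -> bool:
    """Flag a contact event when any signal component jumps by more than threshold."""
    return max(abs(b - a) for a, b in zip(previous, latest)) > threshold
```

Working with changes rather than absolute values mirrors the abstract's point that relative force changes, not exact magnitudes, are what the downstream policy relies on.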


Building Gradient by Gradient: Decentralised Energy Functions for Bimanual Robot Assembly

Mitchell, Alexander L., Watson, Joe, Posner, Ingmar

arXiv.org Artificial Intelligence

There are many challenges in bimanual assembly, including high-level sequencing, multi-robot coordination, and low-level, contact-rich operations such as component mating. Task and motion planning (TAMP) methods, while effective in this domain, may be prohibitively slow to converge when adapting to disturbances that require new task sequencing and optimisation. These events are common during tight-tolerance assembly, where difficult-to-model dynamics such as friction or deformation require rapid replanning and reattempts. Moreover, defining explicit task sequences for assembly can be cumbersome, limiting flexibility when task replanning is required. To simplify this planning, we introduce BGBG, a decentralised gradient-based framework that uses a piecewise continuous energy function through the automatic composition of adaptive potential functions. This approach generates sub-goals using only myopic optimisation, rather than long-horizon planning. It demonstrates effectiveness at solving long-horizon tasks due to the structure and adaptivity of the energy function. We show that our approach scales to physical bimanual assembly tasks for constructing tight-tolerance assemblies. In these experiments, we discover that our gradient-based rapid replanning framework generates automatic retries, coordinated motions and autonomous handovers in an emergent fashion. Bimanual assembly is an inherently sequential planning problem that demands reasoning over tasks and motions. The challenge is further amplified in contact-rich settings or when collaborating with humans, making efficient and robust planning essential for reliable execution.
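
To make the notion of composing potential functions and optimising myopically more concrete, here is a toy sketch. It assumes simple quadratic attractor potentials and a fixed step size, both chosen for illustration; BGBG's actual potentials, their adaptivity, and its piecewise structure are not described in enough detail in the abstract to reproduce, and all function names below are hypothetical.

```python
def quadratic_potential(goal, weight=1.0):
    """Return a potential pulling the state toward goal.

    The potential maps a state x to (value, gradient).
    """
    def potential(x):
        diff = [xi - gi for xi, gi in zip(x, goal)]
        value = 0.5 * weight * sum(d * d for d in diff)
        grad = [weight * d for d in diff]
        return value, grad
    return potential


def compose(potentials):
    """Sum several potentials into one energy function returning (value, gradient)."""
    def energy(x):
        total = 0.0
        grad = [0.0] * len(x)
        for potential in potentials:
            value, g = potential(x)
            total += value
            grad = [a + b for a, b in zip(grad, g)]
        return total, grad
    return energy


def myopic_step(energy, x, step_size=0.1):
    """Take one gradient-descent step using only the local gradient (no lookahead)."""
    _, grad = energy(x)
    return [xi - step_size * gi for xi, gi in zip(x, grad)]
```

In this toy setting the descent simply settles at the weighted average of the goals. The behaviours reported in the abstract (retries, handovers) depend on the energy function changing as the task progresses, which this sketch does not model.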


Designing for Distributed Heterogeneous Modularity: On Software Architecture and Deployment of MoonBots

Neppel, Elian, Karimov, Shamistan, Mishra, Ashutosh, Huenupan, Gustavo Hernan Diaz, Gozbasi, Hazal, Uno, Kentaro, Santra, Shreya, Yoshida, Kazuya

arXiv.org Artificial Intelligence

This paper presents the software architecture and deployment strategy behind the MoonBot platform: a modular space robotic system composed of heterogeneous components distributed across multiple computers, networks and ultimately celestial bodies. We introduce a principled approach to distributed, heterogeneous modularity, extending modular robotics beyond physical reconfiguration to software, communication and orchestration. We detail the architecture of our system that integrates component-based design, a data-oriented communication model using ROS2 and Zenoh, and a deployment orchestrator capable of managing complex multi-module assemblies. These abstractions enable dynamic reconfiguration, decentralized control, and seamless collaboration between numerous operators and modules. At the heart of this system lies our open-source Motion Stack software, validated by months of field deployment with self-assembling robots, inter-robot cooperation, and remote operation. Our architecture tackles the significant hurdles of modular robotics by significantly reducing integration and maintenance overhead, while remaining scalable and robust. Although tested with space in mind, we propose generalizable patterns for designing robotic systems that must scale across time, hardware, teams and operational environments.


Model of human cognition

Yonggang, Wu

arXiv.org Artificial Intelligence

Recently, there has been immense development in the field of artificial intelligence (AI) and computational neuroscience. Numerous architectures and models have been implemented in artificial systems to challenge human intelligence, especially with the release of increasingly proficient large language models (LLMs). However, despite advancements in LLMs, artificial systems still fall short in matching the human capacity for generalisation across diverse tasks and environments, making it an overstatement to label the current generations of LLMs as artificial general intelligence (AGI). We propose that in order to create artificial systems with high generalisation capabilities, one must first examine and understand the fundamentals of human cognition through conceptual models of the brain. This paper introduces a theoretical model of cognition that integrates biological plausibility and functionality, encapsulating the fundamental elements of cognition and accounting for many psychological and behavioural regularities. The model consists of four main modules: the visual processing module, the semantic module, the predictive module, and the executive module. The modules are discussed in chronological order, with each being affiliated with corresponding anatomical regions of the brain. Thereafter, the model is substantiated with real-world examples that reflect its general problem-solving capabilities.