Wang, Zhigang
COHERENT: Collaboration of Heterogeneous Multi-Robot System with Large Language Models
Liu, Kehui, Tang, Zixin, Wang, Dong, Wang, Zhigang, Zhao, Bin, Li, Xuelong
Leveraging the powerful reasoning capabilities of large language models (LLMs), recent LLM-based robot task planning methods yield promising results. However, they mainly focus on single robots or multiple homogeneous robots performing simple tasks. In practice, complex long-horizon tasks often require collaboration among multiple heterogeneous robots, whose more complex action spaces make these tasks even more challenging. To this end, we propose COHERENT, a novel LLM-based task planning framework for collaboration in heterogeneous multi-robot systems comprising quadrotors, robotic dogs, and robotic arms. Specifically, a Proposal-Execution-Feedback-Adjustment (PEFA) mechanism is designed to decompose and assign actions to individual robots: a centralized task assigner proposes a plan that decomposes the complex task into subtasks and then assigns the subtasks to robot executors. Each robot executor selects a feasible action to implement its assigned subtask and reports self-reflection feedback to the task assigner for plan adjustment. The PEFA loop repeats until the task is completed. Moreover, we create a challenging heterogeneous multi-robot task planning benchmark encompassing 100 complex long-horizon tasks. The experimental results show that our work surpasses previous methods by a large margin in terms of success rate and execution efficiency. The experimental videos, code, and benchmark are released at https://github.com/MrKeee/COHERENT.
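To make the PEFA cycle concrete, here is a minimal Python sketch of a propose-assign-execute-reflect loop. The callables (`propose_subtasks`, each robot's `act`, and `task_done`) are hypothetical stand-ins for the LLM-driven task assigner, the robot executors, and the completion check; this is an illustrative sketch, not the released COHERENT code.

```python
# Minimal sketch of a Proposal-Execution-Feedback-Adjustment (PEFA) loop.
# All callables are hypothetical stand-ins for LLM-backed components.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RobotExecutor:
    name: str                      # e.g. "quadrotor", "robotic_dog", "arm"
    act: Callable[[str], bool]     # tries one feasible action, True on success

    def run(self, subtask: str) -> str:
        ok = self.act(subtask)
        # Self-reflection feedback reported back to the task assigner.
        return f"{self.name}: {'done' if ok else 'failed'} -> {subtask}"


def pefa_loop(task: str,
              propose_subtasks: Callable[[str, List[str]], Dict[str, str]],
              executors: Dict[str, RobotExecutor],
              task_done: Callable[[List[str]], bool],
              max_rounds: int = 10) -> List[str]:
    """Loop: propose -> assign -> execute -> collect feedback -> adjust."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        # Centralized assigner decomposes the task given accumulated feedback.
        assignment = propose_subtasks(task, feedback)       # {robot: subtask}
        round_feedback = [executors[r].run(s) for r, s in assignment.items()]
        feedback.extend(round_feedback)
        if task_done(feedback):
            break
    return feedback
```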
Revolutionizing Battery Disassembly: The Design and Implementation of a Battery Disassembly Autonomous Mobile Manipulator Robot (BEAM-1)
Peng, Yanlong, Wang, Zhigang, Zhang, Yisheng, Zhang, Shengmin, Cai, Nan, Wu, Fan, Chen, Ming
The efficient disassembly of end-of-life electric vehicle batteries (EOL-EVBs) is crucial for green manufacturing and sustainable development. The pre-programmed disassembly currently performed by Autonomous Mobile Manipulator Robots (AMMRs) struggles to meet disassembly requirements in dynamic environments, complex scenarios, and unstructured processes. In this paper, we propose a Battery Disassembly AMMR (BEAM-1) system based on Neuro-Symbolic AI. It detects the environmental state by combining multi-sensors with neural predicates and then translates this information into a quasi-symbolic space. In real time, it identifies the optimal sequence of action primitives through LLM-heuristic tree search, ensuring high-precision execution of these primitives. Additionally, it employs positional speculative sampling using intuitive networks and achieves the disassembly of various bolt types with a meticulously designed end-effector. Importantly, BEAM-1 is a continuously learning embodied intelligence system capable of subjective reasoning like a human, and possessing intuition. Extensive real-scene experiments show that it can autonomously perceive, decide, and execute the continuous disassembly of bolts in multiple, multi-category, and complex situations, with a success rate of 98.78%. This research attempts to use Neuro-Symbolic AI to give robots genuine autonomous reasoning, planning, and learning capabilities. BEAM-1 revolutionizes battery disassembly, and its framework can be readily ported to any robotic system to realize different application scenarios, providing a ground-breaking idea for the design and implementation of future embodied intelligent robotic systems.
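As a rough illustration of the planning step described above, the sketch below runs a best-first search over a quasi-symbolic state, where a hypothetical `llm_score` function plays the role of the LLM heuristic and `primitives` stand in for the robot's action primitives; it is an assumption-laden sketch, not BEAM-1's implementation.

```python
# Hypothetical sketch of choosing an action-primitive sequence by heuristic search
# over a quasi-symbolic state; llm_score stands in for an LLM heuristic and the
# predicates in State for multi-sensor neural predicates.
import heapq
from typing import Callable, Dict, FrozenSet, List, Tuple

State = FrozenSet[str]  # e.g. frozenset({"bolt_located", "tool_free"})

def plan(start: State,
         primitives: Dict[str, Callable[[State], State]],
         goal: Callable[[State], bool],
         llm_score: Callable[[State, str], float],
         max_expansions: int = 1000) -> List[str]:
    """Best-first search: a lower llm_score means the primitive looks more promising."""
    frontier: List[Tuple[float, int, State, List[str]]] = [(0.0, 0, start, [])]
    tie = 0
    seen = {start}
    while frontier and max_expansions > 0:
        cost, _, state, seq = heapq.heappop(frontier)
        if goal(state):
            return seq
        max_expansions -= 1
        for name, apply_primitive in primitives.items():
            nxt = apply_primitive(state)
            if nxt in seen:
                continue
            seen.add(nxt)
            tie += 1
            heapq.heappush(frontier,
                           (cost + llm_score(state, name), tie, nxt, seq + [name]))
    return []  # no plan found within the expansion budget
```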
What's Next? Exploring Utilization, Challenges, and Future Directions of AI-Generated Image Tools in Graphic Design
Tang, Yuying, Ciancia, Mariana, Wang, Zhigang, Gao, Ze
Recent advancements in artificial intelligence, such as computer vision and deep learning, have led to the emergence of numerous generative AI platforms, particularly for image generation. However, the application of AI-generated image tools in graphic design has not been extensively explored. This study conducted semi-structured interviews with seven designers of varying experience levels to understand their current usage, challenges, and future functional needs for AI-generated image tools in graphic design. As our findings suggest, AI tools serve as creative partners in design, enhancing human creativity, offering strategic insights, and fostering team collaboration and communication. The findings provide guiding recommendations for the future development of AI-generated image tools, aimed at helping engineers optimize these tools to better meet the needs of graphic designers.
Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding
Tang, Yiwen, Zhang, Ray, Liu, Jiaming, Guo, Zoey, Wang, Dong, Wang, Zhigang, Zhao, Bin, Zhang, Shanghang, Gao, Peng, Li, Hongsheng, Li, Xuelong
Large foundation models have recently emerged as a prominent focus of interest, attaining superior performance across a wide range of scenarios. Due to the scarcity of 3D data, many efforts have been made to adapt pre-trained transformers from vision to 3D domains. However, such 2D-to-3D approaches remain limited by the potential loss of spatial geometry and high computation cost. More importantly, their frameworks are mainly designed for 2D models and lack a general any-to-3D paradigm. In this paper, we introduce Any2Point, a parameter-efficient method that empowers any-modality large models (vision, language, audio) for 3D understanding. Given a frozen transformer from any source modality, we propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points with the original 1D or 2D positions of the source modality. This mechanism enables us to assign each 3D token a positional encoding paired with the pre-trained model, which avoids the 3D geometry loss caused by true projection and better guides the transformer toward 3D learning with 1D/2D positional priors. Then, within each transformer block, we insert an any-to-3D guided adapter module for parameter-efficient fine-tuning. The adapter incorporates prior spatial knowledge from the source modality to guide the local feature aggregation of 3D tokens, enabling the semantic adaptation of any-modality transformers. We conduct extensive experiments to showcase the effectiveness and efficiency of our method. Code and models are released at https://github.com/Ivan-Tang-3D/Any2Point.
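A toy version of the 3D-to-2D branch of the virtual projection idea is sketched below: each 3D token is mapped, without any rendering, to a cell of a frozen 2D model's positional-embedding grid and inherits that cell's encoding. The orthographic projection, the 14x14 grid, and the random embedding table are illustrative assumptions, not the paper's released code.

```python
# Sketch of a 3D-to-2D "virtual projection" that assigns each 3D token a
# positional encoding borrowed from a frozen 2D transformer. The view choice,
# grid size, and embedding table are illustrative assumptions.
import torch

def virtual_2d_pos_embed(points: torch.Tensor,      # (N, 3) centers of 3D tokens
                         pos_table: torch.Tensor,   # (H*W, C) frozen 2D pos. embeddings
                         grid_hw: int = 14) -> torch.Tensor:
    """Project points onto a virtual image plane (no rendering), then index
    the pre-trained 2D positional-embedding table at the projected cells."""
    # Orthographic projection onto the xy-plane of a virtual camera.
    xy = points[:, :2]
    # Normalize coordinates into [0, 1] of the virtual image.
    xy = (xy - xy.min(dim=0).values) / (xy.max(dim=0).values - xy.min(dim=0).values + 1e-6)
    # Map to patch-grid indices of the frozen 2D model.
    ij = (xy * (grid_hw - 1)).round().long()          # (N, 2)
    flat = ij[:, 1] * grid_hw + ij[:, 0]              # row-major cell index
    return pos_table[flat]                            # (N, C) positional encodings

# Example: 1024 points, a 14x14 ViT-style positional table with 768 channels.
pts = torch.rand(1024, 3)
table = torch.randn(14 * 14, 768)
pe = virtual_2d_pos_embed(pts, table)   # one borrowed encoding per 3D token
```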
SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation
Zhang, Junjie, Bai, Chenjia, He, Haoran, Xia, Wenke, Wang, Zhigang, Zhao, Bin, Li, Xiu, Li, Xuelong
Acquiring a multi-task imitation policy in 3D manipulation poses challenges in terms of scene understanding and action prediction. Current methods employ both 3D representations and multi-view 2D representations to predict the poses of the robot's end-effector. However, they still require a considerable amount of high-quality robot trajectories, and suffer from limited generalization to unseen tasks and inefficient execution in long-horizon reasoning. In this paper, we propose SAM-E, a novel architecture for robot manipulation that leverages a visual foundation model for generalizable scene understanding and sequence imitation for long-term action reasoning. Specifically, we adopt Segment Anything (SAM), pre-trained on a huge number of images with promptable masks, as the foundation model for extracting task-relevant features, and employ parameter-efficient fine-tuning on robot data for a better understanding of embodied scenarios. To address long-horizon reasoning, we develop a novel multi-channel heatmap that enables prediction of the action sequence in a single pass, notably enhancing execution efficiency. Experimental results on various instruction-following tasks demonstrate that SAM-E achieves superior performance with higher execution efficiency compared to the baselines, and also significantly improves generalization in few-shot adaptation to new tasks.
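To picture the multi-channel heatmap idea, the hedged PyTorch sketch below predicts one heatmap channel per future timestep from a single feature map and decodes every channel's peak in one forward pass; the layer sizes and the simple argmax decoding are assumptions for illustration, not SAM-E's actual head.

```python
# Illustrative sketch of a multi-channel heatmap head: one channel per future
# timestep, decoded to a sequence of 2D targets in a single pass.
import torch
import torch.nn as nn

class SequenceHeatmapHead(nn.Module):
    def __init__(self, in_channels: int = 256, horizon: int = 8):
        super().__init__()
        self.horizon = horizon
        # Predict `horizon` heatmaps from one feature map in one forward pass.
        self.head = nn.Conv2d(in_channels, horizon, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) image features from the fine-tuned backbone.
        heatmaps = self.head(feats)                       # (B, horizon, H, W)
        B, T, H, W = heatmaps.shape
        flat = heatmaps.flatten(2).argmax(dim=-1)         # (B, T) peak per timestep
        # Decode each channel's peak into (x, y) pixel targets for the sequence.
        xs = flat % W
        ys = torch.div(flat, W, rounding_mode="floor")
        return torch.stack([xs, ys], dim=-1)              # (B, T, 2)

head = SequenceHeatmapHead()
targets = head(torch.randn(2, 256, 32, 32))   # (2, 8, 2): 8 future 2D targets per sample
```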
Towards Agile Robots: Intuitive Robot Position Speculation with Neural Networks
Peng, Yanlong, Wang, Zhigang, Zhang, Yisheng, Zhang, Shengmin, Chen, Ming
Robot position speculation, which determines where the chassis should move, is a key step in controlling mobile manipulators. The target position must ensure both the feasibility of chassis movement and manipulability, which traditional methods guarantee through randomized sampling and kinematic checking. Addressing the demands of agile robotics, this paper proposes a Robot Position Speculation Network (RPSN), a learning-based approach to enhance the agility of mobile manipulators. The RPSN incorporates a differentiable inverse kinematics algorithm and a neural network. Through end-to-end training, the RPSN can speculate positions with a high success rate. We apply the RPSN to mobile manipulators disassembling end-of-life electric vehicle batteries (EOL-EVBs). Extensive experiments in various simulated environments and on physical mobile manipulators demonstrate that the initial position provided by the RPSN is the ideal position with a probability of 96.67%. From the kinematic-constraint perspective, it achieves 100% generation of the ideal position within 1.28 attempts on average, far fewer than the 31.04 attempts required by random sampling. Moreover, the proposed method demonstrates superior data efficiency over pure neural network approaches. The proposed RPSN enables the robot to quickly infer feasible target positions by intuition. This work moves towards building agile robots that can act swiftly like humans.
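A minimal sketch of the end-to-end idea follows: a small network speculates a chassis position, and a differentiable reachability term, a simplified stand-in for the paper's differentiable inverse-kinematics algorithm, provides the training signal. The network sizes and reach limits are illustrative assumptions, not the RPSN implementation.

```python
# Toy end-to-end training loop: network maps a task target to a chassis pose,
# shaped by a differentiable reachability surrogate for an IK feasibility check.
import torch
import torch.nn as nn

class PositionSpeculator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, 64), nn.ReLU(),
                                 nn.Linear(64, 2))          # chassis (x, y)

    def forward(self, target_xyz: torch.Tensor) -> torch.Tensor:
        return self.net(target_xyz)

def reachability_loss(base_xy: torch.Tensor, target_xyz: torch.Tensor,
                      r_min: float = 0.3, r_max: float = 0.9) -> torch.Tensor:
    """Penalize base poses from which the target lies outside the arm's
    reachable annulus [r_min, r_max]; differentiable, so gradients reach the net."""
    dist = torch.linalg.norm(target_xyz[:, :2] - base_xy, dim=-1)
    too_far = torch.relu(dist - r_max)
    too_close = torch.relu(r_min - dist)
    return (too_far + too_close).mean()

model = PositionSpeculator()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):                       # toy training loop
    targets = torch.rand(64, 3)            # stand-in bolt positions in the workspace
    loss = reachability_loss(model(targets), targets)
    opt.zero_grad(); loss.backward(); opt.step()
```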
LR-CNN: Lightweight Row-centric Convolutional Neural Network Training for Memory Reduction
Wang, Zhigang, Yang, Hangyu, Wang, Ning, Xu, Chuanfei, Nie, Jie, Wei, Zhiqiang, Gu, Yu, Yu, Ge
In the last decade, Convolutional Neural Networks with multi-layer architectures have advanced rapidly. However, training these complex networks is very memory-consuming, since a large amount of intermediate data is preserved across layers, especially when processing high-dimensional inputs with a large batch size. This poses great challenges to the limited memory capacity of current accelerators (e.g., GPUs). Existing efforts mitigate this bottleneck either with external auxiliary solutions that incur additional hardware costs, or with internal modifications that risk accuracy penalties. In contrast, our analysis reveals that intra- and inter-layer computations exhibit weak spatial-temporal dependency and, in some cases, complete independence. This inspires us to break the traditional layer-by-layer (column) dataflow rule: operations are instead re-organized into rows that span all convolution layers. This lightweight design allows a majority of intermediate data to be discarded without any loss of accuracy. We particularly study the weak dependency between two consecutive rows, and for the resulting skewed memory consumption we provide two solutions suited to different scenarios. Evaluations on two representative networks confirm the effectiveness of our approach. We also validate that our middle dataflow optimization can be smoothly adopted by existing works for further memory reduction.
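The toy NumPy sketch below illustrates the row-centric dataflow on two stacked 3x3 convolutions: one output row of the second layer is produced from only a three-row band of first-layer intermediates, and it matches the layer-by-layer result exactly. The kernels, shapes, and single-channel setting are arbitrary; this is a conceptual demonstration, not the paper's training system.

```python
# Row-centric vs. layer-by-layer dataflow on two stacked "valid" 3x3 convolutions.
import numpy as np

def conv_rows(x: np.ndarray, k: np.ndarray, rows: range) -> np.ndarray:
    """'Valid' 2D convolution, computed only for the requested output rows."""
    kh, kw = k.shape
    out_w = x.shape[1] - kw + 1
    out = np.empty((len(rows), out_w))
    for i, r in enumerate(rows):
        for c in range(out_w):
            out[i, c] = np.sum(x[r:r + kh, c:c + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64))
k1, k2 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))

# Layer-by-layer ("column") dataflow: the full layer-1 map stays in memory.
full_l1 = conv_rows(x, k1, range(62))
full_l2 = conv_rows(full_l1, k2, range(60))

# Row-centric dataflow: output row 10 of layer 2 needs only layer-1 rows 10..12,
# which in turn need only input rows 10..14 -- a three-row intermediate band.
band_l1 = conv_rows(x, k1, range(10, 13))
row_l2 = conv_rows(band_l1, k2, range(1))    # single output row from the band

assert np.allclose(row_l2[0], full_l2[10])   # same result, far less intermediate data
```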
Point-PEFT: Parameter-Efficient Fine-Tuning for 3D Pre-trained Models
Tang, Yiwen, Zhang, Ray, Guo, Zoey, Ma, Xianzheng, Wang, Dong, Wang, Zhigang, Zhao, Bin, Li, Xuelong
The popularity of pre-trained large models has revolutionized downstream tasks across diverse fields, such as language, vision, and multi-modality. To minimize the adaptation cost for downstream tasks, many Parameter-Efficient Fine-Tuning (PEFT) techniques have been proposed for language and 2D image pre-trained models. However, specialized PEFT methods for 3D pre-trained models remain under-explored. To this end, we introduce Point-PEFT, a novel framework for adapting point cloud pre-trained models with minimal learnable parameters. Specifically, for a pre-trained 3D model, we freeze most of its parameters and tune only the newly added PEFT modules on downstream tasks, which consist of a Point-prior Prompt and a Geometry-aware Adapter. The Point-prior Prompt adopts a set of learnable prompt tokens, for which we construct a memory bank of domain-specific knowledge and use a parameter-free attention mechanism to enhance the prompt tokens. The Geometry-aware Adapter aggregates point cloud features within spatial neighborhoods to capture fine-grained geometric information through local interactions. Extensive experiments indicate that Point-PEFT achieves better performance than full fine-tuning on various downstream tasks while using only 5% of the trainable parameters, demonstrating the efficiency and effectiveness of our approach. Code is released at https://github.com/Ivan-Tang-3D/PEFT-3D.
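As an illustration of the adapter's local aggregation, here is a hedged PyTorch sketch: a small bottleneck module that mean-pools each point token's features over its k nearest spatial neighbors and adds the result back residually, while the backbone stays frozen. The dimensions, k, and pooling choice are assumptions, not the released Point-PEFT module.

```python
# Sketch of a geometry-aware adapter: k-NN neighborhood pooling + residual bottleneck.
import torch
import torch.nn as nn

class GeometryAwareAdapter(nn.Module):
    def __init__(self, dim: int = 384, bottleneck: int = 16, k: int = 8):
        super().__init__()
        self.k = k
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, coords: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # coords: (B, N, 3) token centers; feats: (B, N, C) frozen-block outputs.
        dists = torch.cdist(coords, coords)                   # (B, N, N)
        knn = dists.topk(self.k, largest=False).indices       # (B, N, k) nearest tokens
        B, N, C = feats.shape
        # Gather and mean-pool features of each token's spatial neighborhood.
        idx = knn.reshape(B, N * self.k, 1).expand(-1, -1, C)
        neigh = torch.gather(feats, 1, idx).reshape(B, N, self.k, C).mean(dim=2)
        # Residual bottleneck keeps the number of trainable parameters small.
        return feats + self.up(torch.relu(self.down(neigh)))

adapter = GeometryAwareAdapter()
out = adapter(torch.rand(2, 128, 3), torch.randn(2, 128, 384))   # (2, 128, 384)
```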
Calibration-free quantitative phase imaging in multi-core fiber endoscopes using end-to-end deep learning
Sun, Jiawei, Zhao, Bin, Wang, Dong, Wang, Zhigang, Zhang, Jie, Koukourakis, Nektarios, Czarske, Juergen W., Li, Xuelong
Fiber endoscopes have emerged as a vital tool for high-resolution microscopic imaging in hard-to-reach areas. In contrast to conventional endoscopes with a typical diameter of several millimeters, fiber endoscopes, which can be submillimeter-thin and flexible [1-5], can pass through the organ's intricate pathways without causing harm inside the body [6], making them particularly suitable for procedures requiring utmost precision and minimal invasiveness. The reduced size and adaptability of fiber endoscopes ensure less discomfort for the patient, leading to quicker recovery times and a lower risk of [...]. Recent advancements have adopted deep learning techniques to expedite the QPI image reconstruction process [12, 13]. Moreover, extant literature indicates the potential of decrypting an encoded phase directly from speckle images utilizing deep learning, although only in simulated environments [14]. This demonstrates the theoretical possibility of reconstructing the original phase directly from speckle images using deep learning for MCF phase imaging; however, networks trained on simulated data can hardly achieve accurate phase reconstructions in real-world optical systems.
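To make the end-to-end direction concrete, the sketch below trains a plain convolutional network to map a speckle image directly to a phase map with a pixel-wise loss; the architecture, sizes, and random tensors are placeholders for illustration and are not the paper's model or data.

```python
# Minimal sketch: a convolutional network maps an MCF speckle image directly to a
# phase map, trained with a pixel-wise MSE loss. Everything here is a placeholder.
import torch
import torch.nn as nn

speckle_to_phase = nn.Sequential(          # speckle (1 channel) -> phase (1 channel)
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1),
)

opt = torch.optim.Adam(speckle_to_phase.parameters(), lr=1e-3)
speckle = torch.rand(8, 1, 128, 128)       # stand-in for measured speckle patterns
phase_gt = torch.rand(8, 1, 128, 128)      # stand-in for ground-truth phase maps
loss = nn.functional.mse_loss(speckle_to_phase(speckle), phase_gt)
opt.zero_grad(); loss.backward(); opt.step()   # one toy optimization step
```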
ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance
Guo, Zoey, Tang, Yiwen, Zhang, Ray, Wang, Dong, Wang, Zhigang, Zhao, Bin, Li, Xuelong
Understanding 3D scenes from multi-view inputs has been proven to alleviate the view discrepancy issue in 3D visual grounding. However, existing methods normally neglect the view cues embedded in the text modality and fail to weigh the relative importance of different views. In this paper, we propose ViewRefer, a multi-view framework for 3D visual grounding exploring how to grasp the view knowledge from both text and 3D modalities. For the text branch, ViewRefer leverages the diverse linguistic knowledge of large-scale language models, e.g., GPT, to expand a single grounding text to multiple geometry-consistent descriptions. Meanwhile, in the 3D modality, a transformer fusion module with inter-view attention is introduced to boost the interaction of objects across views. On top of that, we further present a set of learnable multi-view prototypes, which memorize scene-agnostic knowledge for different views, and enhance the framework from two perspectives: a view-guided attention module for more robust text features, and a view-guided scoring strategy during the final prediction. With our designed paradigm, ViewRefer achieves superior performance on three benchmarks and surpasses the second-best by +2.8%, +1.5%, and +1.35% on Sr3D, Nr3D, and ScanRefer. Code is released at https://github.com/Ivan-Tang-3D/ViewRefer3D.
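The inter-view attention can be pictured with the hedged PyTorch sketch below, in which each object's per-view features attend across views through a standard multi-head attention layer; the module name, dimensions, and residual normalization are illustrative assumptions rather than ViewRefer's released fusion module.

```python
# Sketch of inter-view attention: each object's per-view features attend to the
# same object's features under the other views.
import torch
import torch.nn as nn

class InterViewFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, obj_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats: (B, V, O, C) = batch, views, objects, channels.
        B, V, O, C = obj_feats.shape
        # Treat the V views of each object as a short sequence to attend over.
        x = obj_feats.permute(0, 2, 1, 3).reshape(B * O, V, C)
        fused, _ = self.attn(x, x, x)
        fused = self.norm(x + fused).reshape(B, O, V, C).permute(0, 2, 1, 3)
        return fused                                   # (B, V, O, C)

fusion = InterViewFusion()
out = fusion(torch.randn(2, 4, 32, 256))   # 4 views, 32 object proposals per scene
```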