Cheng, Ran
ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model
Zhou, Zhongyi, Zhu, Yichen, Zhu, Minjie, Wen, Junjie, Liu, Ning, Xu, Zhiyuan, Meng, Weibin, Cheng, Ran, Peng, Yaxin, Shen, Chaomin, Feng, Feifei
Humans possess a unified cognitive ability to perceive, comprehend, and interact with the physical world. Why can't large language models replicate this holistic understanding? Through a systematic analysis of existing training paradigms in vision-language-action (VLA) models, we identify two key challenges: spurious forgetting, where robot training overwrites crucial visual-text alignments, and task interference, where competing control and understanding tasks degrade performance when trained jointly. To overcome these limitations, we propose ChatVLA, a novel framework featuring Phased Alignment Training, which incrementally integrates multimodal data after initial control mastery, and a Mixture-of-Experts architecture to minimize task interference. ChatVLA demonstrates competitive performance on visual question-answering datasets and significantly surpasses state-of-the-art VLA methods on multimodal understanding benchmarks. Notably, it achieves six times higher performance on MMMU and scores 47.2% on MMStar with a more parameter-efficient design than ECoT. Furthermore, ChatVLA demonstrates superior performance on 25 real-world robot manipulation tasks compared to existing VLA methods like OpenVLA. Our findings highlight the potential of our unified framework for achieving both robust multimodal understanding and effective robot control.
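To make the Mixture-of-Experts idea concrete, below is a minimal PyTorch sketch of a two-expert layer, assuming one expert for multimodal understanding and one for control, mixed by a soft per-token router; the class and naming are illustrative, not ChatVLA's actual architecture.

```python
import torch
import torch.nn as nn

class TwoExpertMoE(nn.Module):
    """Minimal two-expert MoE block: a router softly mixes an
    'understanding' expert and a 'control' expert per token."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.understanding = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.control = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.router = nn.Linear(dim, 2)  # per-token gate logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); gate: (batch, tokens, 2)
        gate = torch.softmax(self.router(x), dim=-1)
        return (gate[..., 0:1] * self.understanding(x)
                + gate[..., 1:2] * self.control(x))

x = torch.randn(2, 16, 256)
print(TwoExpertMoE(256, 1024)(x).shape)  # torch.Size([2, 16, 256])
```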
EvoGP: A GPU-accelerated Framework for Tree-based Genetic Programming
Wang, Lishuang, Wu, Zhihong, Sun, Kebin, Li, Zhuozhao, Cheng, Ran
Tree-based Genetic Programming (TGP) is a key evolutionary algorithm widely used in symbolic regression, feature engineering, and scientific modeling. Its high computational demands make GPU acceleration essential for scalable and high-performance evolutionary computation. However, GPU acceleration of TGP faces three key challenges: inefficient tree encoding, highly heterogeneous genetic operations, and limited parallelism in fitness evaluation. To address these challenges, we introduce EvoGP, a comprehensive GPU-accelerated TGP framework. First, we design a tensorized encoding scheme that represents trees with different structures as tensors of the same shape, optimizing memory access and enabling efficient parallel execution. Second, we propose a unified parallel framework for genetic operations by leveraging shared computational primitives and implementing dedicated CUDA kernels for scalable performance. Third, we present a fully parallel fitness evaluation strategy for symbolic regression, exploiting both population-level and data-level parallelism to maximize GPU utilization. Moreover, we implement a comprehensive library providing rich algorithm operators and benchmark problems. EvoGP is extensively tested on various tasks, including symbolic regression, classification, and robotics control, demonstrating its versatility and effectiveness across diverse application scenarios. Experimental results show that EvoGP achieves up to a 140.89x speedup over the state-of-the-art GPU-based TGP implementation, while maintaining or exceeding the accuracy of baseline methods. EvoGP is open-source and accessible at: https://github.com/EMI-Group/evogp.
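The tensorized encoding is the central trick; here is a minimal CPU sketch of the general idea, assuming a prefix-order node array padded to a fixed length so an entire population stacks into one uniform tensor. The opcode layout and evaluator are assumptions for illustration, not EvoGP's actual CUDA kernels.

```python
import numpy as np

# Hypothetical fixed-shape encoding: each tree is a prefix-order array of
# node codes padded to MAX_NODES, so a population of P trees is a single
# (P, MAX_NODES) tensor that maps cleanly onto GPU threads.
# Codes: 0 = add, 1 = mul, 2 = variable x, 3 = constant, -1 = padding.
MAX_NODES = 7

def eval_prefix(tree, consts, x):
    """Evaluate one padded prefix-encoded tree on a batch of inputs x."""
    def rec(i):
        code = tree[i]
        if code == 2:                        # variable node
            return x, i + 1
        if code == 3:                        # constant node
            return np.full_like(x, consts[i]), i + 1
        left, j = rec(i + 1)                 # binary operator node
        right, k = rec(j)
        return (left + right) if code == 0 else (left * right), k
    return rec(0)[0]

# (x * x) + const  ->  prefix order: [+, *, x, x, const], padded with -1
tree = np.array([0, 1, 2, 2, 3, -1, -1])
consts = np.zeros(MAX_NODES)
consts[4] = 1.5
x = np.linspace(0.0, 1.0, 5)
print(eval_prefix(tree, consts, x))  # x**2 + 1.5
```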
ParetoLens: A Visual Analytics Framework for Exploring Solution Sets of Multi-objective Evolutionary Algorithms
Ma, Yuxin, Zhang, Zherui, Cheng, Ran, Jin, Yaochu, Tan, Kay Chen
In the domain of multi-objective optimization, evolutionary algorithms are distinguished by their capability to generate a diverse population of solutions that navigate the trade-offs inherent among competing objectives. This has catalyzed the ascension of evolutionary multi-objective optimization (EMO) as a prevalent approach. Despite the effectiveness of the EMO paradigm, the analysis of resultant solution sets presents considerable challenges. This is primarily attributed to the high-dimensional nature of the data and the constraints imposed by static visualization methods, which frequently culminate in visual clutter and impede interactive exploratory analysis. To address these challenges, this paper introduces ParetoLens, a visual analytics framework specifically tailored to enhance the inspection and exploration of solution sets derived from multi-objective evolutionary algorithms. Utilizing a modularized, algorithm-agnostic design, ParetoLens enables a detailed inspection of solution distributions in both decision and objective spaces through a suite of interactive visual representations. This approach not only mitigates the issues associated with static visualizations but also supports a more nuanced and flexible analysis process. The usability of the framework is evaluated through case studies and expert interviews, demonstrating its potential to uncover complex patterns and facilitate a deeper understanding of multi-objective optimization solution sets. A demo website of ParetoLens is available at https://dva-lab.org/paretolens/.
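As a point of reference for the kind of static view ParetoLens is designed to improve upon, the following sketch renders an objective-space solution set as a parallel-coordinates plot with matplotlib; the data here are random placeholders, not output from the framework.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy solution set for a 4-objective problem (random placeholder data;
# in practice this would be the output of an EMO algorithm run).
rng = np.random.default_rng(0)
objectives = rng.random((50, 4))

# Parallel coordinates: one polyline per solution across normalized axes.
# With many solutions, the overplotting below is exactly the visual
# clutter that interactive tools like ParetoLens aim to tame.
norm = (objectives - objectives.min(0)) / np.ptp(objectives, axis=0)
for row in norm:
    plt.plot(range(4), row, color="tab:blue", alpha=0.3)
plt.xticks(range(4), [f"f{i+1}" for i in range(4)])
plt.ylabel("normalized objective value")
plt.title("Objective-space view of a solution set")
plt.show()
```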
Improving Vision-Language-Action Models via Chain-of-Affordance
Li, Jinming, Zhu, Yichen, Tang, Zhibin, Wen, Junjie, Zhu, Minjie, Liu, Xiaoyu, Li, Chengmeng, Cheng, Ran, Peng, Yaxin, Feng, Feifei
Robot foundation models, particularly Vision-Language-Action (VLA) models, have garnered significant attention for their ability to enhance robot policy learning, greatly improving robot generalization and robustness. OpenAI's recent model, o1, showcased impressive capabilities in solving complex problems by utilizing extensive reasoning chains. This prompts an important question: can robot models achieve better performance in multi-task, complex environments by reviewing prior observations and then providing task-specific reasoning to guide action prediction? In this paper, we introduce \textbf{Chain-of-Affordance (CoA)}, a novel approach to scaling robot models by incorporating reasoning in the form of sequential robot affordances to facilitate task completion. Specifically, we prompt the model to consider the following four types of affordances before taking action: a) object affordance - what object to manipulate and where it is; b) grasp affordance - the specific object part to grasp; c) spatial affordance - the optimal space to place the object; and d) movement affordance - the collision-free path for movement. By integrating this knowledge into the policy model, the robot gains essential context, allowing it to act with increased precision and robustness during inference. Our experiments demonstrate that CoA outperforms state-of-the-art robot foundation models such as OpenVLA and Octo. Additionally, CoA shows strong generalization to unseen object poses, identifies free space, and avoids obstacles in novel environments.
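A rough sketch of how such a sequential affordance prompt might be assembled is shown below; the template and field names are assumptions for illustration, not the exact format used by CoA.

```python
# Hypothetical prompt assembly for chain-of-affordance reasoning. The four
# affordance slots mirror the abstract; the template text and field names
# are assumptions, not the paper's actual format.
AFFORDANCE_TEMPLATE = (
    "Task: {task}\n"
    "Object affordance: {object_affordance}\n"
    "Grasp affordance: {grasp_affordance}\n"
    "Spatial affordance: {spatial_affordance}\n"
    "Movement affordance: {movement_affordance}\n"
    "Action:"
)

def build_prompt(task: str, affordances: dict) -> str:
    """Fill the sequential affordance slots before asking for an action."""
    return AFFORDANCE_TEMPLATE.format(task=task, **affordances)

print(build_prompt(
    "place the cup on the shelf",
    {
        "object_affordance": "red cup at (0.42, 0.17)",
        "grasp_affordance": "handle, side grasp",
        "spatial_affordance": "free shelf region at (0.80, 0.35)",
        "movement_affordance": "arc above the bowl, no collisions",
    },
))
```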
SHIFT Planner: Speedy Hybrid Iterative Field and Segmented Trajectory Optimization with IKD-tree for Uniform Lightweight Coverage
Fan, Zexuan, Zhou, Sunchun, Yang, Hengye, Cai, Junyi, Cheng, Ran, Liu, Lige, Sun, Tao
This paper introduces a comprehensive planning and navigation framework that addresses the limitations of existing approaches by integrating semantic mapping, adaptive coverage planning, dynamic obstacle avoidance, and precise trajectory tracking. Our framework begins by generating local panoptic occupancy semantic maps and accurate localization information from data aligned across a monocular camera, IMU, and GPS. This information is combined with input terrain point clouds or preloaded terrain information to initialize the planning process. We propose the Radiant Field-Informed Coverage Planning algorithm, which utilizes a diffusion field model to dynamically adjust the robot's coverage trajectory and speed based on environmental attributes such as dirtiness and dryness. By modeling the spatial influence of the robot's actions as a Gaussian field, the algorithm ensures a speed-optimized, uniform coverage trajectory while adapting to varying environmental conditions.
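A minimal sketch of the Gaussian influence idea follows, assuming an isotropic Gaussian field around the robot that depletes a per-cell dirtiness map, plus a simple rule that slows the robot over dirtier cells; the parameters and update rule are illustrative, not the paper's exact formulation.

```python
import numpy as np

def gaussian_influence(grid_xy, robot_xy, sigma=0.5):
    """Spatial influence of a cleaning action as an isotropic Gaussian."""
    d2 = np.sum((grid_xy - robot_xy) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def speed_from_dirtiness(dirtiness, v_max=1.0, v_min=0.2):
    """Dirtier cells demand slower, more thorough passes."""
    return v_max - (v_max - v_min) * np.clip(dirtiness, 0.0, 1.0)

# 2D grid of cell centers and a per-cell dirtiness map (placeholder values).
xs, ys = np.meshgrid(np.linspace(0, 4, 50), np.linspace(0, 4, 50))
grid = np.stack([xs, ys], axis=-1)
dirtiness = np.full((50, 50), 0.3)

# One step: the robot's pass reduces dirtiness under its influence field,
# and its commanded speed adapts to the dirtiness at its position.
robot = np.array([2.0, 2.0])
dirtiness *= 1.0 - 0.8 * gaussian_influence(grid, robot)
print(speed_from_dirtiness(dirtiness[25, 25]))
```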
Scaling Diffusion Policy in Transformer to 1 Billion Parameters for Robotic Manipulation
Zhu, Minjie, Zhu, Yichen, Li, Jinming, Wen, Junjie, Xu, Zhiyuan, Liu, Ning, Cheng, Ran, Shen, Chaomin, Peng, Yaxin, Feng, Feifei, Tang, Jian
Diffusion Policy is a powerful technique for learning end-to-end visuomotor robot control. It is expected that Diffusion Policy possesses scalability, a key attribute of deep neural networks, typically suggesting that increasing model size would lead to enhanced performance. However, our observations indicate that Diffusion Policy in transformer architecture (\DP) struggles to scale effectively; even minor additions of layers can deteriorate training outcomes. To address this issue, we introduce the Scalable Diffusion Transformer Policy for visuomotor learning. Our proposed method, namely \textbf{\methodname}, introduces two modules that improve the training dynamics of Diffusion Policy and allow the network to better handle multimodal action distributions. First, we identify that \DP~suffers from large-gradient issues, making the optimization of Diffusion Policy unstable. To resolve this, we factorize the feature embedding of the observation into multiple affine layers and integrate them into the transformer blocks. Additionally, we utilize non-causal attention, which allows the policy network to \enquote{see} future actions during prediction, helping to reduce compounding errors. We demonstrate that our proposed method successfully scales Diffusion Policy from 10 million to 1 billion parameters, with improved performance and generalization. We benchmark \methodname~across 50 different tasks from MetaWorld and find that our largest \methodname~outperforms \DP~with an average improvement of 21.6\%. Across 7 real-world robot tasks, our ScaleDP demonstrates an average improvement of 36.25\% over DP-T on four single-arm tasks and 75\% on three bimanual tasks. We believe our work paves the way for scaling up models for visuomotor learning. The project page is available at scaling-diffusion-policy.github.io.
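Below is a minimal sketch of the two mechanisms the abstract names, assuming AdaLN-style affine (scale/shift) conditioning on the observation embedding inside each transformer block and unmasked (non-causal) self-attention over the action sequence; this is one interpretation for illustration, not the paper's exact block design.

```python
import torch
import torch.nn as nn

class AffineConditionedBlock(nn.Module):
    """Transformer block where the observation embedding enters through
    per-block affine (scale/shift) layers rather than token concatenation,
    and self-attention over the action sequence is non-causal."""
    def __init__(self, dim: int, heads: int, obs_dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Observation -> (scale, shift) pairs for the two sublayers.
        self.affine = nn.Linear(obs_dim, 4 * dim)

    def forward(self, actions: torch.Tensor, obs: torch.Tensor) -> torch.Tensor:
        s1, b1, s2, b2 = self.affine(obs).unsqueeze(1).chunk(4, dim=-1)
        h = self.norm1(actions) * (1 + s1) + b1
        # No attn_mask: every action token attends to all others,
        # including "future" actions in the predicted chunk.
        actions = actions + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(actions) * (1 + s2) + b2
        return actions + self.mlp(h)

blk = AffineConditionedBlock(dim=256, heads=8, obs_dim=512)
out = blk(torch.randn(2, 16, 256), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 16, 256])
```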
Voltage-Controlled Magnetoelectric Devices for Neuromorphic Diffusion Process
Cheng, Yang, Shu, Qingyuan, Lee, Albert, He, Haoran, Zhu, Ivy, Suhail, Haris, Chen, Minzhang, Chen, Renhe, Wang, Zirui, Zhang, Hantao, Wang, Chih-Yao, Yang, Shan-Yi, Hsin, Yu-Chen, Shih, Cheng-Yi, Lee, Hsin-Han, Cheng, Ran, Pamarti, Sudhakar, Kou, Xufeng, Wang, Kang L.
Stochastic diffusion processes are pervasive in nature, from the seemingly erratic Brownian motion to the complex interactions of synaptically coupled spiking neurons. Recently, drawing inspiration from Langevin dynamics, neuromorphic diffusion models were proposed and have become one of the major breakthroughs in the field of generative artificial intelligence. Unlike discriminative models, which are well developed for classification and regression tasks, diffusion models, like other generative models such as ChatGPT, aim to create content based upon learned contexts. However, the more complex algorithms of these models result in high computational costs on today's technologies, creating an efficiency bottleneck and impeding further development. Here, we develop a spintronic voltage-controlled magnetoelectric memory hardware for the neuromorphic diffusion process. The in-memory computing capability of our spintronic devices goes beyond the current von Neumann architecture, in which memory and computing units are separated. Together with the non-volatility of magnetic memory, this enables high-speed, low-cost computing, which is desirable for the increasing scale of generative models in the current era. We experimentally demonstrate that the hardware-based true random diffusion process can be implemented for image generation, achieving image quality comparable to software-based training as measured by the Fréchet inception distance (FID) score, with roughly 10^3 times better energy-per-bit-per-area than traditional hardware.
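For orientation, here is a software sketch of the Langevin-style update at the heart of such diffusion processes; in the hardware described above, the Gaussian noise term is what the stochastic magnetoelectric device would supply natively. The score function here is a toy placeholder, not anything from the paper.

```python
import numpy as np

def langevin_step(x, score, step=1e-2, rng=np.random.default_rng()):
    """One Langevin update: drift along the score plus injected noise.
    In the hardware setting, the Gaussian term would come from the
    device's true random switching rather than a software RNG."""
    noise = rng.standard_normal(x.shape)
    return x + step * score(x) + np.sqrt(2.0 * step) * noise

# Placeholder score of a standard Gaussian target: grad log p(x) = -x.
score = lambda x: -x
x = np.full(1000, 5.0)          # samples initialized far from the target
for _ in range(500):
    x = langevin_step(x, score)
print(x.mean(), x.std())        # approaches N(0, 1) statistics
```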
Benchmarking Neural Decoding Backbones towards Enhanced On-edge iBCI Applications
Zhou, Zhou, He, Guohang, Zhang, Zheng, Leng, Luziwei, Guo, Qinghai, Liao, Jianxing, Song, Xuan, Cheng, Ran
Traditional invasive Brain-Computer Interfaces (iBCIs) typically depend on neural decoding processes conducted on workstations within laboratory settings, which prevents everyday use. Implementing these decoding processes on edge devices, such as wearables, introduces considerable challenges related to computational demands, processing speed, and maintaining accuracy. This study seeks to identify an optimal neural decoding backbone that offers robust performance and swift inference suitable for edge deployment. We executed a series of neural decoding experiments involving nonhuman primates engaged in random reaching tasks, evaluating four candidate models, the Gated Recurrent Unit (GRU), Transformer, Receptance Weighted Key Value (RWKV), and Selective State Space model (Mamba), across several metrics: single-session decoding, multi-session decoding, new-session fine-tuning, inference speed, calibration speed, and scalability. The findings indicate that although the GRU model delivers sufficient accuracy, the RWKV and Mamba models are preferable due to their superior inference and calibration speeds. Additionally, RWKV and Mamba follow scaling laws, demonstrating improved performance with larger datasets and increased model sizes, whereas GRU shows less pronounced scalability and the Transformer model requires computational resources that scale prohibitively. This paper presents a thorough comparative analysis of the four models in various scenarios. The results are pivotal in pinpointing an optimal backbone that can handle increasing data volumes and is viable for edge implementation, providing essential insights for ongoing research and practical applications in the field.
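As a reference point, the GRU baseline in such decoding setups typically takes the form sketched below: binned spike counts in, per-bin kinematics out. The channel count and dimensions are illustrative, not the study's exact configuration.

```python
import torch
import torch.nn as nn

class GRUDecoder(nn.Module):
    """GRU baseline for iBCI decoding: binned spike counts in,
    2D cursor/hand velocity out, one prediction per time bin."""
    def __init__(self, n_channels: int, hidden: int = 128, out_dim: int = 2):
        super().__init__()
        self.gru = nn.GRU(n_channels, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, out_dim)

    def forward(self, spikes: torch.Tensor) -> torch.Tensor:
        # spikes: (batch, bins, channels) -> velocity: (batch, bins, 2)
        h, _ = self.gru(spikes)
        return self.readout(h)

decoder = GRUDecoder(n_channels=96)
spikes = torch.randint(0, 5, (8, 100, 96)).float()
print(decoder(spikes).shape)  # torch.Size([8, 100, 2])
```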
FR-NAS: Forward-and-Reverse Graph Predictor for Efficient Neural Architecture Search
Zhang, Haoming, Cheng, Ran
Neural Architecture Search (NAS) has emerged as a key tool for identifying optimal configurations of deep neural networks tailored to specific tasks. However, training and assessing numerous architectures introduces considerable computational overhead. One way to mitigate this is through performance predictors, which estimate the potential of an architecture without exhaustive training. Given that neural architectures fundamentally resemble Directed Acyclic Graphs (DAGs), Graph Neural Networks (GNNs) are a natural choice for such predictive tasks. Nevertheless, the scarcity of training data can impair the precision of GNN-based predictors. To address this, we introduce a novel GNN predictor for NAS. This predictor encodes neural architectures into vector representations by combining both the conventional (forward) and inverse (reverse) graph views. Additionally, we incorporate a customized training loss within the GNN predictor to ensure efficient utilization of both types of representations. We subsequently assess our method through experiments on benchmark datasets including NAS-Bench-101, NAS-Bench-201, and the DARTS search space, with training datasets ranging from 50 to 400 samples. Benchmarked against leading GNN predictors, the experimental results show a significant improvement in prediction accuracy, with a 3%--16% increase in Kendall-tau correlation. Source code is available at https://github.com/EMI-Group/fr-nas.
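A minimal sketch of the forward-and-reverse idea follows, assuming one round of mean-aggregation message passing along the original and the transposed adjacency, with the two graph embeddings concatenated for prediction; this illustrates the dual-view concept, not the paper's actual GNN.

```python
import torch
import torch.nn as nn

class ForwardReverseEncoder(nn.Module):
    """Encode a DAG twice, once along its edges and once along the
    reversed edges, then fuse the two views into one embedding."""
    def __init__(self, feat_dim: int, hidden: int):
        super().__init__()
        self.fwd = nn.Linear(feat_dim, hidden)
        self.rev = nn.Linear(feat_dim, hidden)
        self.head = nn.Linear(2 * hidden, 1)  # predicted accuracy score

    def propagate(self, adj, x, lin):
        # One round of mean aggregation along the given edge direction.
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        return torch.relu(lin(adj @ x / deg + x))

    def forward(self, adj: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        h_fwd = self.propagate(adj, x, self.fwd)                    # forward view
        h_rev = self.propagate(adj.transpose(-1, -2), x, self.rev)  # reverse view
        graph = torch.cat([h_fwd.mean(-2), h_rev.mean(-2)], dim=-1)
        return self.head(graph)

# A 4-node toy cell: adjacency (rows = edge sources) and one-hot op features.
adj = torch.tensor([[0, 1, 1, 0], [0, 0, 0, 1],
                    [0, 0, 0, 1], [0, 0, 0, 0]], dtype=torch.float)
ops = torch.eye(4)
print(ForwardReverseEncoder(4, 32)(adj, ops).shape)  # torch.Size([1])
```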
Generalized Population-Based Training for Hyperparameter Optimization in Reinforcement Learning
Bai, Hui, Cheng, Ran
Hyperparameter optimization plays a key role in machine learning. Its significance is especially pronounced in reinforcement learning (RL), where agents continuously interact with and adapt to their environments, requiring dynamic adjustments in their learning trajectories. To accommodate this dynamicity, Population-Based Training (PBT) was introduced, leveraging the collective intelligence of a population of agents learning simultaneously. However, PBT tends to favor high-performing agents, potentially neglecting the explorative potential of agents on the brink of significant advancements. To mitigate the limitations of PBT, we present Generalized Population-Based Training (GPBT), a refined framework designed for enhanced granularity and flexibility in hyperparameter adaptation. Complementing GPBT, we further introduce Pairwise Learning (PL). Instead of merely focusing on elite agents, PL employs a comprehensive pairwise strategy to identify performance differentials and provide holistic guidance to underperforming agents. By integrating the capabilities of GPBT and PL, our approach significantly improves upon traditional PBT in terms of adaptability and computational efficiency. Rigorous empirical evaluations across a range of RL benchmarks confirm that our approach consistently outperforms not only conventional PBT but also its Bayesian-optimized variant.
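A minimal sketch of a pairwise exploit step might look like the following, assuming each below-median agent inherits slightly perturbed hyperparameters from a randomly chosen better-performing partner; the selection and perturbation rules here are assumptions, not GPBT/PL's exact procedure.

```python
import random

def pairwise_exploit(population, perturb=0.2):
    """Pair each below-median agent with a better-performing partner and
    let it inherit the partner's hyperparameters, slightly perturbed.
    population: list of dicts with 'score' and 'hparams' (float values)."""
    ranked = sorted(population, key=lambda a: a["score"], reverse=True)
    top, bottom = ranked[: len(ranked) // 2], ranked[len(ranked) // 2 :]
    for agent in bottom:
        partner = random.choice(top)
        agent["hparams"] = {
            k: v * random.uniform(1 - perturb, 1 + perturb)
            for k, v in partner["hparams"].items()
        }
    return population

pop = [{"score": random.random(),
        "hparams": {"lr": 3e-4, "entropy_coef": 0.01}} for _ in range(8)]
pairwise_exploit(pop)
print(pop[0]["hparams"])
```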