Yang, Chao
Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models
Zhou, Zhanhui, Liu, Zhixuan, Liu, Jie, Dong, Zhichen, Yang, Chao, Qiao, Yu
Large language models are usually fine-tuned to align with human preferences. However, fine-tuning a large language model can be challenging. In this work, we introduce weak-to-strong search, framing the alignment of a large language model as a test-time greedy search to maximize the log-likelihood difference between small tuned and untuned models while sampling from the frozen large model. This method serves both as (i) a compute-efficient model up-scaling strategy that avoids directly tuning the large model and as (ii) an instance of weak-to-strong generalization that enhances a strong model with weak test-time guidance. Empirically, we demonstrate the flexibility of weak-to-strong search across different tasks. In controlled-sentiment generation and summarization, we use tuned and untuned gpt2s to effectively improve the alignment of large models without additional training. Crucially, in a more difficult instruction-following benchmark, AlpacaEval 2.0, we show that reusing off-the-shelf small models (e.g., zephyr-7b-beta and its untuned version) can significantly improve the length-controlled win rates of both white-box and black-box large models against gpt-4-turbo (e.g., 34.4 37.9 for Llama-3-70B-Instruct and 16.0 20.1 for gpt-3.5-turbo-instruct),
Discovering Physics-Informed Neural Networks Model for Solving Partial Differential Equations through Evolutionary Computation
Zhang, Bo, Yang, Chao
In recent years, the researches about solving partial differential equations (PDEs) based on artificial neural network have attracted considerable attention. In these researches, the neural network models are usually designed depend on human experience or trial and error. Despite the emergence of several model searching methods, these methods primarily concentrate on optimizing the hyperparameters of fully connected neural network model based on the framework of physics-informed neural networks (PINNs), and the corresponding search spaces are relatively restricted, thereby limiting the exploration of superior models. This article proposes an evolutionary computation method aimed at discovering the PINNs model with higher approximation accuracy and faster convergence rate. In addition to searching the numbers of layers and neurons per hidden layer, this method concurrently explores the optimal shortcut connections between the layers and the novel parametric activation functions expressed by the binary trees. In evolution, the strategy about dynamic population size and training epochs (DPSTE) is adopted, which significantly increases the number of models to be explored and facilitates the discovery of models with fast convergence rate. In experiments, the performance of different models that are searched through Bayesian optimization, random search and evolution is compared in solving Klein-Gordon, Burgers, and Lam\'e equations. The experimental results affirm that the models discovered by the proposed evolutionary computation method generally exhibit superior approximation accuracy and convergence rate, and these models also show commendable generalization performance with respect to the source term, initial and boundary conditions, equation coefficient and computational domain. The corresponding code is available at https://github.com/MathBon/Discover-PINNs-Model.
U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF
Song, Xingchen, Wu, Di, Zhang, Binbin, Zhou, Dinghao, Peng, Zhendong, Dang, Bo, Pan, Fuping, Yang, Chao
Scale has opened new frontiers in natural language processing, but at a high cost. In response, by learning to only activate a subset of parameters in training and inference, Mixture-of-Experts (MoE) have been proposed as an energy efficient path to even larger and more capable language models and this shift towards a new generation of foundation models is gaining momentum, particularly within the field of Automatic Speech Recognition (ASR). Recent works that incorporating MoE into ASR models have complex designs such as routing frames via supplementary embedding network, improving multilingual ability for the experts, and utilizing dedicated auxiliary losses for either expert load balancing or specific language handling. We found that delicate designs are not necessary, while an embarrassingly simple substitution of MoE layers for all Feed-Forward Network (FFN) layers is competent for the ASR task. To be more specific, we benchmark our proposed model on a large scale inner-source dataset (160k hours), the results show that we can scale our baseline Conformer (Dense-225M) to its MoE counterparts (MoE-1B) and achieve Dense-1B level Word Error Rate (WER) while maintaining a Dense-225M level Real Time Factor (RTF). Furthermore, by applying Unified 2-pass framework with bidirectional attention decoders (U2++), we achieve the streaming and non-streaming decoding modes in a single MoE based model, which we call U2++ MoE. We hope that our study can facilitate the research on scaling speech foundation models without sacrificing deployment efficiency.
LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction
Zou, Bo, Yang, Chao, Qiao, Yu, Quan, Chengbin, Zhao, Youjian
Existing methods to fine-tune LLMs, like Adapter, Prefix-tuning, and LoRA, which introduce extra modules or additional input sequences to inject new skills or knowledge, may compromise the innate abilities of LLMs. In this paper, we propose LLaMA-Excitor, a lightweight method that stimulates the LLMs' potential to better follow instructions by gradually paying more attention to worthwhile information. Specifically, the LLaMA-Excitor does not directly change the intermediate hidden state during the self-attention calculation of the transformer structure. We designed the Excitor block as a bypass module for the similarity score computation in LLMs' self-attention to reconstruct keys and change the importance of values by learnable prompts. LLaMA-Excitor ensures a self-adaptive allocation of additional attention to input instructions, thus effectively preserving LLMs' pre-trained knowledge when fine-tuning LLMs on low-quality instruction-following datasets. Furthermore, we unify the modeling of multi-modal tuning and language-only tuning, extending LLaMA-Excitor to a powerful visual instruction follower without the need for complex multi-modal alignment. Our proposed approach is evaluated in language-only and multi-modal tuning experimental scenarios. Notably, LLaMA-Excitor is the only method that maintains basic capabilities while achieving a significant improvement (+6%) on the MMLU benchmark. In the visual instruction tuning, we achieve a new state-of-the-art image captioning performance of 157.5 CIDEr on MSCOCO, and a comparable performance (88.39%) on ScienceQA to cutting-edge models with more parameters and extensive vision-language pertaining.
Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey
Dong, Zhichen, Zhou, Zhanhui, Yang, Chao, Shao, Jing, Qiao, Yu
Large Language Models (LLMs) are now commonplace in conversation applications. However, their risks of misuse for generating harmful responses have raised serious societal concerns and spurred recent research on LLM conversation safety. Therefore, in this survey, we provide a comprehensive overview of recent studies, covering three critical aspects of LLM conversation safety: attacks, defenses, and evaluations. Our goal is to provide a structured summary that enhances understanding of LLM conversation safety and encourages further investigation into this important subject. For easy reference, we have categorized all the studies mentioned in this survey according to our taxonomy, available at: https://github.com/niconi19/LLM-conversation-safety.
GaussianGrasper: 3D Language Gaussian Splatting for Open-vocabulary Robotic Grasping
Zheng, Yuhang, Chen, Xiangyu, Zheng, Yupeng, Gu, Songen, Yang, Runyi, Jin, Bu, Li, Pengfei, Zhong, Chengliang, Wang, Zengmao, Liu, Lina, Yang, Chao, Wang, Dawei, Chen, Zhen, Long, Xiaoxiao, Wang, Meiqing
Constructing a 3D scene capable of accommodating open-ended language queries, is a pivotal pursuit, particularly within the domain of robotics. Such technology facilitates robots in executing object manipulations based on human language directives. To tackle this challenge, some research efforts have been dedicated to the development of language-embedded implicit fields. However, implicit fields (e.g. NeRF) encounter limitations due to the necessity of processing a large number of input views for reconstruction, coupled with their inherent inefficiencies in inference. Thus, we present the GaussianGrasper, which utilizes 3D Gaussian Splatting to explicitly represent the scene as a collection of Gaussian primitives. Our approach takes a limited set of RGB-D views and employs a tile-based splatting technique to create a feature field. In particular, we propose an Efficient Feature Distillation (EFD) module that employs contrastive learning to efficiently and accurately distill language embeddings derived from foundational models. With the reconstructed geometry of the Gaussian field, our method enables the pre-trained grasping model to generate collision-free grasp pose candidates. Furthermore, we propose a normal-guided grasp module to select the best grasp pose. Through comprehensive real-world experiments, we demonstrate that GaussianGrasper enables robots to accurately query and grasp objects with language instructions, providing a new solution for language-guided manipulation tasks. Data and codes can be available at https://github.com/MrSecant/GaussianGrasper.
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis
Mu, Yao, Chen, Junting, Zhang, Qinglong, Chen, Shoufa, Yu, Qiaojun, Ge, Chongjian, Chen, Runjian, Liang, Zhixuan, Hu, Mengkang, Tao, Chaofan, Sun, Peize, Yu, Haibao, Yang, Chao, Shao, Wenqi, Wang, Wenhai, Dai, Jifeng, Qiao, Yu, Ding, Mingyu, Luo, Ping
Robotic behavior synthesis, the problem of understanding multimodal inputs and generating precise physical control for robots, is an important part of Embodied AI. Despite successes in applying multimodal large language models for high-level understanding, it remains challenging to translate these conceptual understandings into detailed robotic actions while achieving generalization across various scenarios. In this paper, we propose a tree-structured multimodal code generation framework for generalized robotic behavior synthesis, termed RoboCodeX. RoboCodeX decomposes high-level human instructions into multiple object-centric manipulation units consisting of physical preferences such as affordance and safety constraints, and applies code generation to introduce generalization ability across various robotics platforms. To further enhance the capability to map conceptual and perceptual understanding into control commands, a specialized multimodal reasoning dataset is collected for pre-training and an iterative self-updating methodology is introduced for supervised fine-tuning. Extensive experiments demonstrate that RoboCodeX achieves state-of-the-art performance in both simulators and real robots on four different kinds of manipulation tasks and one navigation task.
RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation
Chen, Junting, Mu, Yao, Yu, Qiaojun, Wei, Tianming, Wu, Silang, Yuan, Zhecheng, Liang, Zhixuan, Yang, Chao, Zhang, Kaipeng, Shao, Wenqi, Qiao, Yu, Xu, Huazhe, Ding, Mingyu, Luo, Ping
Rapid progress in high-level task planning and code generation for open-world robot manipulation has been witnessed in Embodied AI. However, previous studies put much effort into general common sense reasoning and task planning capabilities of large-scale language or multi-modal models, relatively little effort on ensuring the deployability of generated code on real robots, and other fundamental components of autonomous robot systems including robot perception, motion planning, and control. To bridge this ``ideal-to-real'' gap, this paper presents \textbf{RobotScript}, a platform for 1) a deployable robot manipulation pipeline powered by code generation; and 2) a code generation benchmark for robot manipulation tasks in free-form natural language. The RobotScript platform addresses this gap by emphasizing the unified interface with both simulation and real robots, based on abstraction from the Robot Operating System (ROS), ensuring syntax compliance and simulation validation with Gazebo. We demonstrate the adaptability of our code generation framework across multiple robot embodiments, including the Franka and UR5 robot arms, and multiple grippers. Additionally, our benchmark assesses reasoning abilities for physical space and constraints, highlighting the differences between GPT-3.5, GPT-4, and Gemini in handling complex physical interactions. Finally, we present a thorough evaluation on the whole system, exploring how each module in the pipeline: code generation, perception, motion planning, and even object geometric properties, impact the overall performance of the system.
TreeEval: Benchmark-Free Evaluation of Large Language Models through Tree Planning
Li, Xiang, Lan, Yunshi, Yang, Chao
Recently, numerous new benchmarks have been established to evaluate the performance of large language models (LLMs) via either computing a holistic score or employing another LLM as a judge. However, these approaches suffer from data leakage due to the open access of the benchmark and inflexible evaluation process. To address this issue, we introduce $\textbf{TreeEval}$, a benchmark-free evaluation method for LLMs that let a high-performance LLM host an irreproducible evaluation session and essentially avoids the data leakage. Moreover, this LLM performs as an examiner to raise up a series of questions under a topic with a tree planing strategy, which considers the current evaluation status to decide the next question generation and ensures the completeness and efficiency of the evaluation process. We evaluate $6$ models of different parameter sizes, including $7$B, $13$B, and $33$B, and ultimately achieved the highest correlation coefficient with AlpacaEval2.0 using only around $45$ questions. We also conduct more analysis to show the robustness and reliability of TreeEval. Our code can be accessed via the provided https://github.com/Ashura5/TreeEval.
Deep adaptive sampling for surrogate modeling without labeled data
Wang, Xili, Tang, Kejun, Zhai, Jiayu, Wan, Xiaoliang, Yang, Chao
Surrogate modeling is of great practical significance for parametric differential equation systems. In contrast to classical numerical methods, using physics-informed deep learning methods to construct simulators for such systems is a promising direction due to its potential to handle high dimensionality, which requires minimizing a loss over a training set of random samples. However, the random samples introduce statistical errors, which may become the dominant errors for the approximation of low-regularity and high-dimensional problems. In this work, we present a deep adaptive sampling method for surrogate modeling ($\text{DAS}^2$), where we generalize the deep adaptive sampling (DAS) method [62] [Tang, Wan and Yang, 2023] to build surrogate models for low-regularity parametric differential equations. In the parametric setting, the residual loss function can be regarded as an unnormalized probability density function (PDF) of the spatial and parametric variables. This PDF is approximated by a deep generative model, from which new samples are generated and added to the training set. Since the new samples match the residual-induced distribution, the refined training set can further reduce the statistical error in the current approximate solution. We demonstrate the effectiveness of $\text{DAS}^2$ with a series of numerical experiments, including the parametric lid-driven 2D cavity flow problem with a continuous range of Reynolds numbers from 100 to 1000.