Chen, Jiale
QuEST: Stable Training of LLMs with 1-Bit Weights and Activations
Panferov, Andrei, Chen, Jiale, Tabesh, Soroush, Castro, Roberto L., Nikdan, Mahdi, Alistarh, Dan
One approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining even more accurate compressed models by directly training over such representations, i.e., Quantization-Aware Training (QAT), is still open: for example, a recent study (arXiv:2411.04330v2) put the "optimal" bit-width at which models can be trained using QAT, while staying accuracy-competitive with standard FP16/BF16 precision, at 8-bit weights and activations. We advance this state of the art via a new method called QuEST, which is Pareto-competitive with FP16, i.e., it provides better accuracy at lower model size, while training models with weights and activations in 4 bits or less. Moreover, QuEST allows stable training with 1-bit weights and activations. QuEST achieves this by improving two key aspects of QAT methods: (1) accurate and fast quantization of the (continuous) distributions of weights and activations via Hadamard normalization and MSE-optimal fitting; (2) a new trust gradient estimator based on the idea of explicitly minimizing the error between the noisy gradient computed over quantized states and the "true" (but unknown) full-precision gradient. Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations. We provide GPU kernel support showing that models produced by QuEST can be executed efficiently. Our code is available at https://github.com/IST-DASLab/QuEST.
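For intuition, the following is a minimal Python sketch of the quantization step described in the abstract: values are passed through an orthonormal Hadamard transform and then quantized with a symmetric integer grid whose clipping scale is chosen to minimize the mean-squared error. The function names, the grid-search fitting procedure, and all hyperparameters are illustrative assumptions rather than the paper's exact implementation (the real one is in the repository linked above).

    import torch

    def hadamard_transform(x):
        # Orthonormal fast Walsh-Hadamard transform along the last dimension,
        # assuming that dimension is a power of two.
        n = x.shape[-1]
        h = x.clone()
        step = 1
        while step < n:
            h = h.reshape(*x.shape[:-1], n // (2 * step), 2, step)
            a, b = h[..., 0, :], h[..., 1, :]
            h = torch.cat([a + b, a - b], dim=-1).reshape(*x.shape[:-1], n)
            step *= 2
        return h / n ** 0.5

    def quantize_mse(x, bits=4, n_grid=64):
        # Symmetric round-to-nearest quantization; the clipping scale is chosen
        # by a simple grid search minimizing the mean-squared error (a stand-in
        # for the paper's MSE-optimal fitting).
        qmax = 2 ** (bits - 1) - 1
        best_s, best_err = None, float("inf")
        for frac in torch.linspace(0.3, 1.0, n_grid):
            s = x.abs().max() * frac / qmax
            q = torch.clamp(torch.round(x / s), -qmax - 1, qmax) * s
            err = (q - x).pow(2).mean()
            if err < best_err:
                best_s, best_err = s, err
        return torch.clamp(torch.round(x / best_s), -qmax - 1, qmax) * best_s

    # Example: quantize a Hadamard-normalized weight tensor to 4 bits.
    w = torch.randn(128, 128)
    w_q = quantize_mse(hadamard_transform(w), bits=4)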
Scalable Mechanistic Neural Networks
Chen, Jiale, Yao, Dingling, Pervez, Adeel, Alistarh, Dan, Locatello, Francesco
We propose the Scalable Mechanistic Neural Network (S-MNN), an enhanced neural network framework designed for scientific machine learning applications involving long temporal sequences. By reformulating the original Mechanistic Neural Network (MNN) (Pervez et al., 2024), we reduce the computational time and space complexities from cubic and quadratic with respect to the sequence length, respectively, to linear. This significant improvement enables efficient modeling of long-term dynamics without sacrificing accuracy or interpretability. Extensive experiments demonstrate that S-MNN matches the original MNN in precision while substantially reducing computational resources. Consequently, S-MNN can serve as a drop-in replacement for the original MNN in existing applications, providing a practical and efficient tool for integrating mechanistic bottlenecks into neural network models of complex dynamical systems.
Enhancing Recommendation with Denoising Auxiliary Task
Liu, Pengsheng, Zheng, Linan, Chen, Jiale, Zhang, Guangfa, Xu, Yang, Fang, Jinyun
The historical interaction sequences of users play a crucial role in training recommender systems that can accurately predict user preferences. However, due to the arbitrariness of user behavior, the presence of noise in these sequences poses a challenge to predicting their next actions in recommender systems. To address this issue, our motivation is based on the observation that training noisy sequences and clean sequences (sequences without noise) with equal weights can degrade the performance of the model. We propose a novel self-supervised Auxiliary Task Joint Training (ATJT) method aimed at more accurately reweighting noisy sequences in recommender systems. Specifically, we strategically select subsets from users' original sequences and perform random replacements to generate artificially replaced noisy sequences. Subsequently, we perform joint training on these artificially replaced noisy sequences and the original sequences. Through effective reweighting, we incorporate the training results of the noise recognition model into the recommender model. We evaluate our method on three datasets using a consistent base model. Experimental results demonstrate the effectiveness of introducing a self-supervised auxiliary task to enhance the base model's performance.
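As a rough illustration of the noisy-sequence construction described above, the Python sketch below picks a random subset of positions in a user's interaction sequence and replaces the items there with random items from the catalog. The function name, the replacement ratio, and the item-pool argument are hypothetical; the actual ATJT procedure and the reweighting scheme are defined in the paper.

    import random

    def make_replaced_sequence(seq, item_pool, replace_ratio=0.2, rng=random):
        # Select a subset of positions in the original sequence and replace
        # the items there with randomly drawn items, yielding an "artificially
        # replaced noisy sequence" for training the noise recognition model.
        seq = list(seq)
        n_replace = max(1, int(len(seq) * replace_ratio))
        for pos in rng.sample(range(len(seq)), n_replace):
            seq[pos] = rng.choice(item_pool)
        return seq

    # Example: corrupt roughly 20% of a toy interaction sequence.
    noisy = make_replaced_sequence([3, 7, 42, 15, 8], item_pool=list(range(100)))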
The RoboDepth Challenge: Methods and Advancements Towards Robust Depth Estimation
Kong, Lingdong, Niu, Yaru, Xie, Shaoyuan, Hu, Hanjiang, Ng, Lai Xing, Cottereau, Benoit R., Zhao, Ding, Zhang, Liangjun, Wang, Hesheng, Ooi, Wei Tsang, Zhu, Ruijie, Song, Ziyang, Liu, Li, Zhang, Tianzhu, Yu, Jun, Jing, Mohan, Li, Pengwei, Qi, Xiaohua, Jin, Cheng, Chen, Yingfeng, Hou, Jie, Zhang, Jie, Kan, Zhen, Ling, Qiang, Peng, Liang, Li, Minglei, Xu, Di, Yang, Changpeng, Yao, Yuanqi, Wu, Gang, Kuai, Jian, Liu, Xianming, Jiang, Junjun, Huang, Jiamian, Li, Baojun, Chen, Jiale, Zhang, Shuang, Ao, Sun, Li, Zhenyu, Chen, Runze, Luo, Haiyong, Zhao, Fang, Yu, Jingze
Accurate depth estimation under out-of-distribution (OoD) scenarios, such as adverse weather conditions, sensor failure, and noise contamination, is desirable for safety-critical applications. Existing depth estimation systems, however, inevitably suffer from real-world corruptions and perturbations and struggle to provide reliable depth predictions in such cases. In this paper, we summarize the winning solutions from the RoboDepth Challenge -- an academic competition designed to facilitate and advance robust OoD depth estimation. This challenge was developed based on the newly established KITTI-C and NYUDepth2-C benchmarks. We hosted two stand-alone tracks, with an emphasis on robust self-supervised and robust fully-supervised depth estimation, respectively. Out of more than two hundred participants, nine unique and top-performing solutions emerged, with novel designs spanning the following aspects: spatial- and frequency-domain augmentations, masked image modeling, image restoration and super-resolution, adversarial training, diffusion-based noise suppression, vision-language pre-training, learned model ensembling, and hierarchical feature enhancement. Extensive experimental analyses and insightful observations are presented to better understand the rationale behind each design. We hope this challenge can lay a solid foundation for future research on robust and reliable depth estimation and beyond. The datasets, competition toolkit, workshop recordings, and source code from the winning teams are publicly available on the challenge website.
Fair Grading Algorithms for Randomized Exams
Chen, Jiale, Hartline, Jason, Zoeter, Onno
In a randomized exam, each student is asked a small number of random questions from a large question bank. The predominant grading rule is simple averaging, i.e., calculating grades by averaging scores on the questions each student is asked, which is fair ex-ante, over the randomized questions, but not fair ex-post, on the realized questions. The fair grading problem is to estimate the average grade of each student on the full question bank. The maximum-likelihood estimator for the Bradley-Terry-Luce model on the bipartite student-question graph is shown to be consistent with high probability when the number of questions asked to each student is at least the cubed-logarithm of the number of students. In an empirical study on exam data and in simulations, our algorithm based on the maximum-likelihood estimator significantly outperforms simple averaging in prediction accuracy and ex-post fairness even with a small class and exam size.
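To make the estimator concrete, here is a hedged Python sketch of a Bradley-Terry-Luce / Rasch-style maximum-likelihood fit on the bipartite student-question graph, with P(correct) = sigmoid(theta_student - beta_question) and the fair grade taken as the student's expected score over the full question bank. The parameterization, the plain gradient-ascent optimizer, and the hyperparameters are illustrative assumptions, not the paper's algorithm.

    import numpy as np

    def fit_btl(observations, n_students, n_questions, iters=500, lr=1.0):
        # observations: list of (student, question, score in [0, 1]) triples
        # from the realized exam. Gradient ascent on the log-likelihood of a
        # logistic model P(correct) = sigmoid(theta[s] - beta[q]).
        theta = np.zeros(n_students)   # student abilities
        beta = np.zeros(n_questions)   # question difficulties
        for _ in range(iters):
            g_theta = np.zeros(n_students)
            g_beta = np.zeros(n_questions)
            for s, q, y in observations:
                p = 1.0 / (1.0 + np.exp(-(theta[s] - beta[q])))
                g_theta[s] += y - p
                g_beta[q] -= y - p
            theta += lr * g_theta / len(observations)
            beta += lr * g_beta / len(observations)
        return theta, beta

    def fair_grade(theta, beta, student):
        # Estimated average grade of one student over the full question bank.
        return float(np.mean(1.0 / (1.0 + np.exp(-(theta[student] - beta)))))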
Asynchronous Episodic Deep Deterministic Policy Gradient: Towards Continuous Control in Computationally Complex Environments
Zhang, Zhizheng, Chen, Jiale, Chen, Zhibo, Li, Weiping
Deep Deterministic Policy Gradient (DDPG) has proven to be a successful reinforcement learning (RL) algorithm for continuous control tasks. However, DDPG still suffers from data insufficiency and training inefficiency, especially in computationally complex environments. In this paper, we propose Asynchronous Episodic DDPG (AE-DDPG), an extension of DDPG that achieves more effective learning with less training time. First, we design a modified scheme for asynchronous data collection. Generally, for asynchronous RL algorithms, sample efficiency and/or training stability diminish as the degree of parallelism increases. We consider this problem from the perspectives of both data generation and data utilization. In detail, we re-design experience replay by introducing the idea of episodic control so that the agent can latch onto good trajectories rapidly. In addition, we inject a new type of noise in the action space to enrich exploration behaviors. Experiments demonstrate that AE-DDPG achieves higher rewards and requires less training time than most popular RL algorithms on the Learning to Run task, which has a computationally complex environment. Beyond control tasks in computationally complex environments, AE-DDPG also achieves higher rewards and a 2- to 4-fold improvement in sample efficiency on average compared to other variants of DDPG in MuJoCo environments. Furthermore, we verify the effectiveness of each proposed technique component through extensive ablation studies.
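The episodic-control idea mentioned above can be illustrated, under assumptions, by a replay buffer that keeps a small pool of the highest-return episodes and mixes their transitions into each sampled batch. The class name, pool size, and mixing ratio below are hypothetical; the actual AE-DDPG design (including asynchronous collection and the injected action noise) is described in the paper.

    import random
    from collections import deque

    class EpisodicReplayBuffer:
        # Standard replay memory plus a small pool of high-return episodes
        # whose transitions are mixed into every batch, so good trajectories
        # are replayed more often (the agent "latches onto" them quickly).
        def __init__(self, capacity=100000, elite_pool=16, elite_ratio=0.3):
            self.buffer = deque(maxlen=capacity)
            self.elite = []  # list of (episode_return, transitions) pairs
            self.elite_pool = elite_pool
            self.elite_ratio = elite_ratio

        def add_episode(self, transitions, episode_return):
            # transitions: list of (state, action, reward, next_state, done).
            self.buffer.extend(transitions)
            self.elite.append((episode_return, transitions))
            self.elite.sort(key=lambda e: e[0], reverse=True)
            self.elite = self.elite[:self.elite_pool]

        def sample(self, batch_size):
            n_elite = int(batch_size * self.elite_ratio) if self.elite else 0
            batch = random.sample(list(self.buffer), batch_size - n_elite)
            if n_elite:
                pool = [t for _, ep in self.elite for t in ep]
                batch += random.choices(pool, k=n_elite)
            return batch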
Learning to Run challenge solutions: Adapting reinforcement learning methods for neuromusculoskeletal environments
Kidziński, Łukasz, Mohanty, Sharada Prasanna, Ong, Carmichael, Huang, Zhewei, Zhou, Shuchang, Pechenko, Anton, Stelmaszczyk, Adam, Jarosik, Piotr, Pavlov, Mikhail, Kolesnikov, Sergey, Plis, Sergey, Chen, Zhibo, Zhang, Zhizheng, Chen, Jiale, Shi, Jun, Zheng, Zhuobin, Yuan, Chun, Lin, Zhihui, Michalewski, Henryk, Miłoś, Piotr, Osiński, Błażej, Melnik, Andrew, Schilling, Malte, Ritter, Helge, Carroll, Sean, Hicks, Jennifer, Levine, Sergey, Salathé, Marcel, Delp, Scott
In the NIPS 2017 Learning to Run challenge, participants were tasked with building a controller for a musculoskeletal model to make it run as fast as possible through an obstacle course. Top participants were invited to describe their algorithms. In this work, we present eight solutions that used deep reinforcement learning approaches, based on algorithms such as Deep Deterministic Policy Gradient, Proximal Policy Optimization, and Trust Region Policy Optimization. Many solutions used similar relaxations and heuristics, such as reward shaping, frame skipping, discretization of the action space, symmetry, and policy blending. However, each of the eight teams implemented different modifications of the known algorithms.
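As one example of the heuristics listed above, a frame-skipping wrapper repeats each chosen action for several simulator steps and accumulates the reward, reducing the number of policy queries per episode. The sketch below assumes a classic gym-style reset/step environment interface and is not taken from any particular team's solution.

    class FrameSkip:
        # Repeat each action for `skip` simulator steps, summing the reward,
        # so the policy is queried less often in an expensive simulation.
        def __init__(self, env, skip=4):
            self.env = env
            self.skip = skip

        def reset(self):
            return self.env.reset()

        def step(self, action):
            total_reward, done, info = 0.0, False, {}
            for _ in range(self.skip):
                obs, reward, done, info = self.env.step(action)
                total_reward += reward
                if done:
                    break
            return obs, total_reward, done, info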