Liang, Chen
Scaling User Modeling: Large-scale Online User Representations for Ads Personalization in Meta
Zhang, Wei, Li, Dai, Liang, Chen, Zhou, Fang, Zhang, Zhongke, Wang, Xuewei, Li, Ru, Zhou, Yi, Huang, Yaning, Liang, Dong, Wang, Kai, Wang, Zhangyuan, Chen, Zhengxing, Li, Min, Wu, Fenggang, Chen, Minghai, Li, Huayu, Wu, Yunnan, Shu, Zhan, Yuan, Mindi, Reddy, Sri
Effective user representations are pivotal in personalized advertising. However, stringent constraints on training throughput, serving latency, and memory often limit the complexity and input feature set of online ads ranking models. This challenge is magnified in extensive systems like Meta's, which encompass hundreds of models with diverse specifications, rendering the tailoring of user representation learning for each model impractical. To address these challenges, we present Scaling User Modeling (SUM), a framework widely deployed in Meta's ads ranking system, designed to facilitate efficient and scalable sharing of online user representations across hundreds of ads models. SUM leverages a few designated upstream user models to synthesize user embeddings from massive amounts of user features with advanced modeling techniques. These embeddings then serve as inputs to downstream online ads ranking models, promoting efficient representation sharing. To adapt to the dynamic nature of user features and ensure embedding freshness, we designed the SUM Online Asynchronous Platform (SOAP), a latency-free online serving system complemented with model freshness and embedding stabilization, which enables frequent user model updates and online inference of user embeddings upon each user request. We share our hands-on deployment experiences for the SUM framework and validate its superiority through comprehensive experiments. To date, SUM has been launched to hundreds of ads ranking models in Meta, processing hundreds of billions of user requests daily and yielding significant online metric gains and infrastructure cost savings.
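As a rough illustration of the upstream/downstream split described above, the sketch below computes a shared user embedding once and feeds it to several lightweight downstream rankers. All module names, dimensions, and the torch-based design are our own assumptions for illustration, not Meta's SUM code.

```python
# Toy sketch only: all module names, dimensions, and the torch-based design are
# our assumptions for illustration, not Meta's SUM code.
import torch
import torch.nn as nn

class UpstreamUserModel(nn.Module):
    """Synthesizes a compact user embedding from a wide set of user features."""
    def __init__(self, num_user_features=1024, embedding_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(num_user_features, 256), nn.ReLU(),
            nn.Linear(256, embedding_dim),
        )

    def forward(self, user_features):
        return self.encoder(user_features)

class DownstreamRanker(nn.Module):
    """Lightweight ranker consuming the shared embedding plus its own ad features."""
    def __init__(self, embedding_dim=64, num_ad_features=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(embedding_dim + num_ad_features, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, user_embedding, ad_features):
        return torch.sigmoid(self.head(torch.cat([user_embedding, ad_features], dim=-1)))

# The heavy user-feature processing runs once per request (upstream); the
# resulting embedding is shared by many downstream ranking models.
upstream = UpstreamUserModel()
rankers = [DownstreamRanker() for _ in range(3)]
user_embedding = upstream(torch.randn(1, 1024))            # computed once, then served
scores = [r(user_embedding, torch.randn(1, 128)) for r in rankers]
```

The point of the split is that the expensive user-feature processing is paid once per request and amortized across all downstream models.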
Provably Robust Semi-Infinite Program Under Collision Constraints via Subdivision
Zhang, Duo, Liang, Chen, Gao, Xifeng, Wu, Kui, Pan, Zherong
We present a semi-infinite program (SIP) solver for trajectory optimization of general articulated robots. These problems are more challenging than standard Nonlinear Programs (NLPs) because they involve an infinite number of non-convex collision constraints. Prior SIP solvers based on constraint sampling cannot guarantee the satisfaction of all constraints. Instead, our method uses a conservative bound on articulated body motions to ensure solution feasibility throughout the optimization procedure. We further use subdivision to adaptively reduce the error in the conservative motion estimation. Combined, we prove that our SIP solver guarantees feasibility while approaching a critical point of the SIP problem up to an arbitrary user-provided precision. We have verified our method on a range of trajectory optimization problems involving industrial robot arms and UAVs, where our method can generate collision-free, locally optimal trajectories within a couple of minutes.
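A minimal sketch of the certify-or-subdivide idea, assuming a scalar constraint g(t) >= margin over a normalized time interval and a known Lipschitz bound L; these are our simplifications of the paper's conservative motion bound for articulated bodies.

```python
# Minimal illustrative sketch (not the paper's solver): certify g(t) >= margin for
# all t in [lo, hi] by combining a sampled midpoint value with a Lipschitz-style
# conservative bound, subdividing intervals whose bound is too loose.
def certify(g, lo, hi, L, margin=0.0, max_depth=20):
    mid = 0.5 * (lo + hi)
    half = 0.5 * (hi - lo)
    lower_bound = g(mid) - L * half      # conservative lower bound of g over [lo, hi]
    if lower_bound >= margin:
        return True                      # whole interval certified feasible
    if g(mid) < margin or max_depth == 0:
        return False                     # violated (or refinement budget spent)
    # Bound too loose: subdivide and certify both halves.
    return (certify(g, lo, mid, L, margin, max_depth - 1) and
            certify(g, mid, hi, L, margin, max_depth - 1))

# Example: a toy signed-distance profile along a trajectory; stays above 0.05 everywhere.
g = lambda t: 0.3 + 0.2 * (t - 0.5) ** 2
print(certify(g, 0.0, 1.0, L=0.5, margin=0.05))   # True
```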
Second-Order Convergent Collision-Constrained Optimization-Based Planner
Liang, Chen, Gao, Xifeng, Wu, Kui, Pan, Zherong
Finding robot poses and trajectories represents a foundational aspect of robot motion planning. Despite decades of research, efficiently and robustly addressing these challenges is still difficult. Existing approaches are often plagued by various limitations, such as intricate geometric approximations, violations of collision constraints, or slow first-order convergence. In this paper, we introduce two novel optimization formulations that offer provable robustness, achieving second-order convergence while requiring only a convex approximation of the robot's links and obstacles. Our first method, known as the Explicit Collision Barrier (ECB) method, employs a barrier function to guarantee separation between convex objects. ECB uses an efficient matrix factorization technique, enabling a second-order Newton's method with a per-iteration complexity linear in the number of separating planes. Our second method, referred to as the Implicit Collision Barrier (ICB) method, further transforms the separating planes into implicit functions of robot poses. We show that such an implicit objective function is twice differentiable, with derivatives evaluated in linear complexity. To assess the effectiveness of our approaches, we conduct a comparative study with a first-order baseline algorithm across six testing scenarios. Our results clearly demonstrate that our methods exhibit significantly faster convergence than the baseline algorithm.
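As a toy illustration of how a collision barrier keeps an optimizer strictly feasible, the sketch below uses spheres as stand-ins for convex bodies and a clamped log-barrier as a stand-in for the paper's separating-plane barriers; none of the constants are from the paper.

```python
# Toy sketch: spheres stand in for convex bodies and a clamped log-barrier stands
# in for the paper's separating-plane barriers; it only illustrates how a barrier
# term keeps an optimizer strictly collision-free.
import numpy as np

def barrier_energy(p, obstacle_center, r_link=0.2, r_obs=0.3, d_hat=0.1):
    """Log-barrier that tends to +inf as separation tends to 0 and vanishes beyond d_hat."""
    sep = np.linalg.norm(p - obstacle_center) - (r_link + r_obs)
    if sep <= 0.0:
        return np.inf                    # interpenetration: infeasible
    if sep >= d_hat:
        return 0.0                       # far enough: no penalty
    return -((sep - d_hat) ** 2) * np.log(sep / d_hat)

# Summing such terms over all link/obstacle pairs and adding them to the smooth
# trajectory objective lets a descent method (e.g. Newton) remain feasible.
print(barrier_energy(np.array([1.0, 0.0]), np.array([0.0, 0.0])))   # 0.0 (outside d_hat)
```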
Learning Reduced-Order Soft Robot Controller
Liang, Chen, Gao, Xifeng, Wu, Kui, Pan, Zherong
Deformable robots are notoriously difficult to model or control due to their high-dimensional configuration spaces. Direct trajectory optimization suffers from the curse of dimensionality and incurs a high computational cost, while learning-based controller optimization methods are sensitive to hyper-parameter tuning. To overcome these limitations, we hypothesize that high-fidelity soft robots can be both simulated and controlled by restricting them to low-dimensional spaces. Under this assumption, we propose a two-stage algorithm to identify such simulation- and control-spaces. Our method first identifies the so-called simulation-space that captures the salient deformation modes, to which the robot's governing equation is restricted. We then identify the control-space, to which control signals are restricted. We propose a multi-fidelity Riemannian Bayesian bilevel optimization to identify task-specific control-spaces. We show that the dimension of the control-space can be less than $10$ for a high-DOF soft robot to accomplish walking and swimming tasks, allowing low-dimensional MPC controllers to be applied to soft robots with tractable computational complexity.
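A hedged sketch of the subspace idea follows, using a simple POD/PCA basis over synthetic deformation snapshots as a stand-in for the paper's simulation- and control-space identification, which is considerably more involved.

```python
# Hedged sketch: a POD/PCA basis over synthetic deformation snapshots stands in
# for the paper's simulation-/control-space identification.
import numpy as np

rng = np.random.default_rng(0)
n_dof, n_snapshots, k = 3000, 200, 10                    # high-DOF mesh, few modes
snapshots = rng.standard_normal((n_snapshots, n_dof))    # stand-in for simulated deformations
q_mean = snapshots.mean(axis=0)                          # reference configuration

# POD/PCA: leading right singular vectors of the centered snapshots are the modes.
_, _, Vt = np.linalg.svd(snapshots - q_mean, full_matrices=False)
U = Vt[:k].T                                             # (n_dof, k) reduced basis

def to_full(z):
    """Map reduced coordinates back to the full configuration q ~ q_mean + U z."""
    return q_mean + U @ z

z = rng.standard_normal(k)   # e.g. the low-dimensional variables an MPC controller optimizes
q = to_full(z)               # full-space configuration for rendering/contact handling
print(q.shape)               # (3000,)
```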
Module-wise Adaptive Distillation for Multimodality Foundation Models
Liang, Chen, Yu, Jiahui, Yang, Ming-Hsuan, Brown, Matthew, Cui, Yin, Zhao, Tuo, Gong, Boqing, Zhou, Tianyi
Pre-trained multimodal foundation models have demonstrated remarkable generalizability but pose challenges for deployment due to their large sizes. One effective approach to reducing their sizes is layer-wise distillation, wherein small student models are trained to match the hidden representations of large teacher models at each layer. Motivated by our observation that certain architecture components, referred to as modules, contribute more significantly to the student's performance than others, we propose to track the contributions of individual modules by recording the loss decrement after distilling each module, and to distill the modules with greater contributions more frequently. Such an approach can be naturally formulated as a multi-armed bandit (MAB) problem, where modules and loss decrements are considered arms and rewards, respectively. We then develop a modified Thompson sampling algorithm named OPTIMA to address the nonstationarity of module contributions resulting from model updating. Specifically, we leverage the observed contributions in recent history to estimate the changing contribution of each module and select modules based on these estimations to maximize the cumulative contribution. We evaluate the effectiveness of OPTIMA through distillation experiments on various multimodal understanding and image captioning tasks, using the CoCa-Large model (Yu et al., 2022) as the teacher model.
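The bandit view can be sketched as follows; the module names, the sliding window, and the Gaussian posterior below are our assumptions rather than the OPTIMA implementation.

```python
# Hedged sketch of the bandit view: each module is an arm, the reward is the loss
# decrement observed after distilling it, and a sliding window of recent rewards
# handles nonstationarity. Names and hyper-parameters are ours, not OPTIMA's.
import random
from collections import deque

class ModuleBandit:
    def __init__(self, module_names, window=50, prior_std=1.0):
        self.history = {m: deque(maxlen=window) for m in module_names}
        self.prior_std = prior_std

    def select(self):
        samples = {}
        for m, rewards in self.history.items():
            if rewards:
                mean = sum(rewards) / len(rewards)
                std = self.prior_std / len(rewards) ** 0.5
            else:
                mean, std = 0.0, self.prior_std      # unexplored module: wide posterior
            samples[m] = random.gauss(mean, std)     # Thompson sample per module
        return max(samples, key=samples.get)

    def update(self, module, loss_decrement):
        self.history[module].append(loss_decrement)

bandit = ModuleBandit(["vision_encoder", "text_encoder", "multimodal_decoder"])
for step in range(5):
    m = bandit.select()
    # ... distill module m for a few steps and measure the actual loss decrement ...
    bandit.update(m, loss_decrement=random.random())
```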
LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation
Li, Yixiao, Yu, Yifan, Zhang, Qingru, Liang, Chen, He, Pengcheng, Chen, Weizhu, Zhao, Tuo
Transformer models have achieved remarkable results on various natural language tasks, but they are often prohibitively large, requiring massive memory and computational resources. To reduce the size and complexity of these models, we propose LoSparse (Low-Rank and Sparse approximation), a novel model compression technique that approximates a weight matrix by the sum of a low-rank matrix and a sparse matrix. Our method combines the advantages of both low-rank approximation and pruning while avoiding their limitations. Low-rank approximation compresses the coherent and expressive parts of the neurons, while pruning removes the incoherent and non-expressive parts. Pruning enhances the diversity of the low-rank approximation, and the low-rank approximation prevents pruning from losing too many expressive neurons. We evaluate our method on natural language understanding, question answering, and natural language generation tasks, and show that it significantly outperforms existing compression methods.
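A minimal sketch of the low-rank-plus-sparse decomposition, assuming truncated SVD for the low-rank factor and magnitude pruning of the residual for the sparse factor; the paper learns both components jointly during training.

```python
# Minimal sketch: truncated SVD for the low-rank part plus magnitude pruning of
# the residual for the sparse part (the paper learns both components jointly).
import numpy as np

def losparse_like(W, rank=8, sparsity=0.9):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]    # rank-r approximation
    residual = W - low_rank
    thresh = np.quantile(np.abs(residual), sparsity)   # keep the largest (1 - sparsity) entries
    sparse = np.where(np.abs(residual) >= thresh, residual, 0.0)
    return low_rank, sparse

W = np.random.randn(256, 256)
L, S = losparse_like(W)
err = np.linalg.norm(W - (L + S)) / np.linalg.norm(W)
print(f"nonzeros in S: {int((S != 0).sum())}, relative error: {err:.3f}")
```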
Less is More: Task-aware Layer-wise Distillation for Language Model Compression
Liang, Chen, Zuo, Simiao, Zhang, Qingru, He, Pengcheng, Chen, Weizhu, Zhao, Tuo
Layer-wise distillation is a powerful tool for compressing large models (i.e., teacher models) into small ones (i.e., student models). The student distills knowledge from the teacher by mimicking the teacher's hidden representations at every intermediate layer. However, layer-wise distillation is difficult: because the student has a smaller model capacity than the teacher, it often under-fits. Furthermore, the hidden representations of the teacher contain redundant information that the student does not necessarily need for learning the target task. To address these challenges, we propose a novel Task-aware layEr-wise Distillation (TED) method. TED designs task-aware filters to align the hidden representations of the student and the teacher at each layer. The filters select the knowledge that is useful for the target task from the hidden representations. As such, TED reduces the knowledge gap between the two models and helps the student fit better on the target task. We evaluate TED in two scenarios: continual pre-training and fine-tuning. TED demonstrates significant and consistent improvements over existing distillation methods in both scenarios. Code is available at https://github.com/cliang1453/task-aware-distillation.
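A toy sketch of a task-aware layer-wise matching loss, with assumed shapes and a simple linear filter; in TED the filters are additionally trained with a task head before the distillation stage.

```python
# Toy sketch with assumed shapes and simple linear filters; in TED the filters
# are additionally trained with a task head before the distillation stage.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskAwareFilter(nn.Module):
    """Projects a hidden state into a task-relevant subspace before matching."""
    def __init__(self, hidden_dim, filter_dim):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, filter_dim)

    def forward(self, h):
        return self.proj(h)

def ted_layer_loss(student_h, teacher_h, student_filter, teacher_filter):
    # Match filtered representations rather than raw hidden states, so the student
    # only needs to mimic the task-relevant part of the teacher's knowledge.
    return F.mse_loss(student_filter(student_h), teacher_filter(teacher_h))

student_h, teacher_h = torch.randn(4, 128, 384), torch.randn(4, 128, 768)
loss = ted_layer_loss(student_h, teacher_h,
                      TaskAwareFilter(384, 256), TaskAwareFilter(768, 256))
loss.backward()
```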
Contrastive Shapelet Learning for Unsupervised Multivariate Time Series Representation Learning
Liang, Zhiyu, Zhang, Jianfeng, Liang, Chen, Wang, Hongzhi, Liang, Zheng, Pan, Lujia
Recent studies have shown great promise in unsupervised representation learning (URL) for multivariate time series, because URL has the capability of learning generalizable representations for many downstream tasks without using inaccessible labels. However, existing approaches usually adopt models originally designed for other domains (e.g., computer vision) to encode the time series data and rely on strong assumptions to design learning objectives, which limits their ability to perform well. To deal with these problems, we propose a novel URL framework for multivariate time series that learns a time-series-specific, shapelet-based representation through a popular contrastive learning paradigm. To the best of our knowledge, this is the first work to explore shapelet-based embeddings for unsupervised general-purpose representation learning. A unified shapelet-based encoder and a novel learning objective with multi-grained contrasting and multi-scale alignment are particularly designed to achieve our goal, and a data augmentation library is employed to improve generalization. We conduct extensive experiments on dozens of real-world datasets to assess the representation quality on many downstream tasks, including classification, clustering, and anomaly detection. The results demonstrate the superiority of our method over not only URL competitors, but also techniques specially designed for the downstream tasks. Our code has been made publicly available at https://github.com/real2fish/CSL.
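An illustrative sketch of a shapelet-based encoder under our own simplifications: the representation of a series is its minimal sliding-window distance to each learnable shapelet. Shapelet count and length are assumptions, and the paper adds multi-grained contrasting and multi-scale alignment on top of such an encoder.

```python
# Illustrative sketch under our own simplifications, not the paper's encoder.
import torch
import torch.nn as nn

class ShapeletEncoder(nn.Module):
    def __init__(self, n_shapelets=32, shapelet_len=16, n_channels=3):
        super().__init__()
        self.shapelets = nn.Parameter(torch.randn(n_shapelets, n_channels, shapelet_len))

    def forward(self, x):                                   # x: (batch, channels, time)
        windows = x.unfold(2, self.shapelets.shape[-1], 1)  # (B, C, n_windows, L)
        # Squared distance between every window and every shapelet, minimized over windows.
        diff = windows.unsqueeze(1) - self.shapelets.unsqueeze(0).unsqueeze(3)
        dist = diff.pow(2).sum(dim=(2, 4))                  # (B, n_shapelets, n_windows)
        return dist.min(dim=-1).values                      # (B, n_shapelets) embedding

x = torch.randn(8, 3, 128)
print(ShapeletEncoder()(x).shape)                           # torch.Size([8, 32])
```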
Symbolic Discovery of Optimization Algorithms
Chen, Xiangning, Liang, Chen, Huang, Da, Real, Esteban, Wang, Kaiyuan, Liu, Yao, Pham, Hieu, Dong, Xuanyi, Luong, Thang, Hsieh, Cho-Jui, Lu, Yifeng, Le, Quoc V.
We present a method to formulate algorithm discovery as program search, and apply it to discover optimization algorithms for deep neural network training. We leverage efficient search techniques to explore an infinite and sparse program space. To bridge the large generalization gap between proxy and target tasks, we also introduce program selection and simplification strategies. Our method discovers a simple and effective optimization algorithm, $\textbf{Lion}$ ($\textit{Evo$\textbf{L}$ved S$\textbf{i}$gn M$\textbf{o}$me$\textbf{n}$tum}$). It is more memory-efficient than Adam as it only keeps track of the momentum. Unlike adaptive optimizers, its update, computed through the sign operation, has the same magnitude for each parameter. We compare Lion with widely used optimizers, such as Adam and Adafactor, for training a variety of models on different tasks. On image classification, Lion boosts the accuracy of ViT by up to 2% on ImageNet and saves up to 5x the pre-training compute on JFT. On vision-language contrastive learning, we achieve 88.3% $\textit{zero-shot}$ and 91.1% $\textit{fine-tuning}$ accuracy on ImageNet, surpassing the previous best results by 2% and 0.1%, respectively. On diffusion models, Lion outperforms Adam by achieving a better FID score and reducing the training compute by up to 2.3x. On autoregressive and masked language modeling and on fine-tuning, Lion exhibits similar or better performance compared to Adam. Our analysis of Lion reveals that its performance gain grows with the training batch size. It also requires a smaller learning rate than Adam because the sign function produces updates with a larger norm. Additionally, we examine the limitations of Lion and identify scenarios where its improvements are small or not statistically significant. Lion has also been successfully deployed in production systems such as the Google Search ads CTR model.
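A sketch of the Lion update rule as described in the paper: the update is the sign of an interpolation between the gradient and a single momentum buffer, with decoupled weight decay, so every coordinate moves by the same magnitude. The hyper-parameter values below are illustrative defaults, not tuned settings.

```python
# Sketch of the Lion update with a single momentum buffer and decoupled weight decay.
import torch

def lion_step(param, grad, momentum, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.01):
    update = torch.sign(beta1 * momentum + (1 - beta1) * grad)   # same magnitude per coordinate
    param -= lr * (update + wd * param)                          # decoupled weight decay
    momentum.mul_(beta2).add_(grad, alpha=1 - beta2)             # the only optimizer state
    return param, momentum

p, m, g = torch.randn(10), torch.zeros(10), torch.randn(10)
p, m = lion_step(p, g, m)
```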
UniTS: A Universal Time Series Analysis Framework with Self-supervised Representation Learning
Liang, Zhiyu, Liang, Chen, Liang, Zheng, Wang, Hongzhi
Machine learning has emerged as a powerful tool for time series analysis. Existing methods are usually customized for different analysis tasks and face challenges in tackling practical problems such as partial labeling and domain shift. To achieve universal analysis and address these problems, we develop UniTS, a novel framework that incorporates self-supervised representation learning (or pre-training). The components of UniTS are designed using sklearn-like APIs to allow flexible extensions. We demonstrate how users can easily perform an analysis task using the user-friendly GUIs, and we show the superior performance of UniTS over traditional task-specific methods that lack self-supervised pre-training on five mainstream tasks and in two practical settings.
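A purely hypothetical illustration of the sklearn-like fit/transform pattern the abstract refers to; the class and method names below are ours, not the actual UniTS API.

```python
# Hypothetical sklearn-like pattern: a self-supervised encoder exposed as a
# fit/transform estimator, chained with a downstream task estimator.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

class SelfSupervisedTSEncoder(BaseEstimator, TransformerMixin):
    """Stand-in encoder: pretends to pre-train, then maps each series to features."""
    def fit(self, X, y=None):
        # ... self-supervised pre-training on unlabeled series would happen here ...
        return self

    def transform(self, X):
        # Toy featurization; a real encoder would output learned representations.
        return np.stack([X.mean(axis=-1), X.std(axis=-1), X.max(axis=-1)], axis=-1)

X = np.random.randn(100, 64)                     # 100 univariate series of length 64
y = (X.mean(axis=-1) > 0).astype(int)            # toy labels
clf = make_pipeline(SelfSupervisedTSEncoder(), LogisticRegression())
clf.fit(X, y)
print(clf.score(X, y))
```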