Jiang, Shali
External Large Foundation Model: How to Efficiently Serve Trillions of Parameters for Online Ads Recommendation
Liang, Mingfu, Liu, Xi, Jin, Rong, Liu, Boyang, Suo, Qiuling, Zhou, Qinghai, Zhou, Song, Chen, Laming, Zheng, Hua, Li, Zhiyuan, Jiang, Shali, Yang, Jiyan, Xia, Xiaozhen, Yang, Fan, Badr, Yasmine, Wen, Ellie, Xu, Shuyu, Chen, Hansey, Zhang, Zhengyu, Nie, Jade, Yang, Chunzhi, Zeng, Zhichen, Zhang, Weilin, Huang, Xingliang, Li, Qianru, Wang, Shiquan, Lyu, Evelyn, Lu, Wenjing, Zhang, Rui, Wang, Wenjun, Rudy, Jason, Hang, Mengyue, Wang, Kai, Ma, Yinbin, Wang, Shuaiwen, Zeng, Sihan, Tang, Tongyi, Wei, Xiaohan, Jin, Longhao, Zhang, Jamey, Chen, Marcus, Zhang, Jiayi, Huang, Angie, Zhang, Chi, Zhao, Zhengli, Yang, Jared, Jin, Qiang, Chen, Xian, Amlesahwaram, Amit Anand, Song, Lexi, Luo, Liang, Hao, Yuchen, Xiao, Nan, Yetim, Yavuz, Pan, Luoshang, Liu, Gaoxiang, Hu, Yuxi, Huang, Yuzhen, Xu, Jackie, Zhu, Rich, Zhang, Xin, Liu, Yiqun, Yin, Hang, Chen, Yuxin, Zhang, Buyun, Liu, Xiaoyi, Wang, Xingyuan, Mao, Wenguang, Li, Zhijing, Huang, Qin, Sun, Chonglin, Yu, Nancy, Gu, Shuo, Mao, Shupin, Au, Benjamin, Qin, Jingzheng, Yao, Peggy, Choi, Jae-Woo, Gao, Bin, Wang, Ernest, Zhang, Lei, Chen, Wen-Yen, Lee, Ted, Zha, Jay, Meng, Yi, Gong, Alex, Gao, Edison, Vahdatpour, Alireza, Han, Yiping, Yao, Yantao, Kureha, Toshinari, Chang, Shuo, Sultan, Musharaf, Bocharov, John, Chordia, Sagar, Gan, Xiaorui, Sun, Peng, Liu, Rocky, Long, Bo, Chen, Wenlin, Kolay, Santanu, Li, Huayu
Ads recommendation is a prominent service of online advertising systems and has been actively studied. Recent studies indicate that scaling up the recommendation model and refining its design can bring significant performance improvements. However, as model scale grows, such studies drift further from industrial practice, because they often neglect two fundamental challenges of industrial-scale applications. First, training and inference budgets are restricted for the model to be served; exceeding them may incur latency and impair the user experience. Second, large-volume data arrive in a streaming fashion with dynamically shifting distributions, as new users/ads join and existing users/ads leave the system. We propose the External Large Foundation Model (ExFM) framework to address these overlooked challenges. Specifically, we develop external distillation and a data augmentation system (DAS) to control the computational cost of training/inference while maintaining high performance. We design the teacher as a foundation model (FM) that can serve multiple students as vertical models (VMs), amortizing its construction cost. We propose an Auxiliary Head and a Student Adapter to mitigate the data distribution gap between the FM and the VMs caused by streaming data. Comprehensive experiments on internal industrial-scale applications and public datasets demonstrate significant performance gains from ExFM.
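The auxiliary-head idea lends itself to a compact illustration. Below is a minimal, hypothetical PyTorch sketch (shapes, names, and the loss weighting are our assumptions, not the production ExFM code): the serving head learns from ground-truth labels only, while a separate auxiliary head absorbs the foundation model's supervision, so a stale or distribution-shifted teacher cannot directly distort the served predictions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentWithAuxHead(nn.Module):
    """Vertical model (student) with a distillation-only auxiliary head."""
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.serving_head = nn.Linear(hidden, 1)  # used at inference time
        self.aux_head = nn.Linear(hidden, 1)      # absorbs teacher supervision

    def forward(self, x):
        h = self.backbone(x)
        return self.serving_head(h).squeeze(-1), self.aux_head(h).squeeze(-1)

def training_loss(student, x, labels, teacher_logits, alpha=0.5):
    serving, aux = student(x)
    hard = F.binary_cross_entropy_with_logits(serving, labels)
    # Soft targets from the external foundation model go to the aux head only.
    soft = F.binary_cross_entropy_with_logits(aux, torch.sigmoid(teacher_logits))
    return hard + alpha * soft
```

Because the backbone is shared, the teacher's knowledge still shapes the learned representation, while the serving head stays calibrated to fresh labels.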
AutoML for Large Capacity Modeling of Meta's Ranking Systems
Yin, Hang, Liu, Kuang-Hung, Sun, Mengying, Chen, Yuxin, Zhang, Buyun, Liu, Jiang, Sehgal, Vivek, Panchal, Rudresh Rajnikant, Hotaj, Eugen, Liu, Xi, Guo, Daifeng, Zhang, Jamey, Wang, Zhou, Jiang, Shali, Li, Huayu, Chen, Zhengxing, Chen, Wen-Yen, Yang, Jiyan, Wen, Wei
Web-scale ranking systems at Meta, serving billions of users, are complex. Improving ranking models is essential but engineering-heavy. Automated Machine Learning (AutoML) can release engineers from the labor-intensive work of tuning ranking models; however, it is unknown whether AutoML is efficient enough to meet tight real-world production timelines while also improving on strong baselines. Moreover, to achieve higher ranking performance, there is an ever-increasing demand to scale ranking models to even larger capacity, which imposes further efficiency challenges. The large model scale and tight production schedule require AutoML to outperform human baselines using only a small number of model-evaluation trials (around 100). We present a sampling-based AutoML method, focusing on neural architecture search and hyperparameter optimization, that addresses these challenges when building large-capacity models in Meta-scale production. Our approach efficiently handles large-scale data demands. It leverages a lightweight predictor-based searcher and reinforcement learning to explore vast search spaces, significantly reducing the number of model evaluations. Through experiments in large-capacity modeling for CTR and CVR applications, we show that our method achieves outstanding Return on Investment (ROI) versus human-tuned baselines, with up to 0.09% Normalized Entropy (NE) loss reduction or 25% Queries per Second (QPS) increase by sampling only one hundred models on average from a curated search space. The proposed AutoML method has already made real-world impact: a discovered Instagram CTR model with up to -0.36% NE gain over the existing production baseline was selected for a large-scale online A/B test and showed statistically significant gains. These production results proved the method's efficacy and accelerated the adoption of AutoML in ranking systems at Meta.
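The core efficiency lever is screening many candidate configurations with a cheap learned predictor so that only a handful are actually trained. A minimal sketch of such a predictor-based loop follows (the three-knob search space, random-forest predictor, and trial counts are illustrative assumptions; the paper's searcher additionally uses reinforcement learning):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def sample_config():
    # Hypothetical search space: (num layers, width, log10 learning rate).
    return np.array([rng.integers(1, 8), rng.integers(64, 1024), rng.uniform(-5, -2)])

def evaluate(config):
    # Stand-in for an expensive training run returning a loss (e.g., NE).
    return float(0.01 * config[0] - 0.001 * config[2] + rng.normal(0, 0.01))

history_x, history_y = [], []
for _ in range(10):                   # warm-up: random trials
    c = sample_config()
    history_x.append(c); history_y.append(evaluate(c))

for _ in range(90):                   # predictor-guided trials (~100 total)
    predictor = RandomForestRegressor(n_estimators=50).fit(history_x, history_y)
    pool = np.stack([sample_config() for _ in range(1000)])
    best = pool[np.argmin(predictor.predict(pool))]   # cheap surrogate screening
    history_x.append(best); history_y.append(evaluate(best))
```

The predictor's cost is negligible next to a single training run, so screening a thousand candidates per trial is effectively free.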
Efficient Nonmyopic Bayesian Optimization via One-Shot Multi-Step Trees
Jiang, Shali, Jiang, Daniel R., Balandat, Maximilian, Karrer, Brian, Gardner, Jacob R., Garnett, Roman
Bayesian optimization is a sequential decision making framework for optimizing expensive-to-evaluate black-box functions. Computing a full lookahead policy amounts to solving a highly intractable stochastic dynamic program. Myopic approaches, such as expected improvement, are often adopted in practice, but they ignore the long-term impact of the immediate decision. Existing nonmyopic approaches are mostly heuristic and/or computationally expensive. In this paper, we provide the first efficient implementation of general multi-step lookahead Bayesian optimization, formulated as a sequence of nested optimization problems within a multi-step scenario tree. Instead of solving these problems in a nested way, we equivalently optimize all decision variables in the full tree jointly, in a ``one-shot'' fashion. Combining this with an efficient method for implementing multi-step Gaussian process ``fantasization,'' we demonstrate that multi-step expected improvement is computationally tractable and exhibits performance superior to existing methods on a wide range of benchmarks.
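The ``one-shot'' reformulation is easiest to see in the two-step case. Writing $\mathrm{EI}(\cdot \mid \mathcal{D})$ for expected improvement under data $\mathcal{D}$ and drawing $m$ fantasy samples $f^{(j)}$ from the posterior at $x_1$, the nested problem and its joint (one-shot) approximation are (notation ours, condensed from the paper's setup):

\[
\max_{x_1} \; \mathrm{EI}(x_1 \mid \mathcal{D}) + \mathbb{E}_{y_1}\!\left[ \max_{x_2} \mathrm{EI}\big(x_2 \mid \mathcal{D} \cup \{(x_1, y_1)\}\big) \right]
\;\approx\;
\max_{x_1,\, x_2^{(1)}, \dots, x_2^{(m)}} \; \mathrm{EI}(x_1 \mid \mathcal{D}) + \frac{1}{m} \sum_{j=1}^{m} \mathrm{EI}\big(x_2^{(j)} \mid \mathcal{D} \cup \{(x_1, f^{(j)})\}\big)
\]

Lifting the inner maximizers $x_2^{(j)}$ into the outer problem removes the nesting, so all decision variables in the scenario tree can be optimized jointly with standard gradient methods.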
Efficient nonmyopic Bayesian optimization and quadrature
Jiang, Shali, Chai, Henry, Gonzalez, Javier, Garnett, Roman
Finite-horizon sequential decision problems arise naturally in many machine learning contexts; examples include Bayesian optimization and Bayesian quadrature. Computing the optimal policy for such problems requires solving Bellman equations, which are generally intractable. Most existing work resorts to myopic approximations by limiting the horizon to only a single time-step, which can perform poorly in balancing exploration and exploitation. We propose a general framework for efficient, nonmyopic approximation of the optimal policy by drawing a connection between the optimal adaptive policy and its non-adaptive counterpart. Our proposal is to compute an optimal batch of points, then select a single point from within this batch to evaluate. We realize this idea for both Bayesian optimization and Bayesian quadrature and demonstrate that our proposed method significantly outperforms common myopic alternatives on a variety of tasks.
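The proposed meta-policy reduces an intractable adaptive problem to a tractable non-adaptive one: optimize a batch that would be good if all its points were evaluated at once, then actually evaluate only one of them. A hedged sketch using BoTorch (the API names below reflect our understanding of current BoTorch, and picking the batch point with the best one-step value is an illustrative choice, not necessarily the paper's selection rule):

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import ExpectedImprovement, qExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

train_X = torch.rand(10, 2, dtype=torch.double)
train_Y = (6.28 * train_X).sin().sum(dim=1, keepdim=True)  # toy objective

model = SingleTaskGP(train_X, train_Y)
fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))

# Step 1: optimize a batch acquisition as a non-adaptive lookahead proxy.
acq = qExpectedImprovement(model, best_f=train_Y.max())
bounds = torch.stack([torch.zeros(2), torch.ones(2)]).double()
batch, _ = optimize_acqf(acq, bounds=bounds, q=4, num_restarts=8, raw_samples=64)

# Step 2: evaluate only the single batch point with the best one-step value.
ei = ExpectedImprovement(model, best_f=train_Y.max())
next_x = batch[ei(batch.unsqueeze(1)).argmax()]
```

The batch encodes where future evaluations are likely to be valuable, so the chosen single point inherits nonmyopic behavior at roughly the cost of one batch optimization.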
D-VAE: A Variational Autoencoder for Directed Acyclic Graphs
Zhang, Muhan, Jiang, Shali, Cui, Zhicheng, Garnett, Roman, Chen, Yixin
Graph-structured data are abundant in the real world. Among different graph types, directed acyclic graphs (DAGs) are of particular interest to machine learning researchers, as many machine learning models are realized as computations on DAGs, including neural networks and Bayesian networks. In this paper, we study deep generative models for DAGs and propose a novel DAG variational autoencoder (D-VAE). To encode DAGs into the latent space, we leverage graph neural networks. We propose a DAG-style asynchronous message passing scheme that encodes the computations defined by DAGs, rather than merely their structure as existing simultaneous message passing schemes do. We demonstrate the effectiveness of our proposed D-VAE through two tasks: neural architecture search and Bayesian network structure learning. Experiments show that our model not only generates novel and valid DAGs, but also produces a smooth latent space that facilitates searching for DAGs with better performance through Bayesian optimization.
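Asynchronous message passing means node states are computed in dependency order rather than in simultaneous rounds. A minimal PyTorch sketch of this encoding pattern (the GRU-based aggregation and names are our assumptions in the spirit of the paper, not its exact architecture):

```python
import torch
import torch.nn as nn

class DAGEncoder(nn.Module):
    """Encodes a DAG by updating nodes in topological order."""
    def __init__(self, node_feat_dim, hidden_dim):
        super().__init__()
        self.cell = nn.GRUCell(node_feat_dim, hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, node_feats, topo_order, predecessors):
        # node_feats: (num_nodes, node_feat_dim) tensor;
        # topo_order: node ids in topological order;
        # predecessors[v]: list of v's direct parents.
        h = {}
        for v in topo_order:  # each state sees its parents' *final* states
            if predecessors[v]:
                msg = torch.stack([h[u] for u in predecessors[v]]).sum(0)
            else:
                msg = torch.zeros(self.hidden_dim)
            h[v] = self.cell(node_feats[v].unsqueeze(0), msg.unsqueeze(0)).squeeze(0)
        return h[topo_order[-1]]  # the sink's state summarizes the computation
```

Because every node is updated exactly once, after all of its parents, the encoding respects the DAG's computational semantics, which simultaneous schemes cannot.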
Efficient nonmyopic batch active search
Jiang, Shali, Malkomes, Gustavo, Abbott, Matthew, Moseley, Benjamin, Garnett, Roman
Active search is a learning paradigm for actively identifying as many members of a given class as possible. A critical target scenario is high-throughput screening for scientific discovery, such as drug or materials discovery. In these settings, specialized instruments can often evaluate \emph{multiple} points simultaneously; however, all existing work on active search focuses on sequential acquisition. We bridge this gap, addressing batch active search from both the theoretical and practical perspective. We first derive the Bayesian optimal policy for this problem, then prove a lower bound on the performance gap between sequential and batch optimal policies: the ``cost of parallelization.'' We also propose novel, efficient batch policies inspired by state-of-the-art sequential policies, and develop an aggressive pruning technique that can dramatically speed up computation. We conduct thorough experiments on data from three application domains: a citation network, material science, and drug discovery, testing all proposed policies (14 total) with a wide range of batch sizes. Our results demonstrate that the empirical performance gap matches our theoretical bound, that nonmyopic policies usually significantly outperform myopic alternatives, and that diversity is an important consideration for batch policy design.
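One practical construction in this family of batch policies is sequential simulation: greedily pick the best point under a sequential policy, fantasize its label, and repeat until the batch is full. A minimal sketch with a soft k-NN label model (a common choice in this literature; the model, hard-label fantasies, and hyperparameters below are illustrative assumptions):

```python
import numpy as np

def posteriors(W, obs_idx, obs_y, prior=0.3, pseudo=1.0):
    """Posterior positive probability of every point under a soft k-NN model.
    W is an (n, n) nonnegative similarity matrix with zero diagonal."""
    num = pseudo * prior + W[:, obs_idx] @ obs_y
    return num / (pseudo + W[:, obs_idx].sum(axis=1))

def sequential_simulation_batch(W, obs_idx, obs_y, pool, b):
    """Build a batch of size b by repeatedly taking the myopically best point
    and fantasizing its most likely label before choosing the next one."""
    obs_idx, obs_y, pool, batch = obs_idx.copy(), obs_y.copy(), list(pool), []
    for _ in range(b):
        p = posteriors(W, obs_idx, obs_y)
        i = max(pool, key=lambda j: p[j])
        batch.append(i); pool.remove(i)
        obs_idx = np.append(obs_idx, i)
        obs_y = np.append(obs_y, float(p[i] > 0.5))  # fantasized label
    return batch
```

Fantasizing labels makes later picks aware of earlier ones, injecting the diversity that the experiments above identify as important for batch policy design.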
Efficient nonmyopic active search with applications in drug and materials discovery
Jiang, Shali, Malkomes, Gustavo, Moseley, Benjamin, Garnett, Roman
Active search is a learning paradigm for actively identifying as many members of a given class as possible. A critical target scenario is high-throughput screening for scientific discovery, such as drug or materials discovery. In this paper, we approach this problem in a Bayesian decision-theoretic framework. We first derive the Bayesian optimal policy under a natural utility, and establish the theoretical hardness of active search, proving that the optimal policy cannot be approximated within any constant ratio. We also study the batch setting for the first time, where a batch of $b>1$ points can be queried at each iteration. We give an asymptotic lower bound, linear in batch size, on the adaptivity gap: how much we could lose by querying $b$ points at a time for $t$ iterations, instead of one point at a time for $bt$ iterations. We then introduce a novel approach to nonmyopic approximation of the optimal policy that admits efficient computation. Our proposed policy automatically trades off exploration and exploitation, without relying on any tuning parameters. We also generalize our policy to the batch setting and propose two approaches to tackle the combinatorial search challenge. We evaluate our proposed policies on large datasets from drug discovery and materials science. The results demonstrate the superior performance of our proposed policy in both the sequential and batch settings; we also illustrate its nonmyopic behavior in various respects.
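The key nonmyopic approximation scores each candidate by its immediate probability plus the value of the best non-adaptive continuation, i.e., the expected sum of the top remaining posterior probabilities once its label is revealed. A minimal sketch (the soft k-NN model and hyperparameters are illustrative assumptions):

```python
import numpy as np

def posteriors(W, obs_idx, obs_y, prior=0.3, pseudo=1.0):
    """Posterior positive probability of every point under a soft k-NN model.
    W is an (n, n) nonnegative similarity matrix with zero diagonal."""
    num = pseudo * prior + W[:, obs_idx] @ obs_y
    return num / (pseudo + W[:, obs_idx].sum(axis=1))

def nonmyopic_score(i, W, obs_idx, obs_y, pool, budget):
    """Immediate probability of x_i plus the expected utility of the best
    non-adaptive continuation over the remaining budget."""
    p_i = posteriors(W, obs_idx, obs_y)[i]
    rest = [j for j in pool if j != i]
    future = 0.0
    for y, w in ((1.0, p_i), (0.0, 1.0 - p_i)):  # expectation over x_i's label
        p = posteriors(W, np.append(obs_idx, i), np.append(obs_y, y))
        future += w * np.sort(p[rest])[-(budget - 1):].sum()
    return p_i + future
```

Early on, the future term dominates and rewards informative (exploratory) queries; as the budget shrinks, the immediate term takes over and the policy exploits, yielding the parameter-free trade-off described above.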
Beyond Link Prediction: Predicting Hyperlinks in Adjacency Space
Zhang, Muhan, Cui, Zhicheng, Jiang, Shali, Chen, Yixin (Washington University in St. Louis)
This paper addresses the hyperlink prediction problem in hypernetworks. Unlike the traditional link prediction problem, where only pairwise relations are considered as links, our task is to predict the linkage of multiple nodes, i.e., a hyperlink. Each hyperlink is a set of an arbitrary number of nodes that together form a multiway relationship. Hyperlink prediction is challenging---since the cardinality of a hyperlink is variable, existing classifiers based on a fixed number of input features become infeasible. Heuristic methods, such as common neighbors and the Katz index, do not work for hyperlink prediction, since they are restricted to pairwise similarities. In this paper, we formally define the hyperlink prediction problem and propose a new algorithm called Coordinated Matrix Minimization (CMM), which alternately performs nonnegative matrix factorization and least-squares matching in the vertex adjacency space of the hypernetwork, in order to infer the subset of candidate hyperlinks most suitable for completing the training hypernetwork. We evaluate CMM on two novel tasks: predicting recipes of Chinese food and finding missing reactions in metabolic networks. Experimental results demonstrate the superior performance of our method over many seemingly promising baselines.
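The alternation is between a factorization step and a candidate-weighting step in adjacency space. A small illustrative sketch follows (the simplified symmetric-NMF updates and names are our assumptions, not the paper's exact formulation): observed hyperlinks fix part of the adjacency matrix, and weights $\lambda \in [0,1]$ on candidate hyperlinks are fit so the combined adjacency is best explained by a low-rank nonnegative factorization.

```python
import numpy as np
from scipy.optimize import lsq_linear

def cmm(U, C, rank=5, iters=20, seed=0):
    """U: (n, k) incidence matrix of observed hyperlinks (columns);
    C: (n, m) incidence matrix of candidate hyperlinks.
    Returns candidate weights lambda and the factor W."""
    rng = np.random.default_rng(seed)
    n, m = C.shape
    A_obs = U @ U.T
    # vec(u_i u_i^T) per candidate, so the lambda-step is linear least squares.
    cand = np.stack([np.outer(C[:, i], C[:, i]).ravel() for i in range(m)], axis=1)
    lam = np.full(m, 0.5)
    W = rng.random((n, rank))
    for _ in range(iters):
        A = A_obs + (cand @ lam).reshape(n, n)
        for _ in range(10):  # multiplicative updates for A ~ W W^T
            W *= (A @ W) / np.maximum(W @ (W.T @ W), 1e-9)
        resid = (W @ W.T - A_obs).ravel()
        lam = lsq_linear(cand, resid, bounds=(0, 1)).x  # least-squares matching
    return lam, W
```

Candidates whose weights converge near 1 are the hyperlinks the factorized adjacency ``wants'' in order to be self-consistent, which is exactly the completion criterion described above.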