A APPENDIX A.1 Data Preparation

Neural Information Processing Systems

Labels in the Criteo dataset indicate whether the user clicked the item or not. Table 4 reports statistics of the datasets used; a smaller relative ranking indicates a more important cross-feature. Figure 7 presents the training curves of the architecture searched by our PROFIT alongside the human-designed baselines.


Appendix for TabNAS: Rejection Sampling for Neural Architecture Search on Tabular Datasets

Yang, Chengrun, Bender, Gabriel

Neural Information Processing Systems

Due to the high costs involved, many works have proposed methods to reduce the search cost. The first strategy is to reduce the time needed to evaluate each architecture seen during a search; the second is to reduce the number of architectures that need to be evaluated. Resource constraints are prevalent in deep learning, so finding architectures with outstanding performance and low cost is important to both NAS research and its applications.


Unified Embedding: Battle-Tested Feature Representations for Web-Scale ML Systems

Coleman, Benjamin, Kang, Wang-Cheng, Fahrbach, Matthew, Wang, Ruoxi, Hong, Lichan, Chi, Ed H., Cheng, Derek Zhiyuan

arXiv.org Artificial Intelligence

Learning high-quality feature embeddings efficiently and effectively is critical for the performance of web-scale machine learning systems. A typical model ingests hundreds of features with vocabularies on the order of millions to billions of tokens. The standard approach is to represent each feature value as a d-dimensional embedding, introducing hundreds of billions of parameters for extremely high-cardinality features. This bottleneck has led to substantial progress in alternative embedding algorithms. Many of these methods, however, make the assumption that each feature uses an independent embedding table. This work introduces a simple yet highly effective framework, Feature Multiplexing, where one single representation space is used across many different categorical features. Our theoretical and empirical analysis reveals that multiplexed embeddings can be decomposed into components from each constituent feature, allowing models to distinguish between features. We show that multiplexed representations lead to Pareto-optimal parameter-accuracy tradeoffs for three public benchmark datasets. Further, we propose a highly practical approach called Unified Embedding with three major benefits: simplified feature configuration, strong adaptation to dynamic data distributions, and compatibility with modern hardware. Unified embedding gives significant improvements in offline and online metrics compared to highly competitive baselines across five web-scale search, ads, and recommender systems, where it serves billions of users across the world in industry-leading products.
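The core idea of Feature Multiplexing can be sketched in a few lines: every categorical feature looks up the same shared embedding table, with the hash salted by a feature identifier so that collision patterns differ across features. The table size, embedding dimension, and blake2b-based hash below are illustrative assumptions, not the paper's configuration:

```python
import hashlib
import numpy as np

# Minimal sketch of Feature Multiplexing: many categorical features share
# ONE embedding table. TABLE_SIZE, EMBED_DIM, and the hash scheme are
# illustrative choices for this sketch, not the paper's actual setup.
TABLE_SIZE = 1024
EMBED_DIM = 8

rng = np.random.default_rng(0)
shared_table = rng.normal(size=(TABLE_SIZE, EMBED_DIM))

def multiplexed_lookup(feature_id: int, token: str) -> np.ndarray:
    """Map a (feature, token) pair into the single shared table.

    Salting the hash with the feature id gives each feature its own
    effective hash function, so collisions differ across features.
    """
    key = f"{feature_id}:{token}".encode()
    digest = hashlib.blake2b(key, digest_size=8).digest()
    slot = int.from_bytes(digest, "big") % TABLE_SIZE
    return shared_table[slot]

# The same token string used by two different features generally lands in
# different rows, yet the parameter count stays TABLE_SIZE * EMBED_DIM
# no matter how many features are multiplexed.
e_user = multiplexed_lookup(0, "token_42")
e_item = multiplexed_lookup(1, "token_42")
```

Because the table is shared, adding a new feature adds no parameters, which is one way to read the "simplified feature configuration" benefit claimed for Unified Embedding.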


TabNAS: Rejection Sampling for Neural Architecture Search on Tabular Datasets

Yang, Chengrun, Bender, Gabriel, Liu, Hanxiao, Kindermans, Pieter-Jan, Udell, Madeleine, Lu, Yifeng, Le, Quoc, Huang, Da

arXiv.org Artificial Intelligence

The best neural architecture for a given machine learning problem depends on many factors: not only the complexity and structure of the dataset, but also on resource constraints including latency, compute, energy consumption, etc. Neural architecture search (NAS) for tabular datasets is an important but under-explored problem. Previous NAS algorithms designed for image search spaces incorporate resource constraints directly into the reinforcement learning (RL) rewards. However, for NAS on tabular datasets, this protocol often discovers suboptimal architectures. This paper develops TabNAS, a new and more effective approach to handle resource constraints in tabular NAS using an RL controller motivated by the idea of rejection sampling. TabNAS immediately discards any architecture that violates the resource constraints without training or learning from that architecture. TabNAS uses a Monte-Carlo-based correction to the RL policy gradient update to account for this extra filtering step. Results on several tabular datasets demonstrate the superiority of TabNAS over previous reward-shaping methods: it finds better models that obey the constraints.
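A toy version of the rejection-sampling idea can be written down directly: sample from a softmax controller, discard over-budget architectures without training them, and take the policy-gradient step on the distribution renormalized over the feasible set. TabNAS estimates that feasible probability mass by Monte Carlo when the search space is too large to enumerate; this sketch, with invented layer sizes, costs, and reward, computes it exactly:

```python
import numpy as np

# Illustrative rejection-sampling NAS loop in the spirit of TabNAS (not
# the authors' code). Candidate widths, the cost proxy, the budget, and
# the log-width reward are all invented for this sketch.
rng = np.random.default_rng(0)
sizes = np.array([32, 64, 128, 256, 512])  # candidate hidden-layer widths
cost = sizes.astype(float)                  # proxy resource cost
budget = 256.0                              # resource constraint

logits = np.zeros(len(sizes))
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reward(width):
    # Hypothetical quality proxy: wider helps, with diminishing returns.
    return np.log(width)

for _ in range(500):
    p = softmax(logits)
    i = rng.choice(len(sizes), p=p)
    if cost[i] > budget:
        continue  # rejected: never trained, no gradient step
    # Policy renormalized over the feasible set. TabNAS estimates this
    # feasible mass by Monte Carlo; the toy space is small enough to
    # compute it exactly here.
    q = np.where(cost <= budget, p, 0.0)
    q /= q.sum()
    # REINFORCE on log q(i): for a softmax policy, grad = e_i - q.
    grad = -q
    grad[i] += 1.0
    logits += lr * reward(sizes[i]) * grad

best = sizes[np.argmax(logits)]
```

Infeasible architectures never receive a gradient step, so the controller's probability mass drains away from them as the feasible logits grow.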


Beyond Point Estimate: Inferring Ensemble Prediction Variation from Neuron Activation Strength in Recommender Systems

Chen, Zhe, Wang, Yuyan, Lin, Dong, Cheng, Derek Zhiyuan, Hong, Lichan, Chi, Ed H., Cui, Claire

arXiv.org Machine Learning

Despite the impressive prediction performance of deep neural networks (DNNs) in various domains, it is now well known that a set of DNN models trained with the same model specification and the same data can produce very different prediction results. The ensemble method is a state-of-the-art benchmark for prediction uncertainty estimation. However, ensembles are expensive to train and serve for web-scale traffic. In this paper, we seek to advance the understanding of prediction variation estimated by the ensemble method. Through empirical experiments on two widely used recommender-system benchmark datasets, MovieLens and Criteo, we observe that prediction variations come from various randomness sources, including training data shuffling and random parameter initialization. By introducing more randomness into model training, we notice that the ensemble's mean predictions tend to be more accurate while the prediction variations tend to be higher. Moreover, we propose to infer prediction variation from neuron activation strength and demonstrate the strong predictive power of activation strength features. Our experimental results show that the average R-squared is as high as 0.56 on MovieLens and 0.81 on Criteo. Our method performs especially well when detecting the lowest and highest variation buckets, with 0.92 AUC and 0.89 AUC respectively. Our approach provides a simple way to estimate prediction variation, which opens up new opportunities for future work in many interesting areas (e.g., model-based reinforcement learning) without relying on serving expensive ensemble models.
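On fully synthetic data, the regression setup behind this idea reduces to a few lines: generate stand-in activation-strength features, a stand-in ensemble prediction variation linearly tied to them, and fit least squares. Every number here is invented, so the resulting R-squared says nothing about the paper's 0.56/0.81 results:

```python
import numpy as np

# Toy sketch (synthetic data, not the authors' pipeline): predict the
# prediction variation measured across an ensemble from per-example
# "activation strength" features of a single model, scored with R^2.
rng = np.random.default_rng(0)
n = 2000

acts = rng.normal(size=(n, 4))            # stand-in activation features
true_w = np.array([0.5, -0.3, 0.2, 0.1])  # invented ground-truth link
ens_var = acts @ true_w + 0.05 * rng.normal(size=n)  # stand-in target

# Simple train/test split and ordinary least squares.
X_train, X_test = acts[:1500], acts[1500:]
y_train, y_test = ens_var[:1500], ens_var[1500:]

w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
pred = X_test @ w

ss_res = float(((y_test - pred) ** 2).sum())
ss_tot = float(((y_test - y_test.mean()) ** 2).sum())
r2 = 1.0 - ss_res / ss_tot
```

The appeal of the approach is exactly this cheapness: once the regressor is fit, estimating variation requires only one forward pass of a single model rather than serving the whole ensemble.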


r/MachineLearning - [N] Laplace's Demon: A Seminar Series about Bayesian Machine Learning at Scale

#artificialintelligence

We have recently launched an ongoing online seminar series about Bayesian machine learning at scale. The intended audience includes machine learning practitioners and statisticians from academia and industry. Registration is now open for Jake Hofman's 17 June talk: "How visualizing inferential uncertainty can mislead readers about treatment effects in scientific results". Jake is a Senior Principal Researcher at Microsoft Research, New York. The talk is at 15.00 UTC this Wednesday, June 17; to see it in your local time zone, please go to the registration page.


Optimization Approaches for Counterfactual Risk Minimization with Continuous Actions

Zenati, Houssam, Bietti, Alberto, Martin, Matthieu, Diemert, Eustache, Mairal, Julien

arXiv.org Machine Learning

Counterfactual reasoning from logged data has become increasingly important for a large range of applications such as web advertising and healthcare. In this paper, we address the problem of counterfactual risk minimization for learning a stochastic policy with a continuous action space. Whereas previous works have mostly focused on deriving statistical estimators with importance sampling, we show that the optimization perspective is equally important for solving the resulting nonconvex optimization problems. Specifically, we demonstrate the benefits of proximal point algorithms and soft-clipping estimators, which are more amenable to gradient-based optimization than classical hard clipping. We propose multiple synthetic, yet realistic, evaluation setups, and we release a new large-scale dataset based on web advertising data for this problem, which crucially lacks public benchmarks.
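The contrast between hard and soft clipping can be illustrated with a one-dimensional Gaussian logging policy. The `M * tanh(w / M)` soft clip below is one smooth, bounded weight transform chosen for illustration; it is not necessarily the exact soft-clipping function of the paper, and the policies, costs, and clipping level are likewise invented:

```python
import numpy as np

# Sketch of clipped importance-weighted (IPS) off-policy risk estimation.
# Logging policy pi0 = N(0, 1); costs are synthetic; M is the clip level.
rng = np.random.default_rng(0)
n, M = 5000, 10.0

actions = rng.normal(0.0, 1.0, size=n)  # logged actions a ~ pi0
costs = (actions - 1.0) ** 2            # synthetic cost (lower is better)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def ips(mu, clip):
    """Clipped IPS estimate of the risk of the target policy N(mu, 1)."""
    w = gaussian_pdf(actions, mu, 1.0) / gaussian_pdf(actions, 0.0, 1.0)
    return float(np.mean(costs * clip(w)))

hard = lambda w: np.minimum(w, M)       # classical hard clipping
soft = lambda w: M * np.tanh(w / M)     # smooth, bounded alternative

risk_hard = ips(1.0, hard)  # evaluate the target policy N(1, 1)
risk_soft = ips(1.0, soft)
```

Hard clipping has a zero gradient wherever `w > M`, whereas the smooth transform keeps a nonzero gradient everywhere, which is the kind of property that makes soft-clipped estimators friendlier to gradient-based policy optimization.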