A Finite-Sample Deviation Bound for Stable Autoregressive Processes
González, Rodrigo A., Rojas, Cristian R.
In this paper, we study non-asymptotic deviation bounds of the least squares estimator in Gaussian AR($n$) processes. By relying on martingale concentration inequalities and a tail-bound for $\chi^2$ distributed variables, we provide a concentration bound for the sample covariance matrix of the process output. With this, we present a problem-dependent finite-time bound on the deviation probability of any fixed linear combination of the estimated parameters of the AR$(n)$ process. We discuss extensions and limitations of our approach.
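The setting in this abstract can be illustrated with a minimal sketch, assuming a stable AR($n$) model $y_t = \theta_1 y_{t-1} + \dots + \theta_n y_{t-n} + e_t$ with Gaussian noise $e_t$. The function names `simulate_ar` and `least_squares_ar`, the coefficient values, and the weight vector are hypothetical choices made for illustration; they are not taken from the paper, and the sketch does not reproduce the paper's bound.

```python
import numpy as np

def simulate_ar(theta, num_samples, noise_std=1.0, seed=0):
    """Simulate y_t = theta[0]*y_{t-1} + ... + theta[n-1]*y_{t-n} + e_t, e_t Gaussian."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    n = len(theta)
    y = np.zeros(num_samples + n)  # first n entries are zero initial conditions
    for t in range(n, num_samples + n):
        past = y[t - n:t][::-1]  # y_{t-1}, ..., y_{t-n}
        y[t] = theta @ past + noise_std * rng.standard_normal()
    return y[n:]

def least_squares_ar(y, n):
    """Least-squares estimate of the AR(n) coefficients from one trajectory."""
    # Column k-1 holds the lagged output y_{t-k} for t = n, ..., len(y)-1.
    regressors = np.column_stack([y[n - k:len(y) - k] for k in range(1, n + 1)])
    targets = y[n:]
    theta_hat, *_ = np.linalg.lstsq(regressors, targets, rcond=None)
    return theta_hat

# Assumed example: a stable AR(2) process and a fixed linear combination w.
theta_true = np.array([0.5, -0.2])
w = np.array([1.0, 1.0])
y = simulate_ar(theta_true, num_samples=20000)
theta_hat = least_squares_ar(y, n=2)
deviation = abs(w @ (theta_hat - theta_true))  # quantity whose tail a deviation bound controls
```

Repeating the simulation over many seeds and recording how often `deviation` exceeds a threshold gives an empirical deviation probability that a finite-sample bound of this kind could be compared against.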
Sim-to-Real Domain Adaptation For High Energy Physics
Baalouch, Marouen, Defurne, Maxime, Poli, Jean-Philippe, Cherrier, Noëlie
Particle physics, or High Energy Physics (HEP), studies the elementary constituents of matter and their interactions with each other. Machine Learning (ML) has played an important role in HEP analysis and has proven extremely successful in this area. Usually, the ML algorithms are trained on numerical simulations of the experimental setup and then applied to the real experimental data. However, any discrepancy between the simulation and real data may have dramatic consequences for the performance of the algorithm on real data. In this paper, we present an application of domain adaptation using a Domain Adversarial Neural Network trained on public HEP data. We demonstrate the success of this approach in achieving sim-to-real transfer and in keeping the performance of the ML algorithms consistent across real and simulated HEP datasets.
Embedded Constrained Feature Construction for High-Energy Physics Data Classification
Cherrier, Noëlie, Defurne, Maxime, Poli, Jean-Philippe, Sabatié, Franck
Before any publication, data analysis of high-energy physics experiments must be validated. This validation is granted only if a perfect understanding of the data and the analysis process is demonstrated. Therefore, physicists prefer using transparent machine learning algorithms, whose performance relies heavily on the suitability of the provided input features. To transform the feature space, feature construction aims at automatically generating new relevant features. Whereas most previous work in this area performs feature construction prior to model training, we propose a general framework that embeds a feature construction technique, adapted to the constraints of high-energy physics, in the induction of tree-based models. Experiments on two high-energy physics datasets confirm that a significant gain is obtained on the classification scores, while limiting the number of built features. Since the features are built to be interpretable, the whole model is transparent and readable.
Function Naming in Stripped Binaries Using Neural Networks
Artuso, Fiorella, Di Luna, Giuseppe Antonio, Massarelli, Luca, Querzoni, Leonardo
In this paper we investigate the problem of automatically naming pieces of assembly code, where by naming we mean assigning to a portion of code the string of words that would likely be assigned by a human reverse engineer. We formally and precisely define the framework in which our investigation takes place. That is, we define the problem, we provide reasonable justifications for the choices that we made when designing the training and test steps, and we perform a statistical analysis of function names in a large real-world corpus of over 4 million functions. In this framework we test several baselines coming from the field of NLP (e.g., Seq2Seq networks and transformers). Moreover, we provide a set of tailored solutions that beat the aforementioned baselines. The last few years have witnessed a growing trend of applying machine learning (ML) and natural language processing (NLP) techniques to code, as illustrated in [14].
Analyzing Privacy Loss in Updates of Natural Language Models
Tople, Shruti, Brockschmidt, Marc, Köpf, Boris, Ohrimenko, Olga, Zanella-Béguelin, Santiago
To continuously improve quality and reflect changes in data, machine learning-based services have to regularly re-train and update their core models. In the setting of language models, we show that a comparative analysis of model snapshots before and after an update can reveal a surprising amount of detailed information about the changes in the data used for training before and after the update. We discuss the privacy implications of our findings, propose mitigation strategies and evaluate their effect.
Asynchronous Federated Learning with Differential Privacy for Edge Intelligence
Li, Yanan, Yang, Shusen, Ren, Xuebin, Zhao, Cong
Federated learning has shown promise as an approach to paving the last mile of artificial intelligence, owing to its great potential for solving the data isolation problem in large-scale machine learning. In particular, given the heterogeneity of practical edge computing systems, federated learning based on asynchronous edge-cloud collaboration can further improve learning efficiency by significantly reducing the straggler effect. Although no raw data is shared, the open architecture and extensive collaborations of asynchronous federated learning (AFL) still give malicious participants great opportunities to infer other parties' training data, which raises serious privacy concerns. To achieve a rigorous privacy guarantee with high utility, we investigate securing asynchronous edge-cloud collaborative federated learning with differential privacy (DP), focusing on the impact of differential privacy on the model convergence of AFL. Formally, we give the first analysis of the model convergence of AFL under DP and propose a multistage adjustable private algorithm (MAPA) to improve the trade-off between model utility and privacy by dynamically adjusting both the noise scale and the learning rate. Through extensive simulations and real-world experiments with an edge-cloud testbed, we demonstrate that MAPA significantly improves both the model accuracy and convergence speed with a sufficient privacy guarantee. Index Terms: Distributed machine learning, Federated learning, Asynchronous learning, Differential privacy, Convergence. However, with the increasing public awareness of privacy, more and more people are reluctant to provide their own data [7]-[9]. At the same time, large companies and organizations have also begun to realize that curated data is a core asset with abundant business value [10], [11].
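The two quantities this abstract says are adjusted, the noise scale and the learning rate, can be seen in a generic differentially private gradient step. The following is a minimal sketch, assuming the common clip-then-add-Gaussian-noise construction; it is not the MAPA algorithm, and the function name `private_update` and all parameter names are hypothetical. The privacy level such a step yields depends on an accounting analysis that is not shown here.

```python
import numpy as np

def private_update(weights, gradient, learning_rate, clip_norm, noise_std, rng):
    """One noisy gradient step: clip the gradient, add Gaussian noise, then descend."""
    grad_norm = np.linalg.norm(gradient)
    # Scale the gradient down so its L2 norm is at most clip_norm.
    clipped = gradient * min(1.0, clip_norm / max(grad_norm, 1e-12))
    # Gaussian perturbation; noise_std is the "noise scale" a schedule could vary.
    noisy = clipped + rng.normal(0.0, noise_std, size=gradient.shape)
    return weights - learning_rate * noisy

# Assumed example values, chosen only for illustration.
rng = np.random.default_rng(0)
weights = np.zeros(2)
gradient = np.array([3.0, 4.0])
weights = private_update(weights, gradient, learning_rate=0.5,
                         clip_norm=1.0, noise_std=0.1, rng=rng)
```

A multistage scheme in this spirit would call `private_update` with different `noise_std` and `learning_rate` values at different stages of training; how those values are chosen is the substance of the paper and is not reproduced in this sketch.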
Incorporating Unlabeled Data into Distributionally Robust Learning
Frogner, Charlie, Claici, Sebastian, Chien, Edward, Solomon, Justin
We study a robust alternative to empirical risk minimization called distributionally robust learning (DRL), in which one learns to perform against an adversary who can choose the data distribution from a specified set of distributions. We illustrate a problem with current DRL formulations, which rely on an overly broad definition of allowed distributions for the adversary, leading to learned classifiers that are unable to predict with any confidence. We propose a solution that incorporates unlabeled data into the DRL problem to further constrain the adversary. We show that this new formulation is tractable for stochastic gradient-based optimization and yields a computable guarantee on the future performance of the learned classifier, analogous to -- but tighter than -- guarantees from conventional DRL. We examine the performance of this new formulation on 14 real datasets and find that it often yields effective classifiers with nontrivial performance guarantees in situations where conventional DRL produces neither. Inspired by these results, we extend our DRL formulation to active learning with a novel, distributionally-robust version of the standard model-change heuristic. Our active learning algorithm often achieves superior learning performance to the original heuristic on real datasets.
Scalability in Perception for Autonomous Driving: Waymo Open Dataset
Sun, Pei, Kretzschmar, Henrik, Dotiwalla, Xerxes, Chouard, Aurelien, Patnaik, Vijaysai, Tsui, Paul, Guo, James, Zhou, Yin, Chai, Yuning, Caine, Benjamin, Vasudevan, Vijay, Han, Wei, Ngiam, Jiquan, Zhao, Hang, Timofeev, Aleksei, Ettinger, Scott, Krivokon, Maxim, Gao, Amy, Joshi, Aditya, Zhang, Yu, Shlens, Jonathon, Chen, Zhifeng, Anguelov, Dragomir
The research community has increasing interest in autonomous driving research, despite the resource intensity of obtaining representative real world data. Existing self-driving datasets are limited in the scale and variation of the environments they capture, even though generalization within and between operating regions is crucial to the overall viability of the technology. In an effort to help align the research community's contributions with real-world self-driving problems, we introduce a new large scale, high quality, diverse dataset. Our new dataset consists of 1150 scenes that each span 20 seconds, consisting of well synchronized and calibrated high quality LiDAR and camera data captured across a range of urban and suburban geographies. It is 15x more diverse than the largest camera+LiDAR dataset available based on our proposed diversity metric. We exhaustively annotated this data with 2D (camera image) and 3D (LiDAR) bounding boxes, with consistent identifiers across frames. Finally, we provide strong baselines for 2D as well as 3D detection and tracking tasks. We further study the effects of dataset size and generalization across geographies on 3D detection methods. Find data, code and more up-to-date information at http://www.waymo.com/open.
Uncovering Relations for Marketing Knowledge Representation
Online behaviors of consumers and marketers generate massive marketing data, which ever more sophisticated models attempt to turn into insights that aid marketers' decisions. Yet, in making decisions, human managers bring to bear marketing knowledge that resides outside of data and models. This calls for the creation of an automated marketing knowledge base that can interact with data and models. Currently, marketing knowledge is dispersed in large corpora, but no definitive knowledge base for marketing exists. Of the two broad aspects of marketing knowledge, representation and reasoning, this treatise focuses on the former. Specifically, we focus on creating a marketing knowledge graph from corpora, which requires identification of entities and relations. The relation identification task is particularly challenging in marketing because of the non-factoid nature of much marketing knowledge and the difficulty of forming rules that govern relations. To this end, we define a set of relations to capture marketing knowledge, propose a pipeline for creating the knowledge graph from text, and propose a rule-guided semi-supervised relation prediction algorithm to extract relations between marketing entities from sentences.