Function Naming in Stripped Binaries Using Neural Networks
Artuso, Fiorella, Di Luna, Giuseppe Antonio, Massarelli, Luca, Querzoni, Leonardo
Abstract--In this paper we investigate the problem of automatically naming pieces of assembly code, where by naming we mean assigning to a portion of code the string of words that would likely be assigned by a human reverse engineer. We formally and precisely define the framework in which our investigation takes place. That is, we define the problem, we provide reasonable justifications for the choices that we made when designing the training and test steps, and we perform a statistical analysis of function names in a large real-world corpus of over 4 million functions. In this framework we test several baselines coming from the field of NLP (e.g., Seq2Seq networks and transformers). Moreover, we provide a set of tailored solutions that beat the aforementioned baselines. The last few years have witnessed a growing trend of applying machine learning (ML) and natural language processing (NLP) techniques to code, as illustrated in [14].
Analyzing Privacy Loss in Updates of Natural Language Models
Tople, Shruti, Brockschmidt, Marc, Köpf, Boris, Ohrimenko, Olga, Zanella-Béguelin, Santiago
To continuously improve quality and reflect changes in data, machine learning-based services have to regularly re-train and update their core models. In the setting of language models, we show that a comparative analysis of model snapshots before and after an update can reveal a surprising amount of detailed information about the changes in the data used for training before and after the update. We discuss the privacy implications of our findings, propose mitigation strategies and evaluate their effect.
Asynchronous Federated Learning with Differential Privacy for Edge Intelligence
Li, Yanan, Yang, Shusen, Ren, Xuebin, Zhao, Cong
Abstract--Federated learning has been shown to be a promising approach for paving the last mile of artificial intelligence, due to its great potential for solving the data isolation problem in large-scale machine learning. In particular, considering the heterogeneity of practical edge computing systems, federated learning based on asynchronous edge-cloud collaboration can further improve learning efficiency by significantly reducing the straggler effect. Although no raw data is shared, the open architecture and extensive collaborations of asynchronous federated learning (AFL) still give malicious participants great opportunities to infer other parties' training data, leading to serious privacy concerns. To achieve a rigorous privacy guarantee with high utility, we investigate securing asynchronous edge-cloud collaborative federated learning with differential privacy (DP), focusing on the impact of differential privacy on the model convergence of AFL. Formally, we give the first analysis of the model convergence of AFL under DP and propose a multi-stage adjustable private algorithm (MAPA) to improve the trade-off between model utility and privacy by dynamically adjusting both the noise scale and the learning rate. Through extensive simulations and real-world experiments with an edge-cloud testbed, we demonstrate that MAPA significantly improves both model accuracy and convergence speed with a sufficient privacy guarantee. Index Terms--Distributed machine learning, Federated learning, Asynchronous learning, Differential privacy, Convergence. However, with increasing public awareness of privacy, more and more people are reluctant to provide their own data [7]-[9]. At the same time, large companies and organizations have also begun to realize that curated data is a core asset with abundant business value [10], [11].
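The interplay between noise scale and learning rate mentioned in the abstract above can be illustrated with a generic differentially private gradient step. This is a minimal sketch of the standard clip-then-add-Gaussian-noise update, not the MAPA algorithm itself. The names `clip` and `noisy_step`, the parameter `max_norm`, and the idea of passing a different `learning_rate` and `noise_scale` per stage are assumptions made for illustration only.

```python
import math
import random


def clip(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def noisy_step(weights, grad, learning_rate, noise_scale, max_norm=1.0, rng=random):
    """One gradient step with clipping and Gaussian noise (hypothetical helper).

    The noise standard deviation is noise_scale * max_norm, so a caller can
    vary noise_scale and learning_rate from one training stage to the next.
    """
    clipped = clip(grad, max_norm)
    noisy = [g + rng.gauss(0.0, noise_scale * max_norm) for g in clipped]
    return [w - learning_rate * g for w, g in zip(weights, noisy)]
```

A caller could, for instance, iterate over a list of `(learning_rate, noise_scale)` pairs, one per stage, and apply `noisy_step` with the current pair. How those pairs should be chosen is exactly what the paper analyses and is not reproduced here.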
Incorporating Unlabeled Data into Distributionally Robust Learning
Frogner, Charlie, Claici, Sebastian, Chien, Edward, Solomon, Justin
We study a robust alternative to empirical risk minimization called distributionally robust learning (DRL), in which one learns to perform against an adversary who can choose the data distribution from a specified set of distributions. We illustrate a problem with current DRL formulations, which rely on an overly broad definition of allowed distributions for the adversary, leading to learned classifiers that are unable to predict with any confidence. We propose a solution that incorporates unlabeled data into the DRL problem to further constrain the adversary. We show that this new formulation is tractable for stochastic gradient-based optimization and yields a computable guarantee on the future performance of the learned classifier, analogous to -- but tighter than -- guarantees from conventional DRL. We examine the performance of this new formulation on 14 real datasets and find that it often yields effective classifiers with nontrivial performance guarantees in situations where conventional DRL produces neither. Inspired by these results, we extend our DRL formulation to active learning with a novel, distributionally-robust version of the standard model-change heuristic. Our active learning algorithm often achieves superior learning performance to the original heuristic on real datasets.
Scalability in Perception for Autonomous Driving: Waymo Open Dataset
Sun, Pei, Kretzschmar, Henrik, Dotiwalla, Xerxes, Chouard, Aurelien, Patnaik, Vijaysai, Tsui, Paul, Guo, James, Zhou, Yin, Chai, Yuning, Caine, Benjamin, Vasudevan, Vijay, Han, Wei, Ngiam, Jiquan, Zhao, Hang, Timofeev, Aleksei, Ettinger, Scott, Krivokon, Maxim, Gao, Amy, Joshi, Aditya, Zhang, Yu, Shlens, Jonathon, Chen, Zhifeng, Anguelov, Dragomir
The research community has increasing interest in autonomous driving research, despite the resource intensity of obtaining representative real world data. Existing self-driving datasets are limited in the scale and variation of the environments they capture, even though generalization within and between operating regions is crucial to the overall viability of the technology. In an effort to help align the research community's contributions with real-world self-driving problems, we introduce a new large scale, high quality, diverse dataset. Our new dataset consists of 1150 scenes that each span 20 seconds, consisting of well synchronized and calibrated high quality LiDAR and camera data captured across a range of urban and suburban geographies. It is 15x more diverse than the largest camera+LiDAR dataset available based on our proposed diversity metric. We exhaustively annotated this data with 2D (camera image) and 3D (LiDAR) bounding boxes, with consistent identifiers across frames. Finally, we provide strong baselines for 2D as well as 3D detection and tracking tasks. We further study the effects of dataset size and generalization across geographies on 3D detection methods. Find data, code and more up-to-date information at http://www.waymo.com/open.
Uncovering Relations for Marketing Knowledge Representation
Online behaviors of consumers and marketers generate massive marketing data, which ever more sophisticated models attempt to turn into insights that aid marketers' decisions. Yet, in making decisions, human managers bring to bear marketing knowledge that resides outside of data and models. Thus, there is a need for an automated marketing knowledge base that can interact with data and models. Currently, marketing knowledge is dispersed in large corpora, but no definitive knowledge base for marketing exists. Of the two broad aspects of marketing knowledge, representation and reasoning, this treatise focuses on the former. Specifically, we focus on the creation of a marketing knowledge graph from corpora, which requires the identification of entities and relations. The relation identification task is particularly challenging in marketing because of the non-factoid nature of much marketing knowledge and the difficulty of forming rules that govern relations. To address this, we define a set of relations to capture marketing knowledge, propose a pipeline for creating the knowledge graph from text, and propose a rule-guided semi-supervised relation prediction algorithm to extract relations between marketing entities from sentences.
Causality matters in medical imaging
Castro, Daniel C., Walker, Ian, Glocker, Ben
This article discusses how the language of causality can shed new light on the major challenges in machine learning for medical imaging: 1) data scarcity, which is the limited availability of high-quality annotations, and 2) data mismatch, whereby a trained algorithm may fail to generalize in clinical practice. Looking at these challenges through the lens of causality allows decisions about data collection, annotation procedures, and learning strategies to be made (and scrutinized) more transparently. We discuss how causal relationships between images and annotations can not only have profound effects on the performance of predictive models, but may even dictate which learning strategies should be considered in the first place. For example, we conclude that semi-supervision may be unsuitable for image segmentation---one of the possibly surprising insights from our causal analysis, which is illustrated with representative real-world examples of computer-aided diagnosis (skin lesion classification in dermatology) and radiotherapy (automated contouring of tumours). We highlight that being aware of and accounting for the causal relationships in medical imaging data is important for the safe development of machine learning and essential for regulation and responsible reporting. To facilitate this we provide step-by-step recommendations for future studies.
An Embarrassingly Simple Baseline for eXtreme Multi-label Prediction
The goal of eXtreme Multi-label Learning (XML) is to design and learn a model that can automatically annotate a given data point with the most relevant subset of labels from an extremely large label set. Recently, many techniques have been proposed for XML that achieve reasonable performance on benchmark datasets. Motivated by the complexity of these methods and their training requirements, in this paper we propose a simple baseline technique for this task. Specifically, we present a global feature embedding technique for XML that can easily scale to very large datasets containing millions of data points in a very high-dimensional feature space, irrespective of the number of samples and labels. Next, we show how an ensemble of such global embeddings can be used to achieve a further boost in prediction accuracy with only a linear increase in training and prediction time. During testing, we assign the labels using a weighted k-nearest neighbour classifier in the embedding space. Experiments reveal that, though conceptually simple, this technique achieves quite competitive results and has a training time of less than one minute using a single CPU core with 15.6 GB RAM, even for large-scale datasets such as Amazon-3M.
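The test-time step described above, assigning labels with a weighted k-nearest neighbour classifier in the embedding space, can be sketched as follows. This is an illustrative sketch only: the abstract does not specify the similarity measure or weighting scheme, so the use of cosine similarity as the vote weight, the function name `weighted_knn_labels`, and the parameters `k` and `top` are all assumptions.

```python
import math
from collections import defaultdict


def weighted_knn_labels(query, train_embeddings, train_labels, k=3, top=2):
    """Rank labels by similarity-weighted votes from the k nearest training points.

    Assumption: cosine similarity serves both as the neighbour ranking
    criterion and as the weight of each neighbour's vote.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    # Similarity of the query to every training embedding, most similar first.
    sims = sorted(
        ((cosine(query, emb), i) for i, emb in enumerate(train_embeddings)),
        reverse=True,
    )
    scores = defaultdict(float)
    for sim, i in sims[:k]:
        for label in train_labels[i]:
            scores[label] += sim  # each neighbour votes with weight = similarity

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in ranked[:top]]
```

With millions of points, the exhaustive scan over `train_embeddings` would normally be replaced by an approximate nearest-neighbour index. The brute-force loop is kept here only to make the voting logic visible.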
Putting Ridesharing to the Test: Efficient and Scalable Solutions and the Power of Dynamic Vehicle Relocation
Danassis, Panayiotis, Sakota, Marija, Filos-Ratsikas, Aris, Faltings, Boi
Ridesharing is a coordination problem at its core. Traditionally it has been solved in a centralized manner by ridesharing platforms. Yet, to truly allow for scalable solutions, we need to shift from traditional approaches to multi-agent systems, ideally run on-device. In this paper, we show that a recently proposed heuristic (ALMA), which exhibits such properties, offers an efficient, end-to-end solution for the ridesharing problem. Moreover, by utilizing simple relocation schemes we significantly improve QoS metrics, by up to 50%. To demonstrate the latter, we perform a systematic evaluation of a diverse set of algorithms for the ridesharing problem, which is, to the best of our knowledge, one of the largest and most comprehensive to date. Our evaluation setting is specifically designed to resemble reality as closely as possible. In particular, we evaluate 12 different algorithms over 12 metrics related to global efficiency, complexity, and passenger, driver, and platform incentives.