AITopics

Abstract--In this paper we investigate the problem of automatically naming pieces of assembly code, where by naming we mean assigning to a portion of code the string of words that would likely be assigned by a human reverse engineer. We formally and precisely define the framework in which our investigation takes place. That is, we define the problem, we provide reasonable justifications for the choices that we made in designing the training and test steps, and we perform a statistical analysis of function names in a large real-world corpus of over 4 million functions. In this framework we test several baselines coming from the field of NLP (e.g., Seq2Seq networks and transformers). Moreover, we provide a set of tailored solutions that beat the aforementioned baselines. The last few years have witnessed a growing trend of applying machine learning (ML) and natural language processing (NLP) techniques to code, as illustrated in [14].

dataset, function name, instruction, (17 more...)

1912.07946

Country: Africa > Middle East > Egypt > Aswan Governorate > Aswan (0.04)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)

Tople, Shruti, Brockschmidt, Marc, Köpf, Boris, Ohrimenko, Olga, Zanella-Béguelin, Santiago

Analyzing Privacy Loss in Updates of Natural Language Models

To continuously improve quality and reflect changes in data, machine learning-based services have to regularly re-train and update their core models. In the setting of language models, we show that a comparative analysis of model snapshots before and after an update can reveal a surprising amount of detailed information about the changes in the data used for training before and after the update. We discuss the privacy implications of our findings, propose mitigation strategies and evaluate their effect.

dataset, differential score, sequence, (17 more...)

1912.07942

Country:

Asia > Middle East > Republic of Türkiye (0.05)
North America > United States > Minnesota (0.05)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(4 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Leisure & Entertainment > Sports > Hockey (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Asynchronous Federated Learning with Differential Privacy for Edge Intelligence

Li, Yanan, Yang, Shusen, Ren, Xuebin, Zhao, Cong

Abstract--Federated learning has been shown to be a promising approach for paving the last mile of artificial intelligence, due to its great potential for solving the data isolation problem in large-scale machine learning. In particular, considering the heterogeneity of practical edge computing systems, asynchronous edge-cloud collaboration based federated learning can further improve learning efficiency by significantly reducing the straggler effect. Although no raw data is shared, the open architecture and extensive collaborations of asynchronous federated learning (AFL) still give malicious participants great opportunities to infer other parties' training data, leading to serious privacy concerns. To achieve a rigorous privacy guarantee with high utility, we investigate securing asynchronous edge-cloud collaborative federated learning with differential privacy (DP), focusing on the impact of differential privacy on the model convergence of AFL. Formally, we give the first analysis of the model convergence of AFL under DP and propose a multi-stage adjustable private algorithm (MAPA) to improve the trade-off between model utility and privacy by dynamically adjusting both the noise scale and the learning rate. Through extensive simulations and real-world experiments with an edge-cloud testbed, we demonstrate that MAPA significantly improves both model accuracy and convergence speed with a sufficient privacy guarantee. Index Terms--Distributed machine learning, Federated learning, Asynchronous learning, Differential privacy, Convergence. However, with the increasing public awareness of privacy, more and more people are reluctant to provide their own data [7]-[9]. At the same time, large companies and organizations have also begun to realize that curated data is a core asset with abundant business value [10], [11].
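As a rough illustration of the idea of adjusting the noise scale and learning rate together, the sketch below performs a generic differentially private gradient step: clip the gradient, add Gaussian noise, then descend. It is not the MAPA algorithm from the paper. The function names, the two-stage schedule, and all numeric values are hypothetical, and no privacy accounting is done.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def private_step(weights, grad, lr, noise_scale, max_norm, rng):
    """One noisy gradient step: clip, add Gaussian noise, then descend."""
    clipped = clip_gradient(grad, max_norm)
    # Noise standard deviation is proportional to the clipping bound.
    noisy = [g + rng.gauss(0.0, noise_scale * max_norm) for g in clipped]
    return [w - lr * g for w, g in zip(weights, noisy)]


# Hypothetical per-stage schedule of (learning rate, noise scale).
# These values are placeholders, not taken from the paper.
schedule = [(0.1, 1.0), (0.05, 0.5)]

rng = random.Random(0)
weights = [1.0, -2.0]
for lr, noise_scale in schedule:
    grad = [2.0 * w for w in weights]  # gradient of sum(w_i^2)
    weights = private_step(weights, grad, lr, noise_scale, max_norm=1.0, rng=rng)
```

In a real system the schedule would have to be chosen alongside a privacy budget; here it only shows where the two knobs enter the update.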

cloud server, edge server, server, (13 more...)

1912.07902

Country:

North America > United States (0.14)
Asia > China > Shaanxi Province > Xi'an (0.04)
Europe > United Kingdom (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Frogner, Charlie, Claici, Sebastian, Chien, Edward, Solomon, Justin

Incorporating Unlabeled Data into Distributionally Robust Learning

We study a robust alternative to empirical risk minimization called distributionally robust learning (DRL), in which one learns to perform against an adversary who can choose the data distribution from a specified set of distributions. We illustrate a problem with current DRL formulations, which rely on an overly broad definition of allowed distributions for the adversary, leading to learned classifiers that are unable to predict with any confidence. We propose a solution that incorporates unlabeled data into the DRL problem to further constrain the adversary. We show that this new formulation is tractable for stochastic gradient-based optimization and yields a computable guarantee on the future performance of the learned classifier, analogous to -- but tighter than -- guarantees from conventional DRL. We examine the performance of this new formulation on 14 real datasets and find that it often yields effective classifiers with nontrivial performance guarantees in situations where conventional DRL produces neither. Inspired by these results, we extend our DRL formulation to active learning with a novel, distributionally-robust version of the standard model-change heuristic. Our active learning algorithm often achieves superior learning performance to the original heuristic on real datasets.
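For orientation, the adversarial setup described above is commonly written as a minimax problem. The form below is the generic textbook one, not necessarily the exact formulation in the paper. The symbols are assumptions: $\theta$ for the model parameters, $\ell$ for the loss, and $\mathcal{U}$ for the set of distributions the adversary may choose from.

```latex
\min_{\theta} \; \sup_{Q \in \mathcal{U}} \;
  \mathbb{E}_{(x, y) \sim Q} \bigl[ \ell(\theta; x, y) \bigr]
```

Read this way, the abstract's proposal amounts to shrinking $\mathcal{U}$ using unlabeled data, so that the inner supremum is taken over fewer distributions.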

data distribution, incorporating unlabeled data, learning, (14 more...)

1912.07729

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > Wisconsin (0.04)
North America > United States > New York (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (1.00)

Industry: Information Technology (0.92)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Scalability in Perception for Autonomous Driving: Waymo Open Dataset

Sun, Pei, Kretzschmar, Henrik, Dotiwalla, Xerxes, Chouard, Aurelien, Patnaik, Vijaysai, Tsui, Paul, Guo, James, Zhou, Yin, Chai, Yuning, Caine, Benjamin, Vasudevan, Vijay, Han, Wei, Ngiam, Jiquan, Zhao, Hang, Timofeev, Aleksei, Ettinger, Scott, Krivokon, Maxim, Gao, Amy, Joshi, Aditya, Zhang, Yu, Shlens, Jonathon, Chen, Zhifeng, Anguelov, Dragomir

The research community has increasing interest in autonomous driving research, despite the resource intensity of obtaining representative real world data. Existing self-driving datasets are limited in the scale and variation of the environments they capture, even though generalization within and between operating regions is crucial to the overall viability of the technology. In an effort to help align the research community's contributions with real-world self-driving problems, we introduce a new large scale, high quality, diverse dataset. Our new dataset consists of 1150 scenes that each span 20 seconds, consisting of well synchronized and calibrated high quality LiDAR and camera data captured across a range of urban and suburban geographies. It is 15x more diverse than the largest camera+LiDAR dataset available based on our proposed diversity metric. We exhaustively annotated this data with 2D (camera image) and 3D (LiDAR) bounding boxes, with consistent identifiers across frames. Finally, we provide strong baselines for 2D as well as 3D detection and tracking tasks. We further study the effects of dataset size and generalization across geographies on 3D detection methods. Find data, code and more up-to-date information at http://www.waymo.com/open.

camera image, coordinate system, dataset, (13 more...)

1912.04838

Country: North America > United States > California > San Francisco County > San Francisco (0.05)

Genre: Research Report (0.82)

Industry:

Transportation > Ground > Road (0.86)
Information Technology > Robotics & Automation (0.86)
Automobiles & Trucks (0.86)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)

Aditya, Somak, Sinha, Atanu

Uncovering Relations for Marketing Knowledge Representation

Online behaviors of consumers and marketers generate massive marketing data, which ever more sophisticated models attempt to turn into insights that aid marketers' decisions. Yet, in making decisions, human managers bring to bear marketing knowledge which resides outside of data and models. Thus, it behooves us to create an automated marketing knowledge base that can interact with data and models. Currently, marketing knowledge is dispersed in large corpora, but no definitive knowledge base for marketing exists. Out of the two broad aspects of marketing knowledge - representation and reasoning - this treatise focuses on the former. Specifically, we focus on the creation of a marketing knowledge graph from corpora, which requires identification of entities and relations. The relation identification task is particularly challenging in marketing because of the non-factoid nature of much marketing knowledge and the difficulty of forming rules that govern relations. To this end, we define a set of relations to capture marketing knowledge, propose a pipeline for creating the knowledge graph from text, and propose a rule-guided semi-supervised relation prediction algorithm to extract relations between marketing entities from sentences.

knowledge, knowledge graph, relation, (15 more...)

1912.08374

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Germany > Berlin (0.04)
Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)

Genre: Research Report (0.40)

Industry: Marketing (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.91)
(2 more...)

srlearn: A Python Library for Gradient-Boosted Statistical Relational Models

Hayes, Alexander L.

We present srlearn, a Python library for boosted statistical relational models. We adapt the scikit-learn interface to this setting and provide examples for how this can be used to express learning and inference problems.

hyperparameter, learning, srlearn, (11 more...)

1912.08198

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
North America > United States > Indiana (0.05)
North America > United States > Texas (0.04)
(2 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.50)

Castro, Daniel C., Walker, Ian, Glocker, Ben

Causality matters in medical imaging

This article discusses how the language of causality can shed new light on the major challenges in machine learning for medical imaging: 1) data scarcity, which is the limited availability of high-quality annotations, and 2) data mismatch, whereby a trained algorithm may fail to generalize in clinical practice. Looking at these challenges through the lens of causality allows decisions about data collection, annotation procedures, and learning strategies to be made (and scrutinized) more transparently. We discuss how causal relationships between images and annotations can not only have profound effects on the performance of predictive models, but may even dictate which learning strategies should be considered in the first place. For example, we conclude that semi-supervision may be unsuitable for image segmentation---one of the possibly surprising insights from our causal analysis, which is illustrated with representative real-world examples of computer-aided diagnosis (skin lesion classification in dermatology) and radiotherapy (automated contouring of tumours). We highlight that being aware of and accounting for the causal relationships in medical imaging data is important for the safe development of machine learning and essential for regulation and responsible reporting. To facilitate this we provide step-by-step recommendations for future studies.

assumption, diagram, medical imaging, (15 more...)

1912.08142

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
(2 more...)

Genre: Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Therapeutic Area > Dermatology (1.00)
Health & Medicine > Health Care Technology (1.00)
(2 more...)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

An Embarrassingly Simple Baseline for eXtreme Multi-label Prediction

Verma, Yashaswi

The goal of eXtreme Multi-label Learning (XML) is to design and learn a model that can automatically annotate a given data point with the most relevant subset of labels from an extremely large label set. Recently, many techniques have been proposed for XML that achieve reasonable performance on benchmark datasets. Motivated by the complexities of these methods and their subsequent training requirements, in this paper we propose a simple baseline technique for this task. Precisely, we present a global feature embedding technique for XML that can easily scale to very large datasets containing millions of data points in a very high-dimensional feature space, irrespective of the number of samples and labels. Next, we show how an ensemble of such global embeddings can be used to achieve a further boost in prediction accuracy with only a linear increase in training and prediction time. During testing, we assign the labels using a weighted k-nearest neighbour classifier in the embedding space. Experiments reveal that, though conceptually simple, this technique achieves quite competitive results and has a training time of less than one minute using a single CPU core with 15.6 GB RAM, even for large-scale datasets such as Amazon-3M.
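The test-time step, a weighted k-nearest neighbour classifier in the embedding space, can be sketched as follows. This is a minimal illustration and not the author's implementation. It assumes the embeddings are already computed, and the cosine similarity, the similarity-sum weighting, and the names `cosine` and `predict_labels` are illustrative choices.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def predict_labels(query, train_vecs, train_labels, k=2, top=1):
    """Rank labels by the summed similarity of the k nearest training points."""
    neighbours = sorted(
        ((cosine(query, vec), i) for i, vec in enumerate(train_vecs)),
        reverse=True,
    )[:k]
    scores = defaultdict(float)
    for sim, i in neighbours:
        for label in train_labels[i]:
            scores[label] += sim  # each neighbour votes with its similarity
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda label: (-scores[label], label))
    return ranked[:top]


# Toy data: three embedded training points with their label sets.
train_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
train_labels = [{"a", "b"}, {"a"}, {"c"}]
prediction = predict_labels([1.0, 0.0], train_vecs, train_labels, k=2, top=1)
```

With this toy data the two nearest neighbours both carry label "a", so it ranks first. A brute-force scan like this one does not scale to millions of points, where an approximate nearest-neighbour index would normally be used instead.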

dataset, en-rp, lpsr-nb parabel dismec pd-sparse ppd-sparse, (11 more...)

1912.0814

Country:

Asia > India (0.04)
North America > United States (0.04)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Danassis, Panayiotis, Sakota, Marija, Filos-Ratsikas, Aris, Faltings, Boi

Putting Ridesharing to the Test: Efficient and Scalable Solutions and the Power of Dynamic Vehicle Relocation

Ridesharing is a coordination problem at its core. Traditionally it has been solved in a centralized manner by ridesharing platforms. Yet, to truly allow for scalable solutions, we need to shift from traditional approaches to multi-agent systems, ideally run on-device. In this paper, we show that a recently proposed heuristic (ALMA), which exhibits such properties, offers an efficient, end-to-end solution for the ridesharing problem. Moreover, by utilizing simple relocation schemes we significantly improve QoS metrics, by up to 50%. To demonstrate the latter, we perform a systematic evaluation of a diverse set of algorithms for the ridesharing problem, which is, to the best of our knowledge, one of the largest and most comprehensive to date. Our evaluation setting is specifically designed to resemble reality as closely as possible. In particular, we evaluate 12 different algorithms over 12 metrics related to global efficiency, complexity, passenger, driver, and platform incentives.

algorithm, manhattan, sd time, (15 more...)

1912.08066

Country:

North America > United States > New York > Richmond County > New York City (0.04)
North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > England > Merseyside > Liverpool (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry:

Transportation > Infrastructure & Services (1.00)
Transportation > Ground > Road (1.00)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)