Gao, Yue, Rosenberg, Harrison, Fawaz, Kassem, Jha, Somesh, Hsu, Justin

Recent advances in machine learning (ML) algorithms, especially deep neural networks (DNNs), have demonstrated remarkable success (sometimes exceeding human-level performance) on several tasks, including face and speech recognition. However, ML algorithms are vulnerable to \emph{adversarial attacks}, such test-time, training-time, and backdoor attacks. In test-time attacks an adversary crafts adversarial examples, which are specially crafted perturbations imperceptible to humans which, when added to an input example, force a machine learning model to misclassify the given input example. Adversarial examples are a concern when deploying ML algorithms in critical contexts, such as information security and autonomous driving. Researchers have responded with a plethora of defenses. One promising defense is \emph{randomized smoothing} in which a classifier's prediction is smoothed by adding random noise to the input example we wish to classify. In this paper, we theoretically and empirically explore randomized smoothing. We investigate the effect of randomized smoothing on the feasible hypotheses space, and show that for some noise levels the set of hypotheses which are feasible shrinks due to smoothing, giving one reason why the natural accuracy drops after smoothing. To perform our analysis, we introduce a model for randomized smoothing which abstracts away specifics, such as the exact distribution of the noise. We complement our theoretical results with extensive experiments.

In many applications, data is collected in batches, some of which are corrupt or even adversarial. Recent work derived optimal robust algorithms for estimating discrete distributions in this setting. We consider a general framework of robust learning from batches, and determine the limits of both classification and distribution estimation over arbitrary, including continuous, domains. Building on these results, we derive the first robust agnostic computationally-efficient learning algorithms for piecewise-interval classification, and for piecewise-polynomial, monotone, log-concave, and gaussian-mixture distribution estimation.

Ayanzadeh, Ramin, Halem, Milton, Finin, Tim

We introduce the reinforcement quantum annealing (RQA) scheme in which an intelligent agent interacts with a quantum annealer that plays the stochastic environment role of learning automata and tries to iteratively find better Ising Hamiltonians for the given problem of interest. As a proof-of-concept, we propose a novel approach for reducing the NP-complete problem of Boolean satisfiability (SAT) to minimizing Ising Hamiltonians and show how to apply the RQA for increasing the probability of finding the global optimum. Our experimental results on two different benchmark SAT problems (namely factoring pseudo-prime numbers and random SAT with phase transitions), using a D-Wave 2000Q quantum processor, demonstrated that RQA finds notably better solutions with fewer samples, compared to state-of-the-art techniques in the realm of quantum annealing.

Dhar, Sauptik, Guo, Junyao, Liu, Jiayi, Tripathi, Samarth, Kurup, Unmesh, Shah, Mohak

The current paradigm for using machine learning models on a device is to train a model in the cloud and perform inference using the trained model on the device. However, with the increasing number of smart devices and improved hardware, there is interest in performing model training on the device. Given this surge in interest, a comprehensive survey of the field from a device-agnostic perspective sets the stage for both understanding the state-of-the-art and for identifying open challenges and future avenues of research. Since on-device learning is an expansive field with connections to a large number of related topics in AI and machine learning (including online learning, model adaptation, one/few-shot learning, etc), covering such a large number of topics in a single survey is impractical. Instead, this survey finds a middle ground by reformulating the problem of on-device learning as resource constrained learning where the resources are compute and memory. This reformulation allows tools, techniques, and algorithms from a wide variety of research areas to be compared equitably. In addition to summarizing the state of the art, the survey also identifies a number of challenges and next steps for both the algorithmic and theoretical aspects of on-device learning.

Asadi, Behzad, Varadharajan, Vijay

We design a classifier for transactional datasets with application in malware detection. We build the classifier based on the minimum description length (MDL) principle. This involves selecting a model that best compresses the training dataset for each class considering the MDL criterion. To select a model for a dataset, we first use clustering followed by closed frequent pattern mining to extract a subset of closed frequent patterns (CFPs). We show that this method acts as a pattern summarization method to avoid pattern explosion; this is done by giving priority to longer CFPs, and without requiring to extract all CFPs. We then use the MDL criterion to further summarize extracted patterns, and construct a code table of patterns. This code table is considered as the selected model for the compression of the dataset. We evaluate our classifier for the problem of static malware detection in portable executable (PE) files. We consider API calls of PE files as their distinguishing features. The presence-absence of API calls forms a transactional dataset. Using our proposed method, we construct two code tables, one for the benign training dataset, and one for the malware training dataset. Our dataset consists of 19696 benign, and 19696 malware samples, each a binary sequence of size 22761. We compare our classifier with deep neural networks providing us with the state-of-the-art performance. The comparison shows that our classifier performs very close to deep neural networks. We also discuss that our classifier is an interpretable classifier. This provides the motivation to use this type of classifiers where some degree of explanation is required as to why a sample is classified under one class rather than the other class.

This is an up-to-date introduction to and overview of the Minimum Description Length (MDL) Principle, a theory of inductive inference that can be applied to general problems in statistics, machine learning and pattern recognition. While MDL was originally based on data compression ideas, this introduction can be read without any knowledge thereof. It takes into account all major developments since 2007, the last time an extensive overview was written. These include new methods for model selection and averaging and hypothesis testing, as well as the first completely general definition of {\em MDL estimators}. Incorporating these developments, MDL can be seen as a powerful extension of both penalized likelihood and Bayesian approaches, in which penalization functions and prior distributions are replaced by more general luckiness functions, average-case methodology is replaced by a more robust worst-case approach, and in which methods classically viewed as highly distinct, such as AIC vs BIC and cross-validation vs Bayes can, to a large extent, be viewed from a unified perspective.

Joy, Tinu Theckel, Rana, Santu, Gupta, Sunil, Venkatesh, Svetha

In this paper we develop a Bayesian optimization based hyperparameter tuning framework inspired by statistical learning theory for classifiers. We utilize two key facts from PAC learning theory; the generalization bound will be higher for a small subset of data compared to the whole, and the highest accuracy for a small subset of data can be achieved with a simple model. We initially tune the hyperparameters on a small subset of training data using Bayesian optimization. While tuning the hyperparameters on the whole training data, we leverage the insights from the learning theory to seek more complex models. We realize this by using directional derivative signs strategically placed in the hyperparameter search space to seek a more complex model than the one obtained with small data. We demonstrate the performance of our method on the tasks of tuning the hyperparameters of several machine learning algorithms.

Chandrasekaran, Varun, Chaudhuri, Kamalika, Giacomelli, Irene, Jha, Somesh, Yan, Songbai

Machine learning is being increasingly used by individuals, research institutions, and corporations. This has resulted in the surge of Machine Learning-as-a-Service (MLaaS) - cloud services that provide (a) tools and resources to learn the model, and (b) a user-friendly query interface to access the model. However, such MLaaS systems raise privacy concerns, one being model extraction. Adversaries maliciously exploit the query interface to steal the model. More precisely, in a model extraction attack, a good approximation of a sensitive or proprietary model held by the server is extracted (i.e. learned) by a dishonest user. Such a user only sees the answers to select queries sent using the query interface. This attack was recently introduced by Tramer et al. at the 2016 USENIX Security Symposium, where practical attacks for different models were shown. We believe that better understanding the efficacy of model extraction attacks is paramount in designing better privacy-preserving MLaaS systems. To that end, we take the first step by (a) formalizing model extraction and proposing the first definition of extraction defense, and (b) drawing parallels between model extraction and the better investigated active learning framework. In particular, we show that recent advancements in the active learning domain can be used to implement both model extraction, and defenses against such attacks.

Matrix factorization is a key tool in data analysis; its applications include recommender systems, correlation analysis, signal processing, among others. Binary matrices are a particular case which has received significant attention for over thirty years, especially within the field of data mining. Dictionary learning refers to a family of methods for learning overcomplete basis (also called frames) in order to efficiently encode samples of a given type; this area, now also about twenty years old, was mostly developed within the signal processing field. In this work we propose two binary matrix factorization methods based on a binary adaptation of the dictionary learning paradigm to binary matrices. The proposed algorithms focus on speed and scalability; they work with binary factors combined with bit-wise operations and a few auxiliary integer ones. Furthermore, the methods are readily applicable to online binary matrix factorization. Another important issue in matrix factorization is the choice of rank for the factors; we address this model selection problem with an efficient method based on the Minimum Description Length principle. Our preliminary results show that the proposed methods are effective at producing interpretable factorizations of various data types of different nature.

Constrained counting and sampling are two fundamental problems in Computer Science with numerous applications, including network reliability, privacy, probabilistic reasoning, and constrained-random verification. In constrained counting, the task is to compute the total weight, subject to a given weighting function, of the set of solutions of the given constraints. In constrained sampling, the task is to sample randomly, subject to a given weighting function, from the set of solutions to a set of given constraints. Consequently, constrained counting and sampling have been subject to intense theoretical and empirical investigations over the years. Prior work, however, offered either heuristic techniques with poor guarantees of accuracy or approaches with proven guarantees but poor performance in practice. In this thesis, we introduce a novel hashing-based algorithmic framework for constrained sampling and counting that combines the classical algorithmic technique of universal hashing with the dramatic progress made in combinatorial reasoning tools, in particular, SAT and SMT, over the past two decades. The resulting frameworks for counting (ApproxMC2) and sampling (UniGen) can handle formulas with up to million variables representing a significant boost up from the prior state of the art tools' capability to handle few hundreds of variables. If the initial set of constraints is expressed as Disjunctive Normal Form (DNF), ApproxMC is the only known Fully Polynomial Randomized Approximation Scheme (FPRAS) that does not involve Monte Carlo steps. By exploiting the connection between definability of formulas and variance of the distribution of solutions in a cell defined by 3-universal hash functions, we introduced an algorithmic technique, MIS, that reduced the size of XOR constraints employed in the underlying universal hash functions by as much as two orders of magnitude.