Computational Learning Theory
Towards a Robust Classifier: An MDL-Based Method for Generating Adversarial Examples
Asadi, Behzad, Varadharajan, Vijay
We address the problem of adversarial examples in machine learning where an adversary tries to misguide a classifier by making functionality-preserving modifications to original samples. We assume a black-box scenario where the adversary has access to only the feature set, and the final hard-decision output of the classifier. We propose a method to generate adversarial examples using the minimum description length (MDL) principle. Our final aim is to improve the robustness of the classifier by considering generated examples in rebuilding the classifier. We evaluate our method for the application of static malware detection in portable executable (PE) files. We consider API calls of PE files as their distinguishing features where the feature vector is a binary vector representing the presence-absence of API calls. In our method, we first create a dataset of benign samples by querying the target classifier. We next construct a code table of frequent patterns for the compression of this dataset using the MDL principle. We finally generate an adversarial example corresponding to a malware sample by selecting and adding a pattern from the benign code table to the malware sample. The selected pattern is the one that minimizes the length of the compressed adversarial example given the code table. This modification preserves the functionalities of the original malware sample as all original API calls are kept, and only some new API calls are added. Considering a neural network, we show that the evasion rate is 78.24 percent for adversarial examples compared to 8.16 percent for original malware samples. This shows the effectiveness of our method in generating examples that need to be considered in rebuilding the classifier.
On the Sample Complexity of Learning Sum-Product Networks
Aden-Ali, Ishaq, Ashtiani, Hassan
Sum-Product Networks (SPNs) can be regarded as a form of deep graphical models that compactly represent deeply factored and mixed distributions. An SPN is a rooted directed acyclic graph (DAG) consisting of a set of leaves (corresponding to base distributions), a set of sum nodes (which represent mixtures of their children distributions) and a set of product nodes (representing the products of its children distributions). In this work, we initiate the study of the sample complexity of PAC-learning the set of distributions that correspond to SPNs. We show that the sample complexity of learning tree structured SPNs with the usual type of leaves (i.e., Gaussian or discrete) grows at most linearly (up to logarithmic factors) with the number of parameters of the SPN. More specifically, we show that the class of distributions that corresponds to tree structured Gaussian SPNs with $k$ mixing weights and $e$ ($d$-dimensional Gaussian) leaves can be learned within Total Variation error $\epsilon$ using at most $\widetilde{O}(\frac{ed^2+k}{\epsilon^2})$ samples. A similar result holds for tree structured SPNs with discrete leaves. We obtain the upper bounds based on the recently proposed notion of distribution compression schemes. More specifically, we show that if a (base) class of distributions $\mathcal{F}$ admits an "efficient" compression, then the class of tree structured SPNs with leaves from $\mathcal{F}$ also admits an efficient compression.
Rademacher complexity and spin glasses: A link between the replica and statistical theories of learning
Abbara, Alia, Aubin, Benjamin, Krzakala, Florent, Zdeborová, Lenka
Statistical learning theory provides bounds of the generalization gap, using in particular the Vapnik-Chervonenkis dimension and the Rademacher complexity. An alternative approach, mainly studied in the statistical physics literature, is the study of generalization in simple synthetic-data models. Here we discuss the connections between these approaches and focus on the link between the Rademacher complexity in statistical learning and the theories of generalization for typical-case synthetic models from statistical physics, involving quantities known as Gardner capacity and ground state energy. We show that in these models the Rademacher complexity is closely related to the ground state energy computed by replica theories. Using this connection, one may reinterpret many results of the literature as rigorous Rademacher bounds in a variety of models in the high-dimensional statistics limit. Somewhat surprisingly, we also show that statistical learning theory provides predictions for the behavior of the ground-state energies in some full replica symmetry breaking models.
PAC learning with stable and private predictions
We study binary classification algorithms for which the prediction on any point is not too sensitive to individual examples in the dataset. Specifically, we consider the notions of uniform stability (Bousquet and Elisseeff, 2001) and prediction privacy (Dwork and Feldman, 2018). Previous work on these notions shows how they can be achieved in the standard PAC model via simple aggregation of models trained on disjoint subsets of data. Unfortunately, this approach leads to a significant overhead in terms of sample complexity. Here we demonstrate several general approaches to stable and private prediction that either eliminate or significantly reduce the overhead. Specifically, we demonstrate that for any class $C$ of VC dimension $d$ there exists a $\gamma$-uniformly stable algorithm for learning $C$ with excess error $\alpha$ using $\tilde O(d/(\alpha\gamma) + d/\alpha^2)$ samples. We also show that this bound is nearly tight. For $\epsilon$-differentially private prediction we give two new algorithms: one using $\tilde O(d/(\alpha^2\epsilon))$ samples and another one using $\tilde O(d^2/(\alpha\epsilon) + d/\alpha^2)$ samples. The best previously known bounds for these problems are $O(d/(\alpha^2\gamma))$ and $O(d/(\alpha^3\epsilon))$, respectively.
r/MachineLearning - [D] AI to monitor network
I have monitoring system watching for bandwidth, connections and connections rates from multiple firewalls, which is stream of counters with interval 5 min. My current system create baseline from data for last 4 weeks and compare current value with baseline. It is ok but it either give me lots of false alerts or too slow to react without additional triggers. Is there anything better available today? Some system I can feed data in that will learn patterns and identify outages in real time.
Vouw: Geometric Pattern Mining using the MDL Principle
Faas, Micky, van Leeuwen, Matthijs
We introduce geometric pattern mining, the problem of finding recurring local structure in discrete, geometric matrices. It differs from existing pattern mining problems by identifying complex spatial relations between elements, resulting in arbitrarily shaped patterns. After we formalise this new type of pattern mining, we propose an approach to selecting a set of patterns using the Minimum Description Length principle. We demonstrate the potential of our approach by introducing Vouw, a heuristic algorithm for mining exact geometric patterns. We show that Vouw delivers high-quality results with a synthetic benchmark.
Learning Internal Representations (PhD Thesis)
Most machine learning theory and practice is concerned with learning a single task. In this thesis it is argued that in general there is insufficient information in a single task for a learner to generalise well and that what is required for good generalisation is information about many similar learning tasks. Similar learning tasks form a body of prior information that can be used to constrain the learner and make it generalise better. Examples of learning scenarios in which there are many similar tasks are handwritten character recognition and spoken word recognition. The concept of the environment of a learner is introduced as a probability measure over the set of learning problems the learner might be expected to learn. It is shown how a sample from the environment may be used to learn a representation, or recoding of the input space that is appropriate for the environment. Learning a representation can equivalently be thought of as learning the appropriate features of the environment. Bounds are derived on the sample size required to ensure good generalisation from a representation learning process. These bounds show that under certain circumstances learning a representation appropriate for $n$ tasks reduces the number of examples required of each task by a factor of $n$. Once a representation is learnt it can be used to learn novel tasks from the same environment, with the result that far fewer examples are required of the new tasks to ensure good generalisation. Bounds are given on the number of tasks and the number of samples from each task required to ensure that a representation will be a good one for learning novel tasks. The results on representation learning are generalised to cover any form of automated hypothesis space bias.
The Eighty Five Percent Rule for optimal learning
When we learn something new, like a language or musical instrument, we often seek challenges at the edge of our competence--not so hard that we are discouraged, but not so easy that we get bored. This simple intuition, that there is a sweet spot of difficulty, a'Goldilocks zone'1, for motivation and learning is at the heart of modern teaching methods2 and is thought to account for differences in infant attention between more and less learnable stimuli1. In the animal learning literature it is the intuition behind shaping3 and fading4, whereby complex tasks are taught by steadily increasing the difficulty of a training task. It is also observable in the nearly universal'levels' feature in video games, in which the player is encouraged, or even forced, to a higher level of difficulty once a performance criterion has been achieved. Similarly in machine learning, steadily increasing the difficulty of training has proven useful for teaching large scale neural networks in a variety of tasks5,6, where it is known as'Curriculum Learning'7 and'Self-Paced Learning'8.
On Robustness to Adversarial Examples and Polynomial Optimization
Awasthi, Pranjal, Dutta, Abhratanu, Vijayaraghavan, Aravindan
We study the design of computationally efficient algorithms with provable guarantees, that are robust to adversarial (test time) perturbations. While there has been an proliferation of recent work on this topic due to its connections to test time robustness of deep networks, there is limited theoretical understanding of several basic questions like (i) when and how can one design provably robust learning algorithms? (ii) what is the price of achieving robustness to adversarial examples in a computationally efficient manner? The main contribution of this work is to exhibit a strong connection between achieving robustness to adversarial examples, and a rich class of polynomial optimization problems, thereby making progress on the above questions. In particular, we leverage this connection to (a) design computationally efficient robust algorithms with provable guarantees for a large class of hypothesis, namely linear classifiers and degree-2 polynomial threshold functions (PTFs), (b) give a precise characterization of the price of achieving robustness in a computationally efficient manner for these classes, (c) design efficient algorithms to certify robustness and generate adversarial attacks in a principled manner for 2-layer neural networks. We empirically demonstrate the effectiveness of these attacks on real data.
Rethinking Generalisation
Marcu, Antonia, Prügel-Bennett, Adam
Vision, Learning and Control University of Southampton Southampton, UK Abstract In this paper, we present a new approach to computing the generalisation performance assuming that the distribution of risks, ρ (r), for a learning scenario is known. This allows us to compute the expected error of a learning machine using empirical risk minimisation. We show that it is possible to obtain results for both classification and regression. We show a critical quantity in determining the generalisation performance is the power-law behaviour of ρ ( r) around its minimum value. We compute ρ ( r) for the case of all Boolean functions and for the perceptron. We start with a simplistic analysis but then do a more formal one later on. We show that the simplistic results are qualitatively correct and provide a good approximation to the actual results if we replace the true training set size with an approximate training set size. Keywords: Generalisation, Learning Theory 1. Introduction Traditional computational learning theory aims to eliminate all rules that do not correctly explain the data. A rule can be thought of as a fixed set of parameters of a learning machine; more formally, a hypothesis. This process relies on the idea that rules with poor generalisation performance (high risk) will, with high probability, make errors on a sufficiently large randomly chosen training data set (Vapnik and Chervonenkis, 1971; Valiant, 1984; Baum and Haussler, 1989; Blumer et al., 1989; Haussler, 1992; Vapnik, 1992). Suppose there exists a mechanism for selecting a rule from the subset of rules that have the lowest errors on the training set. Then, there is a very small probability that any of the selected rules has a high risk. However, this crucially depends on there being effectively a finite number of hypotheses, otherwise, there could still be a high-risk set of parameters which by chance did well on the particular training set. In the case where the learning machine has a continuous parameter space (so that the dimensionality of the space is uncountably infinite), we consider the effective size of the hypothesis space to be the Vapnik-Chervonenkis (VC) dimension. The VC dimension measures the number of possible ways in which the machine can give different outputs to a finite number of training examples (Vapnik and Chervonenkis, 1971). This effective size or capacity lies at the heart of conventional computational learning theory. By limiting the capacity we can obtain stronger bounds on the generalisation performance. In this paper, we challenge this traditional approach.