In computer science, computational learning theory (or just learning theory) is a subfield of Artificial Intelligence devoted to studying the design and analysis of machine learning algorithms (Wikipedia)
I have monitoring system watching for bandwidth, connections and connections rates from multiple firewalls, which is stream of counters with interval 5 min. My current system create baseline from data for last 4 weeks and compare current value with baseline. It is ok but it either give me lots of false alerts or too slow to react without additional triggers. Is there anything better available today? Some system I can feed data in that will learn patterns and identify outages in real time.
When we learn something new, like a language or musical instrument, we often seek challenges at the edge of our competence--not so hard that we are discouraged, but not so easy that we get bored. This simple intuition, that there is a sweet spot of difficulty, a'Goldilocks zone'1, for motivation and learning is at the heart of modern teaching methods2 and is thought to account for differences in infant attention between more and less learnable stimuli1. In the animal learning literature it is the intuition behind shaping3 and fading4, whereby complex tasks are taught by steadily increasing the difficulty of a training task. It is also observable in the nearly universal'levels' feature in video games, in which the player is encouraged, or even forced, to a higher level of difficulty once a performance criterion has been achieved. Similarly in machine learning, steadily increasing the difficulty of training has proven useful for teaching large scale neural networks in a variety of tasks5,6, where it is known as'Curriculum Learning'7 and'Self-Paced Learning'8.
ICML, ICLR, and NeurIPS are all considering or experimenting with code and data submission as a part of the reviewer or publication process with the hypothesis that it aids reproducibility of results. Reproducibility has been a rising concern with discussions in paper, workshop, and invited talk. The fundamental driver is of course lack of reproducibility. Lack of reproducibility is an inherently serious and valid concern for any kind of publishing process where people rely on prior work to compare with and do new things. Lack of reproducibility (due to random initialization for example) was one of the things leading to a period of unpopularity for neural networks when I was a graduate student.
We consider the classical problem of learning rates for classes with finite VC dimension. It is well known that fast learning rates are achievable by the empirical risk minimization algorithm (ERM) if one of the low noise/margin assumptions such as Tsybakov's and Massart's condition is satisfied. In this paper, we consider an alternative way of obtaining fast learning rates in classification if none of these conditions are met. We first consider Chow's reject option model and show that by lowering the impact of a small fraction of hard instances, fast learning rate is achievable in an agnostic model by a specific learning algorithm. Similar results were only known under special versions of margin assumptions. We also show that the learning algorithm achieving these rates is adaptive to standard margin assumptions and always satisfies the risk bounds achieved by ERM. Based on our results on Chow's model, we then analyze a particular family of VC classes, namely classes with finite combinatorial diameter. Using their special structure, we show that there is an improper learning algorithm that provides fast rates of convergence even in the (poorly understood) situations where ERM is suboptimal. This provides the first setup in which an improper learning algorithm may significantly improve the learning rates for non-convex losses. Finally, we discuss some implications of our techniques to the analysis of ERM.
This paper introduces a new notion of dimensionality of probabilistic models from an information-theoretic view point. We call it the "descriptive dimension"(Ddim). We show that Ddim coincides with the number of independent parameters for the parametric class, and can further be extended to real-valued dimensionality when a number of models are mixed. The paper then derives the rate of convergence of the MDL (Minimum Description Length) learning algorithm which outputs a normalized maximum likelihood (NML) distribution with model of the shortest NML codelength. The paper proves that the rate is governed by Ddim. The paper also derives error probabilities of the MDL-based test for multiple model change detection. It proves that they are also governed by Ddim. Through the analysis, we demonstrate that Ddim is an intrinsic quantity which characterizes the performance of the MDL-based learning and change detection.
We consider learning problems where the training set consists of two types of examples: private and public. The goal is to design a learning algorithm that satisfies differential privacy only with respect to the private examples. This setting interpolates between private learning (where all examples are private) and classical learning (where all examples are public). We study the limits of learning in this setting in terms of private and public sample complexities. We show that any hypothesis class of VC-dimension $d$ can be agnostically learned up to an excess error of $\alpha$ using only (roughly) $d/\alpha$ public examples and $d/\alpha^2$ private labeled examples. This result holds even when the public examples are unlabeled. This gives a quadratic improvement over the standard $d/\alpha^2$ upper bound on the public sample complexity (where private examples can be ignored altogether if the public examples are labeled). Furthermore, we give a nearly matching lower bound, which we prove via a generic reduction from this setting to the one of private learning without public data.
Algorithmic machine teaching studies the interaction between a teacher and a learner where the teacher selects labeled examples aiming at teaching a target hypothesis. In a quest to lower teaching complexity and to achieve more natural teacher-learner interactions, several teaching models and complexity measures have been proposed for both the batch settings (e.g., worst-case, recursive, preference-based, and non-clashing models) as well as the sequential settings (e.g., local preference-based model). To better understand the connections between these different batch and sequential models, we develop a novel framework which captures the teaching process via preference functions $\Sigma$. In our framework, each function $\sigma \in \Sigma$ induces a teacher-learner pair with teaching complexity as $\TD(\sigma)$. We show that the above-mentioned teaching models are equivalent to specific types/families of preference functions in our framework. This equivalence, in turn, allows us to study the differences between two important teaching models, namely $\sigma$ functions inducing the strongest batch (i.e., non-clashing) model and $\sigma$ functions inducing a weak sequential (i.e., local preference-based) model. Finally, we identify preference functions inducing a novel family of sequential models with teaching complexity linear in the VC dimension of the hypothesis class: this is in contrast to the best known complexity result for the batch models which is quadratic in the VC dimension.
The generalization bounds for stable algorithms is a classical question in learning theory taking its roots in the early works of Vapnik and Chervonenkis and Rogers and Wagner. In a series of recent breakthrough papers, Feldman and Vondrak have shown that the best known high probability upper bounds for uniformly stable learning algorithms due to Bousquet and Elisseeff are sub-optimal in some natural regimes. To do so, they proved two generalization bounds that significantly outperform the original generalization bound. Feldman and Vondrak also asked if it is possible to provide sharper bounds and prove corresponding high probability lower bounds. This paper is devoted to these questions: firstly, inspired by the original arguments of, we provide a short proof of the moment bound that implies the generalization bound stronger than both recent results. Secondly, we prove general lower bounds, showing that our moment bound is sharp (up to a logarithmic factor) unless some additional properties of the corresponding random variables are used. Our main probabilistic result is a general concentration inequality for weakly correlated random variables, which may be of independent interest.
We know that Machine Learning is an extremely powerful tool for tackling complex problems which we don't know how to solve by conventional means. Problems like image classification can be solved effectively by Machine Learning because at the end of the day, gathering data for that kind of task is much easier than coming up with hand-written rules for such a complex and difficult problem. But what about problems we already know how to solve? Is there any reason to apply Machine Learning to problems we already have working solutions for? Tasks such as physics simulation, where the rules and equations governing the task are already well known and explored?