Goto

Collaborating Authors

 Statistical Learning


Network Anomaly Detection: A Survey and Comparative Analysis of Stochastic and Deterministic Methods

arXiv.org Machine Learning

We present five methods to the problem of network anomaly detection. These methods cover most of the common techniques in the anomaly detection field, including Statistical Hypothesis Tests (SHT), Support Vector Machines (SVM) and clustering analysis. We evaluate all methods in a simulated network that consists of nominal data, three flow-level anomalies and one packet-level attack. Through analyzing the results, we point out the advantages and disadvantages of each method and conclude that combining the results of the individual methods can yield improved anomaly detection results.


Distribution-Dependent Sample Complexity of Large Margin Learning

arXiv.org Machine Learning

We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the margin-adapted dimension, which is a simple function of the second order statistics of the data distribution, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the margin-adapted dimension of the data distribution. The upper bounds are universal, and the lower bounds hold for the rich family of sub-Gaussian distributions with independent features. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. To prove the lower bound, we develop several new tools of independent interest. These include new connections between shattering and hardness of learning, new properties of shattering with linear classifiers, and a new lower bound on the smallest eigenvalue of a random Gram matrix generated by sub-Gaussian variables. Our results can be used to quantitatively compare large margin learning to other learning rules, and to improve the effectiveness of methods that use sample complexity bounds, such as active learning.


Integrated Pre-Processing for Bayesian Nonlinear System Identification with Gaussian Processes

arXiv.org Machine Learning

We introduce GP-FNARX: a new model for nonlinear system identification based on a nonlinear autoregressive exogenous model (NARX) with filtered regressors (F) where the nonlinear regression problem is tackled using sparse Gaussian processes (GP). We integrate data pre-processing with system identification into a fully automated procedure that goes from raw data to an identified model. Both pre-processing parameters and GP hyper-parameters are tuned by maximizing the marginal likelihood of the probabilistic model. We obtain a Bayesian model of the system's dynamics which is able to report its uncertainty in regions where the data is scarce. The automated approach, the modeling of uncertainty and its relatively low computational cost make of GP-FNARX a good candidate for applications in robotics and adaptive control.


Local Support Vector Machines:Formulation and Analysis

arXiv.org Machine Learning

We provide a formulation for Local Support Vector Machines (LSVMs) that generalizes previous formulations, and brings out the explicit connections to local polynomial learning used in nonparametric estimation literature. We investigate the simplest type of LSVMs called Local Linear Support Vector Machines (LLSVMs). For the first time we establish conditions under which LLSVMs make Bayes consistent predictions at each test point $x_0$. We also establish rates at which the local risk of LLSVMs converges to the minimum value of expected local risk at each point $x_0$. Using stability arguments we establish generalization error bounds for LLSVMs.


Efficient Orthogonal Tensor Decomposition, with an Application to Latent Variable Model Learning

arXiv.org Machine Learning

Decomposing tensors into orthogonal factors is a well-known task in statistics, machine learning, and signal processing. We study orthogonal outer product decompositions where the factors in the summands in the decomposition are required to be orthogonal across summands, by relating this orthogonal decomposition to the singular value decompositions of the flattenings. We show that it is a non-trivial assumption for a tensor to have such an orthogonal decomposition, and we show that it is unique (up to natural symmetries) in case it exists, in which case we also demonstrate how it can be efficiently and reliably obtained by a sequence of singular value decompositions. We demonstrate how the factoring algorithm can be applied for parameter identification in latent variable and mixture models.


Enhancements of Multi-class Support Vector Machine Construction from Binary Learners using Generalization Performance

arXiv.org Machine Learning

We propose several novel methods for enhancing the multi-class SVMs by applying the generalization performance of binary classifiers as the core idea. This concept will be applied on the existing algorithms, i.e., the Decision Directed Acyclic Graph (DDAG), the Adaptive Directed Acyclic Graphs (ADAG), and Max Wins. Although in the previous approaches there have been many attempts to use some information such as the margin size and the number of support vectors as performance estimators for binary SVMs, they may not accurately reflect the actual performance of the binary SVMs. We show that the generalization ability evaluated via a cross-validation mechanism is more suitable to directly extract the actual performance of binary SVMs. Our methods are built around this performance measure, and each of them is crafted to overcome the weakness of the previous algorithm. The proposed methods include the Reordering Adaptive Directed Acyclic Graph (RADAG), Strong Elimination of the classifiers (SE), Weak Elimination of the classifiers (WE), and Voting based Candidate Filtering (VCF). Experimental results demonstrate that our methods give significantly higher accuracy than all of the traditional ones. Especially, WE provides significantly superior results compared to Max Wins which is recognized as the state of the art algorithm in terms of both accuracy and classification speed with two times faster in average. Introduction The support vector machine (SVM) [1, 2] is a high performance learning algorithm constructing a hyperplane to separate two-class data by maximizing the margin between them. There are two approaches for extending SVMs to multi-class problems, i.e., solving the problem by formulating all classes of data under a single optimization, and combining several two-class subproblems. However, the difficulty and complexity to solve the problem with the first method are due to the increase of the number of classes and the size of training data, so the second method is more suitable for practical use. In this paper, we focus on the second approach. For constructing a multi-class classifier from binary ones, the method called one-against-one trains each binary classifier on only two out ofN classes, and builds N (N 1)/ 2 possible classifiers. Several strategies have been proposed for combining the trained classifiers to make the final classification for an unseen data. Friedman [3] suggested the combination strategy called Max Wins .


Sparse and Functional Principal Components Analysis

arXiv.org Machine Learning

Regularized principal components analysis, especially Sparse PCA and Functional PCA, has become widely used for dimension reduction in high-dimensional settings. Many examples of massive data, however, may benefit from estimating both sparse AND functional factors. These include neuroimaging data where there are discrete brain regions of activation (sparsity) but these regions tend to be smooth spatially (functional). Here, we introduce an optimization framework that can encourage both sparsity and smoothness of the row and/or column PCA factors. This framework generalizes many of the existing approaches to Sparse PCA, Functional PCA and two-way Sparse PCA and Functional PCA, as these are all special cases of our method. In particular, our method permits flexible combinations of sparsity and smoothness that lead to improvements in feature selection and signal recovery as well as more interpretable PCA factors. We demonstrate our method on simulated data and a neuroimaging example on EEG data. This work provides a unified framework for regularized PCA that can form the foundation for a cohesive approach to regularization in high-dimensional multivariate analysis.


Task-Driven Dictionary Learning

arXiv.org Machine Learning

Modeling data with linear combinations of a few elements from a learned dictionary has been the focus of much recent research in machine learning, neuroscience and signal processing. For signals such as natural images that admit such sparse representations, it is now well established that these models are well suited to restoration tasks. In this context, learning the dictionary amounts to solving a large-scale matrix factorization problem, which can be done efficiently with classical optimization tools. The same approach has also been used for learning features from data for other purposes, e.g., image classification, but tuning the dictionary in a supervised way for these tasks has proven to be more difficult. In this paper, we present a general formulation for supervised dictionary learning adapted to a wide variety of tasks, and present an efficient algorithm for solving the corresponding optimization problem. Experiments on handwritten digit classification, digital art identification, nonlinear inverse image problems, and compressed sensing demonstrate that our approach is effective in large-scale settings, and is well suited to supervised and semi-supervised classification, as well as regression tasks for data that admit sparse representations.


Exponentially Fast Parameter Estimation in Networks Using Distributed Dual Averaging

arXiv.org Machine Learning

In this paper we present an optimization-based view of distributed parameter estimation and observational social learning in networks. Agents receive a sequence of random, independent and identically distributed (i.i.d.) signals, each of which individually may not be informative about the underlying true state, but the signals together are globally informative enough to make the true state identifiable. Using an optimization-based characterization of Bayesian learning as proximal stochastic gradient descent (with Kullback-Leibler divergence from a prior as a proximal function), we show how to efficiently use a distributed, online variant of Nesterov's dual averaging method to solve the estimation with purely local information. When the true state is globally identifiable, and the network is connected, we prove that agents eventually learn the true parameter using a randomized gossip scheme. We demonstrate that with high probability the convergence is exponentially fast with a rate dependent on the KL divergence of observations under the true state from observations under the second likeliest state. Furthermore, our work also highlights the possibility of learning under continuous adaptation of network which is a consequence of employing constant, unit stepsize for the algorithm.


Spectral Clustering with Imbalanced Data

arXiv.org Machine Learning

Spectral clustering is sensitive to how graphs are constructed from data particularly when proximal and imbalanced clusters are present. We show that Ratio-Cut (RCut) or normalized cut (NCut) objectives are not tailored to imbalanced data since they tend to emphasize cut sizes over cut values. We propose a graph partitioning problem that seeks minimum cut partitions under minimum size constraints on partitions to deal with imbalanced data. Our approach parameterizes a family of graphs, by adaptively modulating node degrees on a fixed node set, to yield a set of parameter dependent cuts reflecting varying levels of imbalance. The solution to our problem is then obtained by optimizing over these parameters. We present rigorous limit cut analysis results to justify our approach. We demonstrate the superiority of our method through unsupervised and semi-supervised experiments on synthetic and real data sets.