Goto

Collaborating Authors

 Directed Networks


Backward Feature Correction: How Deep Learning Performs Deep Learning

arXiv.org Machine Learning

How does a 110-layer ResNet learn a high-complexity classifier using relatively few training examples and short training time? We present a theory towards explaining this in terms of $\textit{hierarchical learning}$. We refer hierarchical learning as the learner learns to represent a complicated target function by decomposing it into a sequence of simpler functions to reduce sample and time complexity. This paper formally analyzes how multi-layer neural networks can perform such hierarchical learning efficiently and automatically simply by applying stochastic gradient descent (SGD). On the conceptual side, we present, to the best of our knowledge, the FIRST theory result indicating how very deep neural networks can still be sample and time efficient on certain hierarchical learning tasks, when NO KNOWN non-hierarchical algorithms (such as kernel method, linear regression over feature mappings, tensor decomposition, sparse coding) are efficient. We establish a new principle called "backward feature correction", which we believe is the key to understand the hierarchical learning in multi-layer neural networks. On the technical side, we show for regression and even for binary classification, for every input dimension $d > 0$, there is a concept class consisting of degree $\omega(1)$ multi-variate polynomials so that, using $\omega(1)$-layer neural networks as learners, SGD can learn any target function from this class in $\mathsf{poly}(d)$ time using $\mathsf{poly}(d)$ samples to any $\frac{1}{\mathsf{poly}(d)}$ error, through learning to represent it as a composition of $\omega(1)$ layers of quadratic functions. In contrast, we present lower bounds stating that several non-hierarchical learners, including any kernel methods, neural tangent kernels, must suffer from $d^{\omega(1)}$ sample or time complexity to learn functions in this concept class even to any $d^{-0.01}$ error.


Numerical Sequence Prediction using Bayesian Concept Learning

arXiv.org Machine Learning

When people learn mathematical patterns or sequences, they are able to identify the concepts (or rules) underlying those patterns. Having learned the underlying concepts, humans are also able to generalize those concepts to other numbers, so far as to even identify previously unseen combinations of those rules. Current state-of-the art RNN architectures like LSTMs perform well in predicting successive elements of sequential data, but require vast amounts of training examples. Even with extensive data, these models struggle to generalize concepts. From our behavioral study, we also found that humans are able to disregard noise and identify the underlying rules generating the corrupted sequences. We therefore propose a Bayesian model that captures these human-like learning capabilities to predict next number in a given sequence, better than traditional LSTMs.


Multi-Sensor Data and Knowledge Fusion -- A Proposal for a Terminology Definition

arXiv.org Artificial Intelligence

Fusion is a common tool for the analysis and utilization of available datasets and so an essential part of data mining and machine learning processes. However, a clear definition of the type of fusion is not always provided due to inconsistent literature. In the following, the process of fusion is defined depending on the fusion components and the abstraction level on which the fusion occurs. The focus in the first part of the paper at hand is on the clear definition of the terminology and the development of an appropriate ontology of the fusion components and the fusion level. In the second part, common fusion techniques are presented.



Best-First Enumeration Based on Bounding Conflicts, and its Application to Large-scale Hybrid Estimation

Journal of Artificial Intelligence Research

There is an increasing desire for autonomous systems to have high levels of robustness and safety, attained through continuously planning and self-repairing online. Underlying this is the need to accurately estimate the system state and diagnose subtle failures. Estimation methods based on hybrid discrete and continuous state models have emerged as a method of precisely computing these estimates. However, existing methods have difficulty scaling to systems with more than a handful of components. Discrete, consistency based state estimation capabilities can scale to this level by combining best-first enumeration and conflict-directed search. While best-first methods have been developed for hybrid estimation, conflict-directed methods have thus far been elusive as conflicts learn inconsistencies from constraint violation, but probabilistic hybrid estimation is relatively unconstrained. In this paper we present an approach to hybrid estimation that unifies best-first enumeration and conflict-directed search through the concept of "bounding" conflicts, an extension of conflicts that represent tighter bounds on the cost of regions of the search space. This paper presents a general best-first enumeration algorithm based on bounding conflicts (A*BC) and a hybrid estimation method using this enumeration algorithm. Experiments show that an A*BC powered state estimator produces estimates up to an order of magnitude faster than the current state of the art, particularly on large systems.


Reproducible Bootstrap Aggregating

arXiv.org Machine Learning

Heterogeneity between training and testing data degrades reproducibility of a well-trained predictive algorithm. In modern applications, how to deploy a trained algorithm in a different domain is becoming an urgent question raised by many domain scientists. In this paper, we propose a reproducible bootstrap aggregating (Rbagging) method coupled with a new algorithm, the iterative nearest neighbor sampler (INNs), effectively drawing bootstrap samples from training data to mimic the distribution of the test data. Rbagging is a general ensemble framework that can be applied to most classifiers. We further propose Rbagging+ to effectively detect anomalous samples in the testing data. Our theoretical results show that the resamples based on Rbagging have the same distribution as the testing data. Moreover, under suitable assumptions, we further provide a general bound to control the test excess risk of the ensemble classifiers. The proposed method is compared with several other popular domain adaptation methods via extensive simulation studies and real applications including medical diagnosis and imaging classifications.


Unbiased and Efficient Log-Likelihood Estimation with Inverse Binomial Sampling

arXiv.org Machine Learning

The fate of scientific hypotheses often relies on the ability of a computational model to explain the data, quantified in modern statistical approaches by the likelihood function. The log-likelihood is the key element for parameter estimation and model evaluation. However, the log-likelihood of complex models in fields such as computational biology and neuroscience is often intractable to compute analytically or numerically. In those cases, researchers can often only estimate the log-likelihood by comparing observed data with synthetic observations generated by model simulations. Standard techniques to approximate the likelihood via simulation either use summary statistics of the data or are at risk of producing severe biases in the estimate. Here, we explore another method, inverse binomial sampling (IBS), which can estimate the log-likelihood of an entire data set efficiently and without bias. For each observation, IBS draws samples from the simulator model until one matches the observation. The log-likelihood estimate is then a function of the number of samples drawn. The variance of this estimator is uniformly bounded, achieves the minimum variance for an unbiased estimator, and we can compute calibrated estimates of the variance. We provide theoretical arguments in favor of IBS and an empirical assessment of the method for maximum-likelihood estimation with simulation-based models. As case studies, we take three model-fitting problems of increasing complexity from computational and cognitive neuroscience. In all problems, IBS generally produces lower error in the estimated parameters and maximum log-likelihood values than alternative sampling methods with the same average number of samples. Our results demonstrate the potential of IBS as a practical, robust, and easy to implement method for log-likelihood evaluation when exact techniques are not available.


Aggregated Learning: A Vector-Quantization Approach to Learning Neural Network Classifiers

arXiv.org Machine Learning

We consider the problem of learning a neural network classifier. Under the information bottleneck (IB) principle, we associate with this classification problem a representation learning problem, which we call "IB learning". We show that IB learning is, in fact, equivalent to a special class of the quantization problem. The classical results in rate-distortion theory then suggest that IB learning can benefit from a "vector quantization" approach, namely, simultaneously learning the representations of multiple input objects. Such an approach assisted with some variational techniques, result in a novel learning framework, "Aggregated Learning", for classification with neural network models. In this framework, several objects are jointly classified by a single neural network. The effectiveness of this framework is verified through extensive experiments on standard image recognition and text classification tasks. Introduction The revival of neural networks in the paradigm of deep learning (LeCun, Bengio, and Hinton 2015) has stimulated intense interest in understanding the networking of deep neural networks, e.g., (Shwartz-Ziv and Tishby 2017; Zhang et al. 2017). Among various efforts, an information-theoretic approach, information bottleneck (IB) (Tishby, Pereira, and Bialek 1999) stands out as a fundamental tool to theorize the learning of deep neural networks (Shwartz-Ziv and Tishby 2017; Saxe et al. 2018; Dai et al. 2018). Under the IB principle, the core of learning a neural network classifier is to find a representation T of the input example X, that contains as much as possible the information about X and as little as possible the information about the label Y .


Bayesian Semi-supervised learning under nonparanormality

arXiv.org Machine Learning

Semi-supervised learning is a classification method which makes use of both labeled data and unlabeled data for training. In this paper, we propose a semi-supervised learning algorithm using a Bayesian semi-supervised model. We make a general assumption that the observations will follow two multivariate normal distributions depending on their true labels after the same unknown transformation. We use B-splines to put a prior on the transformation function for each component. To use unlabeled data in a semi-supervised setting, we assume the labels are missing at random. The posterior distributions can then be described using our assumptions, which we compute by the Gibbs sampling technique. The proposed method is then compared with several other available methods through an extensive simulation study. Finally we apply the proposed method in real data contexts for diagnosing breast cancer and classify radar returns. We conclude that the proposed method has better prediction accuracy in a wide variety of cases.


Guidelines for enhancing data locality in selected machine learning algorithms

arXiv.org Machine Learning

To deal with the complexity of the new bigger and more complex generation of data, machine learning (ML) techniques are probably the first and foremost used. For ML algorithms to produce results in a reasonable amount of time, they need to be implemented efficiently. In this paper, we analyze one of the means to increase the performances of machine learning algorithms which is exploiting data locality. Data locality and access patterns are often at the heart of performance issues in computing systems due to the use of certain hardware techniques to improve performance. Altering the access patterns to increase locality can dramatically increase performance of a given algorithm. Besides, repeated data access can be seen as redundancy in data movement. Similarly, there can also be redundancy in the repetition of calculations. This work also identifies some of the opportunities for avoiding these redundancies by directly reusing computation results. We start by motivating why and how a more efficient implementation can be achieved by exploiting reuse in the memory hierarchy of modern instruction set processors. Next we document the possibilities of such reuse in some selected machine learning algorithms. Keywords: Increasing data locality, data redundancy and reuse, machine learning, supervised learners... Notice This an extended version of the paper titled "Reviewing Data Access Patterns and Computational Redundancy for Machine Learning Algorithms" that appeared in the proceedings of the IADIS International Conference Big Data Analytics, Data Mining and Computational Intelligence 2019 (part of MCCSIS 2019)" [19] The final publication of this article is available at IOS Press through http://dx.doi.org/10.3233/IDA-184287. Because processor speed is increasing at a much faster rate than memory speed, computer architects have turned increasingly to the use of memory hierarchies with one or more levels of cache memory. This caching technique takes advantage of data locality in programs which is the property that references to the same memory location (temporal locality) or adjacent locations (spatial locality) reused within a short period of time. 1 One of the most popular ways to increase it is to rewrite the data intensive parts of the program, almost always the loops [14]. A simple example of this is to interchange the two loops in Algorithm 1 such that the code looks like Algorithm 2; note that the indices in the loop headers have changed.