Collaborating Authors

 Kuznetsov, Vitaly


Multi-Class Deep Boosting

Neural Information Processing Systems

Our algorithms can use as a base classifier set a family of deep decision trees or other rich or complex families and yet benefit from strong generalization guarantees. We give new data-dependent learning bounds for convex ensembles in the multi-class classification setting expressed in terms of the Rademacher complexities of the sub-families composing the base classifier set, and the mixture weight assigned to each sub-family. These bounds are finer than existing ones both thanks to an improved dependency on the number of classes and, more crucially, by virtue of a more favorable complexity term expressed as an average of the Rademacher complexities based on the ensemble's mixture weights. We introduce and discuss several new multi-class ensemble algorithms benefiting from these guarantees, prove positive results for the H-consistency of several of them, and report the results of experiments showing that their performance compares favorably with that of multi-class versions of AdaBoost and Logistic Regression and their L1-regularized counterparts.
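
To make the "average of the Rademacher complexities based on the ensemble's mixture weights" concrete, here is a hedged sketch of the shape such a complexity term takes (notation mine, not the paper's exact statement): for a convex ensemble f = sum_t alpha_t h_t in which each base classifier h_t is drawn from a sub-family H_{k_t}, the bound scales with the mixture-weighted average of the sub-family complexities rather than with the complexity of the union of all sub-families:

    \sum_{t=1}^{T} \alpha_t \, \mathfrak{R}_m\big(H_{k_t}\big)
    \quad \text{rather than} \quad
    \mathfrak{R}_m\Big(\bigcup_{k=1}^{p} H_k\Big),
    \qquad \alpha_t \ge 0, \ \sum_{t=1}^{T} \alpha_t = 1.

This is why deep decision trees can appear in the base classifier set without degrading the guarantee, as long as the mixture weights concentrate on the simpler sub-families.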


Learning Theory and Algorithms for Forecasting Non-stationary Time Series

Neural Information Processing Systems

Our learning guarantees are expressed in terms of a data-dependent measure of sequential complexity and a discrepancy measure that can be estimated from data under some mild assumptions. We use our learning bounds to devise new algorithms for non-stationary time series forecasting for which we report some preliminary experimental results.
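
As a rough illustration of the kind of discrepancy measure referred to here (a simplified form under my own notation, not the paper's exact definition): for a loss L, a hypothesis set H, and sampling weights q_1, ..., q_T over the observed path Z_1, ..., Z_T, one can compare the conditional expected loss at the forecast time T+1 with the q-weighted conditional losses along the path,

    \mathrm{disc}(\mathbf{q}) \;=\; \sup_{h \in H} \Big(
        \mathbb{E}\big[L(h, Z_{T+1}) \,\big|\, Z_1^{T}\big]
        \;-\; \sum_{t=1}^{T} q_t \, \mathbb{E}\big[L(h, Z_t) \,\big|\, Z_1^{t-1}\big]
    \Big),

a quantity that is zero in the i.i.d. case with uniform weights and that grows with the degree of non-stationarity of the process.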


AdaNet: A Scalable and Flexible Framework for Automatically Learning Ensembles

arXiv.org Machine Learning

AdaNet is a lightweight TensorFlow-based (Abadi et al., 2015) framework for automatically learning high-quality ensembles with minimal expert intervention. Our framework is inspired by the AdaNet algorithm (Cortes et al., 2017) which learns the structure of a neural network as an ensemble of subnetworks. We designed it to: (1) integrate with the existing TensorFlow ecosystem, (2) offer sensible default search spaces to perform well on novel datasets, (3) present a flexible API to utilize expert information when available, and (4) efficiently accelerate training with distributed CPU, GPU, and TPU hardware.
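
For readers who want a feel for the search procedure this kind of framework automates, here is a schematic sketch of an AdaNet-style loop in plain Python (illustration only; this is not the adanet library's API, and all names below are hypothetical stand-ins): candidate subnetworks are proposed each round, trained against the frozen ensemble, and kept only if they improve a complexity-regularized objective.

    # Schematic AdaNet-style search loop (hypothetical names, not the adanet API).
    def adanet_style_search(generate_candidates, train_candidate, objective, rounds):
        """Grow an ensemble of subnetworks one candidate at a time.

        generate_candidates(ensemble): propose new subnetworks, e.g. a deeper and
            a wider variant of the most recently added one.
        train_candidate(ensemble, candidate): train the candidate with the current
            ensemble held fixed; return the trained candidate.
        objective(ensemble, candidate): complexity-regularized ensemble objective
            (lower is better).
        """
        ensemble, best_value = [], float("inf")
        for _ in range(rounds):
            trained = [train_candidate(ensemble, c) for c in generate_candidates(ensemble)]
            values = [objective(ensemble, c) for c in trained]
            value, winner = min(zip(values, trained), key=lambda vc: vc[0])
            if value >= best_value:      # no candidate improves the objective: stop
                break
            ensemble.append(winner)
            best_value = value
        return ensemble

In the actual framework the candidate generation, training, and objective are supplied by TensorFlow components, and training can be distributed across CPU, GPU, or TPU workers as noted above.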


Efficient Gradient Computation for Structured Output Learning with Rational and Tropical Losses

Neural Information Processing Systems

Many structured prediction problems admit a natural loss function for evaluation such as the edit-distance or n-gram loss. However, existing learning algorithms are typically designed to optimize alternative objectives such as the cross-entropy. This is because a naïve implementation of the natural loss functions often results in intractable gradient computations. In this paper, we design efficient gradient computation algorithms for two broad families of structured prediction loss functions: rational and tropical losses. These families include as special cases the n-gram loss, the edit-distance loss, and many other loss functions commonly used in natural language processing and computational biology tasks that are based on sequence similarity measures. Our algorithms make use of weighted automata and graph operations over appropriate semirings to design efficient solutions. They facilitate efficient gradient computation and hence enable one to train learning models such as neural networks with complex structured losses.
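
As a small, self-contained illustration of the "tropical" side (a toy example of mine, not the paper's automata-based algorithm): the edit-distance is a shortest-distance computation in the tropical (min, +) semiring, which is the structural property that automata and graph operations over semirings exploit.

    # Toy illustration: edit distance as a dynamic program over the tropical
    # (min, +) semiring. The paper works with weighted automata; this only
    # shows the semiring viewpoint on the loss itself.
    def edit_distance(a, b):
        """Levenshtein distance between sequences a and b."""
        dp = list(range(len(b) + 1))          # distances for the previous row
        for i in range(1, len(a) + 1):
            prev_diag, dp[0] = dp[0], i
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                prev_diag, dp[j] = dp[j], min(dp[j] + 1,         # deletion
                                              dp[j - 1] + 1,     # insertion
                                              prev_diag + cost)  # match/substitution
        return dp[-1]

    print(edit_distance("kitten", "sitting"))  # 3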


Foundations of Sequence-to-Sequence Modeling for Time Series

arXiv.org Artificial Intelligence

The availability of large amounts of time series data, paired with the performance of deep-learning algorithms on a broad class of problems, has recently led to significant interest in the use of sequence-to-sequence models for time series forecasting. We provide the first theoretical analysis of this time series forecasting framework. We include a comparison of sequence-to-sequence modeling to classical time series models, and as such our theory can serve as a quantitative guide for practitioners choosing between different modeling methodologies.
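
For readers unfamiliar with the setup being analyzed, here is a minimal sketch of how a univariate series is typically cast as a sequence-to-sequence learning problem (a generic framing, not the paper's notation): each training example pairs a window of past values with the window of values to be forecast.

    import numpy as np

    def make_seq2seq_examples(series, input_len, output_len):
        """Slice a 1-D series into (past window, future window) training pairs."""
        xs, ys = [], []
        for start in range(len(series) - input_len - output_len + 1):
            xs.append(series[start : start + input_len])
            ys.append(series[start + input_len : start + input_len + output_len])
        return np.array(xs), np.array(ys)

    series = np.sin(np.linspace(0.0, 20.0, 500))             # toy series
    X, Y = make_seq2seq_examples(series, input_len=48, output_len=12)
    print(X.shape, Y.shape)                                   # (441, 48) (441, 12)

A classical autoregressive approach, by contrast, typically predicts one step at a time and is iterated to produce multi-step forecasts.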


Online Non-Additive Path Learning under Full and Partial Information

arXiv.org Machine Learning

We consider the online path learning problem in a graph with non-additive gains/losses. Various settings of full information, semi-bandit, and full bandit are explored. We give an efficient implementation of the EXP3 algorithm for the full bandit setting with any (non-additive) gain. Then, focusing on the large family of non-additive count-based gains, we construct an intermediate graph which has equivalent gains that are additive. By operating on this intermediate graph, we are able to use algorithms like Component Hedge and ComBand for the first time for non-additive gains. Finally, we apply our methods to the important application of ensemble structured prediction.
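
For context, here is a minimal sketch of the standard EXP3 update for a bandit over K arms (the generic textbook version; the paper's contribution is making such updates efficient over the exponentially many paths of a graph with non-additive gains, which this sketch does not attempt):

    import math
    import random

    def exp3(num_arms, gain, rounds, gamma=0.1):
        """Standard EXP3 for adversarial bandits with gains in [0, 1].

        gain(t, arm) -> observed gain of the pulled arm at round t.
        """
        weights = [1.0] * num_arms
        for t in range(rounds):
            total = sum(weights)
            probs = [(1 - gamma) * w / total + gamma / num_arms for w in weights]
            arm = random.choices(range(num_arms), weights=probs)[0]
            # Importance-weighted gain estimate keeps the update unbiased.
            estimate = gain(t, arm) / probs[arm]
            weights[arm] *= math.exp(gamma * estimate / num_arms)
        total = sum(weights)
        return [w / total for w in weights]

    # Toy usage: arm 2 has slightly higher gains on average.
    dist = exp3(num_arms=3, rounds=5000,
                gain=lambda t, a: min(1.0, random.random() + (0.2 if a == 2 else 0.0)))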


Discriminative State Space Models

Neural Information Processing Systems

In this work, we introduce and study Discriminative State-Space Models (DSSMs). We provide the precise mathematical definition of this class of models in Section 2. Roughly speaking, a DSSM follows the same general structure as a classical state-space model and consists of a state predictor g and an observation predictor h. However, no assumption is made about the form of the stochastic process used to generate observations. This family of models includes existing generative models and other state-based discriminative models (e.g., RNNs) as special cases, and it also covers some novel algorithmic solutions explored in this paper.
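
A minimal sketch of the structure being described (the functions below are generic stand-ins of mine, not the concrete models studied in the paper): a state predictor g updates an internal state from the previous state and the latest observation, an observation predictor h maps the current state to a forecast, and nothing is assumed about how the observations themselves were generated.

    def dssm_forecasts(observations, g, h, initial_state):
        """Run a generic discriminative state-space predictor over a sequence.

        g(state, observation) -> next state       (state predictor)
        h(state)              -> predicted value  (observation predictor)
        """
        state, predictions = initial_state, []
        for y in observations:
            predictions.append(h(state))   # forecast before observing y
            state = g(state, y)            # then fold y into the state
        return predictions

    # Toy instantiation: simple exponential smoothing viewed this way.
    preds = dssm_forecasts(
        observations=[1.0, 2.0, 1.5, 3.0],
        g=lambda s, y: 0.7 * s + 0.3 * y,
        h=lambda s: s,
        initial_state=0.0,
    )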


Structured Prediction Theory Based on Factor Graph Complexity

Neural Information Processing Systems

We present a general theoretical analysis of structured prediction with a series of new results. We give new data-dependent margin guarantees for structured prediction for a very wide family of loss functions and a general family of hypotheses, with an arbitrary factor graph decomposition. These are the tightest margin bounds known for both standard multi-class and general structured prediction problems. Our guarantees are expressed in terms of a data-dependent complexity measure, factor graph complexity, which we show can be estimated from data and bounded in terms of familiar quantities for several commonly used hypothesis sets, and a sparsity measure for features and graphs. Our proof techniques include generalizations of Talagrand's contraction lemma that can be of independent interest. We further extend our theory by leveraging the principle of Voted Risk Minimization (VRM) and show that learning is possible even with complex factor graphs. We present new learning bounds for this advanced setting, which we use to devise two new algorithms, Voted Conditional Random Field (VCRF) and Voted Structured Boosting (StructBoost). These algorithms can make use of complex features and factor graphs and yet benefit from favorable learning guarantees. We also report the results of experiments with VCRF on several datasets to validate our theory.
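
To anchor the terminology, here is a schematic example under my own simplified notation (not the paper's definitions): with a factor graph decomposition, the score of a structured output is a sum of factor-local scores, and the margin of a labeled example is the gap between the score of the correct output and that of the best competing output.

    def factor_score(x, y, factors, score_fn):
        """Score a structured output y as a sum of factor-local scores.

        factors:  index tuples, e.g. [(0, 1), (1, 2)] for a chain over 3 positions.
        score_fn: score_fn(x, factor, y_factor) -> local score of the labels of y
                  restricted to that factor.
        """
        return sum(score_fn(x, f, tuple(y[i] for i in f)) for f in factors)

    def structured_margin(x, y_true, candidates, factors, score_fn):
        """Gap between the true output's score and the best competing candidate."""
        true_score = factor_score(x, y_true, factors, score_fn)
        best_rival = max(factor_score(x, y, factors, score_fn)
                         for y in candidates if y != y_true)
        return true_score - best_rival

The factor graph complexity mentioned above is, roughly, a Rademacher-style complexity of the family of such factor-decomposed scoring functions.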


Structured Prediction Theory Based on Factor Graph Complexity

arXiv.org Machine Learning

We present a general theoretical analysis of structured prediction with a series of new results. We give new data-dependent margin guarantees for structured prediction for a very wide family of loss functions and a general family of hypotheses, with an arbitrary factor graph decomposition. These are the tightest margin bounds known for both standard multi-class and general structured prediction problems. Our guarantees are expressed in terms of a data-dependent complexity measure, factor graph complexity, which we show can be estimated from data and bounded in terms of familiar quantities. We further extend our theory by leveraging the principle of Voted Risk Minimization (VRM) and show that learning is possible even with complex factor graphs. We present new learning bounds for this advanced setting, which we use to design two new algorithms, Voted Conditional Random Field (VCRF) and Voted Structured Boosting (StructBoost). These algorithms can make use of complex features and factor graphs and yet benefit from favorable learning guarantees. We also report the results of experiments with VCRF on several datasets to validate our theory.