Inductive Learning
Reviews: On the Fine-Grained Complexity of Empirical Risk Minimization: Kernel Methods and Neural Networks
The paper make use of (relatively) recent advances in complexity theory to show that of many common learning problems do not allow subquadratic time learning algorithms (given the veracity of the "Strong Exponential Time Hypothesis"). I appreciate that the authors do not oversell their results: They clearly state that they provide a worst-case analysis. Also, the results are not surprising. For instance, finding the exact solution of any kernel method requires the computation of the full kernel matrix, which is already quadratic in number of training examples. Reducing this computation time would imply that one can compute an approximation of the exact solution without computing the full kernel matrix, which is intuitively unlikely, unless he makes extra assumptions on the problem structure (i.e., the nature of the data-generating distribution).
Reviews: Deep Structured Prediction with Nonlinear Output Transformations
This paper studies the problem of training deep structured models (models where the dependencies between the output variables are explicitly modelled and some components are modelled via neural networks). The key idea of this paper is to give up the standard modelling assumption of structured prediction: the score (or the energy) function is the sum of summands (potentials). Instead of using the sum the paper puts an arbitrary non-linear (a neural network) transformation on top of the potentials. The paper develops an inference (MAP prediction) technique for such models which is based on Lagrangian decomposition (often referred to as dual decomposition, see details below). The training of the model is done by combining this inference technique with the standard Structure SVM (SSVM) objective.
Reviews: On Structured Prediction Theory with Calibrated Convex Surrogate Losses
The paper examines consistency of surrogate losses for multiclass prediction. The authors present their results using the formalism of structured prediction. Alas, there is no direct connection or exploitation of the separability of structured prediction losses. The paper is overly notated and variables are frequently overloaded. I got the feeling that the authors are trying to look mathematically fancy at the expense of readability.
Reviews: The Sample Complexity of Semi-Supervised Learning with Nonparametric Mixture Models
This leads to a much more general analysis than earlier work in SSL, both considering misspecification of the mixture and more than 2 classes. They propose several methods to recover the true mapping of decision regions to classes, for which they show both the sample complexity and show empirical results of the probability of correct recovery in three example simulations.
Reviews: Efficient Modeling of Latent Information in Supervised Learning using Gaussian Processes
I maintain my assessment and do not recommend publication at this stage. The core contribution is representing conditions with latent variables, and deriving a VI algorithm to cope with intractibility. This is interesting, but the discussion around it could be much improved. Some possible improvements are addressed in the author feedback, eg I'm not sure how Fig 1 could have been understood without the complementary explanation brought up in the feedback. Beyond what has been addressed in the author feedback, some work is needed to make this paper appealing (which the idea under study, the method and the results seem to call for): - clarifying the mathematical formulation, eg what forms of k_H are we examining, provide a full probabilistic model summary of the model, point out design choices - pointing out differences or similarities with existing work - remove gratuitous reference to deep learning in intro (it detracts) - make sure that all important questions a reader might have are addressed # Overall assessment The issue addressed (modelling univariate outputs which were generated under different, known conditions) and the modelling choice (representing conditions as latent variables) are interesting.
Reviews: Aggressive Sampling for Multi-class to Binary Reduction with Applications to Text Classification
Summary: This paper proposes a new reduction from multi-class classification to binary classification that is especially suitable when the number of classes is very large. They consider a hypothesis that map (input,class) pairs to scores, and the underlying loss function counts the fraction of the wrong classes that are scored higher than the true class. More specifically, they suppose they have a feature transformation phi that maps (input,class) pairs to a p-dimensional feature space, and they learn a mapping from R p to scores. Their reduction extends the work of Joshi et al. (2015) which, for each data point (x,y), creates K-1 transformed points where each transformed point intuitively corresponds to the comparison of label y with some incorrect label y'. Given that the transformed dataset contains correlated training examples, many standard generalization bounds cannot be applied.
Reviews: Learning to Model the Tail
Summary ------- The paper proposes an approach for transfer learning for multi-class classification problems that aids the learning of categories with few training examples (the categories in the tail of the distribution of numbers of examples per category). It is based on ideas of meta-learning: it builds a (meta-)model of the dynamics that accompany the change in model parameters as more training data is made available to a classifier. Specifically, the proposed approach takes inspiration from existing work on meta-learning [20] but extends it by applying it to CNNs, utilizing deep residual networks as the meta-model, and applying the framework to a general'long-tail problem' setting in which the number of training examples available is different and not fixed between categories. Experiments are conducted on curated versions of existing datasets (curated such that they exhibit strong long-tail distributions): SUN-397 [13], Places [7], and ImageNet [5]. The performance of the proposed method is demonstrated to be considerably higher than several more adhoc baselines from the literature.
Reviews: A Smoother Way to Train Structured Prediction Models
Overview: This paper proposes an accelerated variance-reduction algorithm for training structured predictors. In this approach the training objective is augmented with a proximal term anchored with a momentum point (eq (3)), the loss is smoothed using Nesterov's smoothing method (adding entropy or L2 to the dual), and a linear-rate solver (SVRG) is applied to the resulting objective in the inner loop. This achieves accelerated convergence rates for training. Comments: * I think that the connection to structured prediction is somewhat weak. In particular, the analysis uses the finite sum and smoothability of the training objective.
Reviews: Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding
This paper uses neural networks to parse visual scenes and language queries, transforming them into a logical representation that can be used to compute the output of the query on the scene. The logical representation is learned via a combination of direct supervision via a small number of traces and fine-tuning using end-to-end reinforcement learning. Advantages of the approach over existing approaches include: Reduction in the number of training examples, a more interpretable inference process and substantially increased accuracy. The overall approach shows great promise in increasing the performance of neural architectures by incorporating a symbolic component, as well as making them more robust, interpretable and debuggable. So I think this is a good direction for AI research to go in.
Reviews: Diverse Ensemble Evolution: Curriculum Data-Model Marriage
This paper proposes a new technique for training ensembles of predictors for supervised-learning tasks. Their main insight is to train individual members of the ensemble in a manner such that they specialize on different parts of the dataset reducing redundancy amongst members and better utilizing the capacity of the individual members. The hope is that ensembles formed out of such predictors will perform better than traditional ensembling techniques. The proposed technique explicitly enforces diversity in two ways: 1. inter-model diversity which makes individual models (predictors) different from each other and 2. intra-model diversity which makes predictors choose data points which are not all similar to each other so that they don't specialize in a very narrow region of the data distribution. This is posed as a bipartite graph matching problem which aims to find a matching between samples and models by selecting edges such that the smallest sum of edge costs is chosen (this is inverted to a maximization problem by subtracting from the highest constant cost one can have on the edges.) To avoid degenerate assignments another matching constraint is introduced which restricts the size of samples selected by each model as well.