Ciliberto, Carlo
Hyperbolic Manifold Regression
Marconi, Gian Maria, Rosasco, Lorenzo, Ciliberto, Carlo
Geometric representation learning has recently shown great promise in several machine learning settings, ranging from relational learning to language processing and generative models. In this work, we consider the problem of performing manifold-valued regression onto a hyperbolic space as an intermediate component for a number of relevant machine learning applications. In particular, by formulating the problem of predicting nodes of a tree as a manifold regression task in the hyperbolic space, we propose a novel perspective on two challenging tasks: 1) hierarchical classification via label embeddings and 2) taxonomy extension of hyperbolic representations. To address the regression problem we consider previous methods and propose two novel, computationally more advantageous approaches: a parametric deep learning model informed by the geodesics of the target space and a non-parametric kernel method for which we also prove excess risk bounds. Our experiments show that the strategy of leveraging the hyperbolic geometry is promising. In particular, in the taxonomy expansion setting, we find that the hyperbolic-based estimators significantly outperform methods performing regression in the ambient Euclidean space.
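As a concrete illustration of the geometry involved, the following minimal sketch (our own naming, not the paper's code) computes the geodesic distance on the Poincaré ball model of hyperbolic space and uses its square as a regression loss; a geodesic-informed parametric model could train against such a loss, provided the network outputs are mapped into the unit ball (e.g., via a scaled tanh).

    import torch

    def poincare_distance(u, v, eps=1e-5):
        # Geodesic distance on the Poincare ball (points with norm < 1):
        # d(u, v) = arcosh(1 + 2||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2)))
        sq_u = u.pow(2).sum(-1)
        sq_v = v.pow(2).sum(-1)
        sq_diff = (u - v).pow(2).sum(-1)
        x = 1 + 2 * sq_diff / ((1 - sq_u).clamp_min(eps) * (1 - sq_v).clamp_min(eps))
        return torch.acosh(x.clamp_min(1 + eps))

    def geodesic_regression_loss(pred, target):
        # Squared geodesic error, replacing the Euclidean squared loss.
        return poincare_distance(pred, target).pow(2).mean()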
Localized Structured Prediction
Ciliberto, Carlo, Bach, Francis, Rudi, Alessandro
Key to structured prediction is exploiting the problem's structure to simplify the learning process. A major challenge arises when data exhibit a local structure (i.e., are made of "parts") that can be leveraged to better approximate the relation between (parts of) the input and (parts of) the output. Recent literature on signal processing, and in particular computer vision, shows that capturing these aspects is indeed essential to achieve state-of-the-art performance. However, in this context algorithms are typically derived on a case-by-case basis. In this work we propose the first theoretical framework to deal with part-based data from a general perspective and study a novel method within the setting of statistical learning theory.
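To make the notion of part-based data concrete, here is a minimal, hypothetical sketch (names and example are ours) of a loss that decomposes over parts of the output, in the spirit of the local structure discussed above:

    def localized_loss(y_pred, y_true, parts, part_loss):
        # Loss decomposing over parts: L(y, y') = sum over parts p of loss(y[p], y'[p]).
        return sum(part_loss(y_pred[p], y_true[p]) for p in parts)

    # Example: strings compared over sliding windows of length 2.
    y_pred, y_true = "ACGGT", "ACGTT"
    parts = [slice(i, i + 2) for i in range(len(y_true) - 1)]
    hamming = lambda a, b: sum(c != d for c, d in zip(a, b))
    print(localized_loss(y_pred, y_true, parts, hamming))  # 2

Errors in one part are counted by every window covering it, which is exactly the kind of overlap structure a part-based method must account for.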
Learning To Learn Around A Common Mean
Denevi, Giulia, Ciliberto, Carlo, Stamos, Dimitris, Pontil, Massimiliano
The problem of learning-to-learn (LTL) or meta-learning is gaining increasing attention due to recent empirical evidence of its effectiveness in applications. The goal addressed in LTL is to select an algorithm that works well on tasks sampled from a meta-distribution. In this work, we consider the family of algorithms given by a variant of Ridge Regression, in which the regularizer is the squared distance to an unknown mean vector. We show that, in this setting, the LTL problem can be reformulated as a Least Squares (LS) problem, and we introduce a novel meta-algorithm to solve it efficiently. At each iteration the meta-algorithm processes only one dataset.
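For concreteness, the inner algorithm admits a closed form. The sketch below (our notation, not the paper's code) implements ridge regression biased towards a vector h; the meta-problem is then to choose h with low average error across tasks, which is what the abstract's LS reformulation addresses.

    import numpy as np

    def biased_ridge(X, y, h, lam):
        # argmin_w (1/n) ||Xw - y||^2 + lam ||w - h||^2, whose solution is
        # w = h + (X^T X + n lam I)^{-1} X^T (y - X h).
        n, d = X.shape
        A = X.T @ X + n * lam * np.eye(d)
        return h + np.linalg.solve(A, X.T @ (y - X @ h))

Note that for large lam the solution collapses to h, while for small lam it approaches ordinary least squares, so the choice of h matters most in the data-scarce regime typical of LTL.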
A General Framework for Consistent Structured Prediction with Implicit Loss Embeddings
Ciliberto, Carlo, Rosasco, Lorenzo, Rudi, Alessandro
We propose and analyze a novel theoretical and algorithmic framework for structured prediction. While so far the term has referred to discrete output spaces, here we consider more general settings, such as manifolds or spaces of probability measures. We define structured prediction as a problem where the output space lacks a vectorial structure. We identify and study a large class of loss functions that implicitly define a suitable geometry on the problem. The latter is key to developing an algorithmic framework amenable to a sharp statistical analysis and yielding efficient computations. When dealing with output spaces of infinite cardinality, a suitable implicit formulation of the estimator is shown to be crucial.
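A minimal sketch of the kind of kernel-based estimator this framework yields (our function names; decoding over an explicit candidate set is shown for illustration only): learn weights alpha(x) by kernel ridge regression over the training inputs, then predict by minimizing the alpha-weighted loss against the training outputs.

    import numpy as np

    def fit_inverse(K_train, lam):
        # Precompute (K + n lam I)^{-1} from the training kernel matrix.
        n = K_train.shape[0]
        return np.linalg.inv(K_train + n * lam * np.eye(n))

    def predict(k_x, inv_mat, candidates, train_outputs, loss):
        # f(x) = argmin over y of sum_i alpha_i(x) L(y, y_i),
        # with alpha(x) = (K + n lam I)^{-1} k_x and (k_x)_i = k(x, x_i).
        alpha = inv_mat @ k_x
        scores = [sum(a * loss(y, yi) for a, yi in zip(alpha, train_outputs))
                  for y in candidates]
        return candidates[int(np.argmin(scores))]

For output spaces of infinite cardinality the candidate enumeration above is impossible, which is exactly where the implicit formulation of the estimator mentioned in the abstract becomes crucial.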
Fast quantum learning with statistical guarantees
Ciliberto, Carlo, Rocchetto, Andrea, Rudi, Alessandro, Wossnig, Leonard
A wide class of quantum algorithms for learning problems exploit fast quantum linear algebra subroutines to achieve runtimes that are exponentially faster than their classical counterparts [Cil+18]. Examples of these algorithms are quantum support vector machines [RML14], quantum linear regression [WBL12; SSP16], and quantum least squares [KP17; CGJ18]. A careful analysis of these algorithms identified a number of caveats that limit their practical applicability, such as the need for a strong form of quantum access to the input data, restrictions on structural properties of the data matrix (such as condition number or sparsity), and modes of access to the output [Aar15]. Furthermore, if one assumes that it is efficient to (classically) sample elements of the training data in a way proportional to their norm, then it is possible to show that classical algorithms are only polynomially slower (albeit the scaling of the quantum algorithms can be considerably better) [Tan18; CLW18; Chi+19a; GLT18; Chi+19b]. In this work we continue to investigate the limitations of quantum algorithms for learning problems.
Random Expert Distillation: Imitation Learning via Expert Policy Support Estimation
Wang, Ruohan, Ciliberto, Carlo, Amadori, Pierluigi, Demiris, Yiannis
We consider a specific setting of imitation learning, the task of policy learning from expert demonstrations, in which the learner only has a finite number of expert trajectories without any further access to the expert. Two broad categories of approaches to this setting are behavioral cloning (BC) (Pomerleau, 1991), which directly learns a policy mapping from states to actions with supervised learning on expert trajectories; and inverse reinforcement learning (IRL) (Ng & Russell, 2000; Abbeel & Ng, 2004), which learns a policy via reinforcement learning, using a cost function extracted from expert trajectories. Most notably, BC has been successfully applied to the task of autonomous driving (Bojarski et al., 2016; Bansal et al., 2018). Despite its simplicity, BC typically requires a large amount of training data to learn good policies, as it may suffer from compounding errors caused by covariate shift (Ross & Bagnell, 2010; Ross et al., 2011). BC is often used as a policy initialization step for further reinforcement learning (Nagabandi et al., 2018; Rajeswaran et al., 2017). IRL estimates a cost function from expert trajectories and uses reinforcement learning to derive policies. As the cost function evaluates the quality of trajectories rather than that of individual actions, IRL avoids the problem of compounding errors. IRL is effective on a wide range of problems, from continuous control benchmarks in the Mujoco environment (Ho & Ermon, 2016) to robot footstep planning (Ziebart et al., 2008). Generative Adversarial Imitation Learning (GAIL) (Ho & Ermon, 2016; Baram et al., 2017) connects IRL to the general framework of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014).
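To fix ideas on the behavioral cloning baseline described above, here is a minimal self-contained sketch (placeholder data, architecture, and hyperparameters are ours): BC is plain supervised regression from expert states to expert actions.

    import torch
    import torch.nn as nn

    states = torch.randn(1000, 8)    # expert states (placeholder data)
    actions = torch.randn(1000, 2)   # expert actions (placeholder data)

    policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    for _ in range(100):
        opt.zero_grad()
        loss = nn.functional.mse_loss(policy(states), actions)
        loss.backward()
        opt.step()

Because the policy is only supervised on states visited by the expert, small mistakes at test time drift the agent into unfamiliar states, which is the covariate-shift failure mode noted above.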
Sinkhorn Barycenters with Free Support via Frank-Wolfe Algorithm
Luise, Giulia, Salzo, Saverio, Pontil, Massimiliano, Ciliberto, Carlo
We present a novel algorithm to estimate the barycenter of arbitrary probability distributions with respect to the Sinkhorn divergence. Based on a Frank-Wolfe optimization strategy, our approach proceeds by populating the support of the barycenter incrementally, without requiring any pre-allocation. We consider discrete as well as continuous distributions, proving convergence rates of the proposed algorithm in both settings. Key elements of our analysis are a new result showing that, on compact domains, the Sinkhorn divergence has Lipschitz continuous gradient with respect to the Total Variation norm, and a characterization of the sample complexity of Sinkhorn potentials. Experiments validate the effectiveness of our method in practice.
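For reference, the sketch below implements the standard Sinkhorn scaling iterations between discrete measures and one common debiased form of the divergence (our simplification; the paper's exact definition may retain additional entropy terms):

    import numpy as np

    def entropic_ot(a, x, b, y, eps, n_iter=200):
        # Entropic OT between sum_i a_i delta_{x_i} and sum_j b_j delta_{y_j},
        # via classic Sinkhorn fixed-point iterations on the dual scalings.
        C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
        K = np.exp(-C / eps)
        u = np.ones_like(a)
        for _ in range(n_iter):
            v = b / (K.T @ u)
            u = a / (K @ v)
        P = u[:, None] * K * v[None, :]
        return (P * C).sum()

    def sinkhorn_divergence(a, x, b, y, eps):
        # Debiased: S(a, b) = OT(a, b) - (OT(a, a) + OT(b, b)) / 2.
        return (entropic_ot(a, x, b, y, eps)
                - 0.5 * entropic_ot(a, x, a, x, eps)
                - 0.5 * entropic_ot(b, y, b, y, eps))

In the Frank-Wolfe scheme, each iteration adds a single Dirac to the current barycenter estimate, which is why the support can be populated incrementally without any pre-allocation.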
Learning-to-Learn Stochastic Gradient Descent with Biased Regularization
Denevi, Giulia, Ciliberto, Carlo, Grazzi, Riccardo, Pontil, Massimiliano
The problem of learning-to-learn (LTL) [4, 30] has been receiving increasing attention in recent years, due to its practical importance [11, 26] and the theoretical challenge of statistically principled and efficient solutions [1, 2, 21, 23, 9, 10, 12]. The principal aim of LTL is to design a meta-learning algorithm that selects a supervised learning algorithm well suited to learning tasks from a prescribed family. To highlight the difference between the meta-learning algorithm and the learning algorithm, throughout the paper we will refer to the latter as the inner or within-task algorithm. The meta-algorithm is trained on a sequence of datasets associated with different learning tasks sampled from a meta-distribution (also called an environment in the literature). The performance of the selected inner algorithm is measured by the transfer risk [4, 18], that is, the average risk of the algorithm trained on a random dataset from the same environment. A key insight is that, when the learning tasks share specific similarities, the LTL framework provides a means to leverage such similarities and select an inner algorithm of low transfer risk. In this work, we consider environments of linear regression or binary classification tasks and assume that the associated weight vectors are all close to a common vector.
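A minimal sketch of the within-task algorithm in this setting (our notation; the paper analyzes the stochastic updates more carefully): SGD on a least squares objective whose regularizer pulls the weights towards the bias vector h, with h itself updated by the meta-algorithm across tasks.

    import numpy as np

    def inner_sgd(X, y, h, lam, lr=0.01, epochs=5):
        # SGD on (1/n) sum_i (x_i . w - y_i)^2 + lam ||w - h||^2, started at h.
        w = h.copy()
        n = len(y)
        for _ in range(epochs):
            for i in np.random.permutation(n):
                grad = 2 * (X[i] @ w - y[i]) * X[i] + 2 * lam * (w - h)
                w -= lr * grad
        return w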
Leveraging Low-Rank Relations Between Surrogate Tasks in Structured Prediction
Luise, Giulia, Stamos, Dimitris, Pontil, Massimiliano, Ciliberto, Carlo
We study the interplay between surrogate methods for structured prediction and techniques from multitask learning designed to leverage relationships between surrogate outputs. We propose an efficient algorithm based on trace norm regularization which, differently from previous methods, does not require explicit knowledge of the coding/decoding functions of the surrogate framework. As a result, our algorithm can be applied to the broad class of problems in which the surrogate space is large or even infinite dimensional. We study excess risk bounds for trace norm regularized structured prediction, implying consistency and learning rates for our estimator. We also identify relevant regimes in which our approach can enjoy better generalization performance than previous methods. Numerical experiments on ranking problems indicate that enforcing low-rank relations among surrogate outputs may indeed provide a significant advantage in practice.
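The abstract does not spell out the optimizer, but trace norm penalties are typically handled with proximal gradient methods; for reference, the proximal operator of the trace norm is singular value soft-thresholding (a standard result, sketched below with our naming):

    import numpy as np

    def trace_norm_prox(W, tau):
        # prox of tau * ||W||_*: soft-threshold the singular values of W,
        # which encourages low-rank relations among the surrogate tasks.
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt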
Are we done with object recognition? The iCub robot's perspective
Pasquale, Giulia, Ciliberto, Carlo, Odone, Francesca, Rosasco, Lorenzo, Natale, Lorenzo
We report on an extensive study of the benefits and limitations of current deep learning approaches to object recognition in robot vision scenarios, introducing a novel dataset used for our investigation. To avoid the biases in currently available datasets, we consider a natural human-robot interaction setting to design a data-acquisition protocol for visual object recognition on the iCub humanoid robot. Analyzing the performance of off-the-shelf models trained off-line on large-scale image retrieval datasets, we show the necessity for knowledge transfer. We evaluate different ways in which this last step can be done, and identify the major bottlenecks affecting robotic scenarios. By studying both object categorization and identification problems, we highlight key differences between object recognition in robotics applications and in image retrieval tasks, for which the considered deep learning approaches were originally designed. In a nutshell, our results confirm the remarkable improvements yielded by deep learning in this setting, while pointing to specific open challenges that need to be addressed for seamless deployment in robotics.