
Collaborating Authors

Tu, Stephen


Stability properties of gradient flow dynamics for the symmetric low-rank matrix factorization problem

arXiv.org Artificial Intelligence

The symmetric low-rank matrix factorization serves as a building block in many learning tasks, including matrix recovery and the training of neural networks. However, despite a flurry of recent research, the dynamics of its training via non-convex factorized gradient-descent-type methods is not fully understood, especially in the over-parameterized regime where the fitted rank is higher than the true rank of the target matrix. To overcome this challenge, we characterize equilibrium points of the gradient flow dynamics and examine their local and global stability properties. To facilitate a precise global analysis, we introduce a nonlinear change of variables that brings the dynamics into a cascade connection of three subsystems whose structure is simpler than the structure of the original system. We demonstrate that the Schur complement to a principal eigenspace of the target matrix is governed by an autonomous system that is decoupled from the rest of the dynamics. In the over-parameterized regime, we show that this Schur complement vanishes at an $O(1/t)$ rate, thereby capturing the slow dynamics that arises from excess parameters. We utilize a Lyapunov-based approach to establish exponential convergence of the other two subsystems. By decoupling the fast and slow parts of the dynamics, we offer new insight into the shape of the trajectories associated with local search algorithms and provide a complete characterization of the equilibrium points and their global stability properties. Such an analysis via nonlinear control techniques may prove useful in several related over-parameterized problems.
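
The dynamics studied in the abstract can be illustrated numerically. Below is a minimal sketch (not the paper's analysis) of gradient descent, a discretization of the gradient flow, on the symmetric factorization objective f(X) = 1/4 ||XX^T - M||_F^2 in the over-parameterized regime; the dimensions, step size, and initialization scale are illustrative assumptions.

```python
import numpy as np

# Gradient descent on f(X) = 1/4 * ||X X^T - M||_F^2 with fitted rank r
# larger than the true rank of the PSD target M (over-parameterization).
rng = np.random.default_rng(0)
n, true_rank, r = 8, 2, 4
U = rng.standard_normal((n, true_rank))
M = U @ U.T                        # PSD target of rank true_rank

X = 0.1 * rng.standard_normal((n, r))   # small random initialization
eta = 0.01                              # step size (illustrative)
losses = []
for _ in range(2000):
    E = X @ X.T - M
    losses.append(0.25 * np.linalg.norm(E, "fro") ** 2)
    X -= eta * (E @ X)             # gradient of f with respect to X

print(losses[0], losses[-1])       # the loss decreases substantially
```

In runs like this, the bulk of the error decays quickly while a residual associated with the excess rank shrinks much more slowly, consistent with the fast/slow decomposition described above.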


Shallow diffusion networks provably learn hidden low-dimensional structure

arXiv.org Machine Learning

Generative models learn to sample from a target probability distribution given a dataset of examples. Applications are pervasive, and include language modeling (Li et al., 2022), high-fidelity image generation (Rombach et al., 2022), de novo drug design (Watson et al., 2023), and molecular dynamics (Arts et al., 2023). Recent years have witnessed extremely rapid advancements in the field of generative modeling, particularly with the development of models based on dynamical transport of measure (Santambrogio, 2015), such as diffusion-based generative models (Ho et al., 2020; Song et al., 2021), stochastic interpolants (Albergo et al., 2023), flow matching (Lipman et al., 2023), and rectified flow (Liu et al., 2023) approaches. Yet, despite their strong empirical performance and well-grounded mathematical formulation, a theoretical understanding of how and why these large-scale generative models work is still in its infancy. A promising line of recent research has shown that the problem of sampling from an arbitrarily complex distribution can be reduced to unsupervised learning: for diffusion models, if an accurate velocity or score field can be estimated from data, then high-quality samples can be generated via numerical simulation (Chen et al., 2023a; Lee et al., 2023). While deeply insightful, these works leave open the difficulty of statistical estimation, and therefore raise the possibility that the sampling problem's true difficulty is hidden in the complexity of learning. In this work, we address this fundamental challenge by presenting an end-to-end analysis of sampling with score-based diffusion models. To balance tractability of the analysis with empirical relevance, we study the Barron space of single-layer neural networks (E et al., 2019; Bach, 2017).
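
The reduction from sampling to score estimation mentioned in the abstract can be made concrete in a toy case. The sketch below (an illustration, not the paper's setting) samples from N(0, 1) by integrating the probability-flow ODE of a variance-exploding diffusion, using the closed-form score of the noised Gaussian; the noise schedule and step counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_steps, sigma_max = 20_000, 500, 10.0

def score(x, sigma):
    # Score of N(0, 1) convolved with N(0, sigma^2), i.e. of N(0, 1 + sigma^2)
    return -x / (1.0 + sigma ** 2)

# Variance-exploding schedule sigma(t) = sigma_max * t, integrated from t = 1 to 0
ts = np.linspace(1.0, 0.0, n_steps + 1)
x = sigma_max * rng.standard_normal(n)        # approx. the fully noised marginal

for t0, t1 in zip(ts[:-1], ts[1:]):
    sigma = sigma_max * t0
    dsigma2 = (sigma_max * t0) ** 2 - (sigma_max * t1) ** 2
    x = x + 0.5 * dsigma2 * score(x, sigma)   # Euler step of the flow ODE

print(x.mean(), x.var())                      # samples are approximately N(0, 1)
```

With an exact score the simulation recovers the target; the paper's contribution is precisely to control what happens when the score must instead be estimated from data.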


Sharp Rates in Dependent Learning Theory: Avoiding Sample Size Deflation for the Square Loss

arXiv.org Artificial Intelligence

In this work, we study statistical learning with dependent ($\beta$-mixing) data and square loss in a hypothesis class $\mathscr{F}\subset L_{\Psi_p}$ where $\Psi_p$ is the norm $\|f\|_{\Psi_p} \triangleq \sup_{m\geq 1} m^{-1/p} \|f\|_{L^m} $ for some $p\in [2,\infty]$. Our inquiry is motivated by the search for a sharp noise interaction term, or variance proxy, in learning with dependent data. Absent any realizability assumption, typical non-asymptotic results exhibit variance proxies that are deflated \emph{multiplicatively} by the mixing time of the underlying covariates process. We show that whenever the topologies of $L^2$ and $\Psi_p$ are comparable on our hypothesis class $\mathscr{F}$ -- that is, $\mathscr{F}$ is a weakly sub-Gaussian class: $\|f\|_{\Psi_p} \lesssim \|f\|_{L^2}^\eta$ for some $\eta\in (0,1]$ -- the empirical risk minimizer achieves a rate that only depends on the complexity of the class and second order statistics in its leading term. Our result holds whether the problem is realizable or not and we refer to this as a \emph{near mixing-free rate}, since direct dependence on mixing is relegated to an additive higher order term. We arrive at our result by combining the above notion of a weakly sub-Gaussian class with mixed tail generic chaining. This combination allows us to compute sharp, instance-optimal rates for a wide range of problems. %Our approach, reliant on mixed tail generic chaining, allows us to obtain sharp, instance-optimal rates. Examples that satisfy our framework include sub-Gaussian linear regression, more general smoothly parameterized function classes, finite hypothesis classes, and bounded smoothness classes.


Learning Robust Output Control Barrier Functions from Safe Expert Demonstrations

arXiv.org Artificial Intelligence

We assume that a model of the system dynamics and a state estimator are available along with corresponding error bounds, e.g., estimated from data in practice. We first propose robust output control barrier functions (ROCBFs) as a means to guarantee safety, as defined through controlled forward invariance of a safe set. We then formulate an optimization problem to learn ROCBFs from expert demonstrations that exhibit safe system behavior, e.g., data collected from a human operator or an expert controller. When the parametrization of the ROCBF is linear, we show that, under mild assumptions, the optimization problem is convex. Along with the optimization problem, we provide verifiable conditions, in terms of the density of the data, the smoothness of the system model and state estimator, and the size of the error bounds, that guarantee validity of the obtained ROCBF. Towards a practical control algorithm, we propose an algorithmic implementation of our theoretical framework that accounts in practice for the assumptions made in our analysis. We empirically validate our algorithm in the autonomous driving simulator CARLA and demonstrate how to learn safe control laws from RGB camera images.
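
The control-barrier-function idea underlying ROCBFs can be illustrated in its simplest form, stripped of the robustness and output-feedback machinery of the paper. In this sketch (all dynamics and gains are illustrative assumptions), a single integrator x' = u with safe set {x >= 0} and barrier h(x) = x yields the CBF condition u >= -alpha * x, so the safety filter reduces to a clamp on the nominal input.

```python
# Minimal CBF safety filter for x' = u, safe set {x >= 0}, h(x) = x.
# The condition dh/dt >= -alpha * h becomes u >= -alpha * x.
dt, alpha = 0.01, 1.0
x = 1.0
traj = [x]
for t in range(1000):
    u_nom = -2.0                   # nominal controller pushes toward unsafe x < 0
    u = max(u_nom, -alpha * x)     # minimally modify u_nom to satisfy the CBF
    x = x + dt * u                 # Euler step of the dynamics
    traj.append(x)

print(min(traj))                   # state never leaves the safe set
```

Forward invariance holds here because each step satisfies x + dt*u >= x*(1 - dt*alpha) >= 0; learning an ROCBF from demonstrations replaces the hand-written h with a function fitted to safe expert data.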


Multi-Task Imitation Learning for Linear Dynamical Systems

arXiv.org Artificial Intelligence

Imitation learning (IL), which learns control policies by imitating expert demonstrations, has demonstrated success across a variety of domains including self-driving cars (Codevilla et al., 2018) and robotics (Schaal, 1999). However, using IL to learn a robust behavior policy may require a large amount of training data (Ross et al., 2011), and expert demonstrations are often expensive to collect. One remedy for this problem is multi-task learning: using data from other tasks (source tasks) in addition to data from the task of interest (target task) to jointly learn a policy. We study the application of multi-task learning to IL over linear systems, and demonstrate improved sample efficiency when learning a controller via representation learning. Our results expand on prior work that studies multi-task representation learning for supervised learning (Du et al., 2020; Tripuraneni et al., 2021), addressing the new challenges that arise in the imitation learning setting. First, the data for IL is temporally dependent, as it is generated from a dynamical system x[t+1] = f(x[t], u[t], w[t]). In contrast, the supervised learning setting assumes that both the train and test data are independent and identically distributed (i.i.d.) from the same underlying distribution. Furthermore, we are interested in the performance of the learned controller in closed-loop rather than its error on expert-controlled trajectories.
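
The single-task version of this setting can be sketched in a few lines: roll out an expert u = Kx on a linear system x[t+1] = Ax[t] + Bu[t] + w[t], then fit the policy by least squares on the logged (x, u) pairs. This is only an illustration of behavior cloning on a linear system; the matrices and noise levels are assumptions, and the paper's multi-task representation learning goes well beyond it.

```python
import numpy as np

rng = np.random.default_rng(5)
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # double-integrator-like dynamics
B = np.array([[0.0], [0.1]])
K = np.array([[-1.0, -1.5]])             # stabilizing expert gain (assumed)

X, U = [], []
x = rng.standard_normal(2)
for t in range(2000):
    u = K @ x + 0.01 * rng.standard_normal(1)     # expert action, small jitter
    X.append(x); U.append(u)
    x = A @ x + B @ u + 0.01 * rng.standard_normal(2)  # temporally dependent data

# Behavior cloning: regress logged actions on logged states
K_hat = np.linalg.lstsq(np.array(X), np.array(U), rcond=None)[0].T
print(np.abs(K_hat - K).max())           # recovers the expert gain closely
```

Note that the regression data here is a single correlated trajectory rather than i.i.d. samples, which is exactly the dependence issue the abstract highlights.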


The noise level in linear regression with dependent data

arXiv.org Machine Learning

Ordinary least squares (OLS) regression from a finite sample is one of the most ubiquitous and widely used techniques in machine learning. When faced with independent data, there are now sharp tools available to analyze its success optimally under relatively general assumptions. Indeed, a non-asymptotic theory matching the classical asymptotically optimal understanding from statistics [van der Vaart, 2000] has been developed over the last decade [Hsu et al., 2012, Oliveira, 2016, Mourtada, 2022]. However, once we relax the independence assumption and move toward data that exhibits correlations, the situation is much less well-understood--even for a problem as seemingly simple as linear regression. While sharp asymptotics are available through various limit theorems, there are no general results matching these in the finite sample regime. In this paper, we study the instance-specific performance of ordinary least squares in a setting with dependent data--and in contrast to much contemporary work on the theme--without imposing realizability.
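
The dependent-data setting is easy to instantiate numerically. The sketch below runs OLS on covariates generated by an AR(1) process, so consecutive samples are correlated rather than i.i.d.; the process parameters are illustrative, and unlike the paper's misspecified setting this toy is well-specified for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 5000, 3
theta_star = np.array([1.0, -2.0, 0.5])

# AR(1) covariate process: x[t] = 0.9 * x[t-1] + noise, hence dependent data
X = np.zeros((T, d))
for t in range(1, T):
    X[t] = 0.9 * X[t - 1] + rng.standard_normal(d)
y = X @ theta_star + 0.1 * rng.standard_normal(T)   # well-specified here

# OLS: theta_hat = argmin ||X theta - y||^2, via a least-squares solve
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.abs(theta_hat - theta_star).max())         # small estimation error
```

The question the paper asks is how the finite-sample error of exactly this estimator depends on the mixing of the covariate process, instance by instance.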


Agile Catching with Whole-Body MPC and Blackbox Policy Learning

arXiv.org Artificial Intelligence

We address a benchmark task in agile robotics: catching objects thrown at high speed. This is a challenging task that involves tracking, intercepting, and cradling a thrown object with access only to visual observations of the object and the proprioceptive state of the robot, all within a fraction of a second. We present the relative merits of two fundamentally different solution strategies: (i) Model Predictive Control using accelerated constrained trajectory optimization, and (ii) Reinforcement Learning using zeroth-order optimization. We provide insights into various performance trade-offs including sample efficiency, sim-to-real transfer, robustness to distribution shifts, and whole-body multimodality via extensive on-hardware experiments. We conclude with proposals on fusing "classical" and "learning-based" techniques for agile robot control. Videos of our experiments may be found at https://sites.google.com/view/agile-catching


Revisiting Energy Based Models as Policies: Ranking Noise Contrastive Estimation and Interpolating Energy Models

arXiv.org Artificial Intelligence

A crucial design decision for any robot learning pipeline is the choice of policy representation: what type of model should be used to generate the next set of robot actions? Owing to the inherent multi-modal nature of many robotic tasks, combined with the recent successes in generative modeling, researchers have turned to state-of-the-art probabilistic models such as diffusion models for policy representation. In this work, we revisit the choice of energy-based models (EBMs) as a policy class. We show that the prevailing folklore -- that energy models in high dimensional continuous spaces are impractical to train -- is false. We develop a practical training objective and algorithm for energy models which combines several key ingredients: (i) ranking noise contrastive estimation (R-NCE), (ii) learnable negative samplers, and (iii) non-adversarial joint training. We prove that our proposed objective function is asymptotically consistent and quantify its limiting variance. On the other hand, we show that the Implicit Behavior Cloning (IBC) objective is actually biased even at the population level, providing a mathematical explanation for the poor performance of IBC-trained energy policies in several independent follow-up works. We further extend our algorithm to learn a continuous stochastic process that bridges noise and data, modeling this process with a family of EBMs indexed by a scale variable. In doing so, we demonstrate that the core idea behind recent progress in generative modeling is actually compatible with EBMs. Altogether, our proposed training algorithms enable us to train energy-based models as policies which compete with -- and even outperform -- diffusion models and other state-of-the-art approaches in several challenging multi-modal benchmarks: obstacle avoidance path planning and contact-rich block pushing.


Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners

arXiv.org Artificial Intelligence

Large language models (LLMs) exhibit a wide range of promising capabilities -- from step-by-step planning to commonsense reasoning -- that may provide utility for robots, but remain prone to confidently hallucinated predictions. In this work, we present KnowNo, a framework for measuring and aligning the uncertainty of LLM-based planners such that they know when they don't know and ask for help when needed. KnowNo builds on the theory of conformal prediction to provide statistical guarantees on task completion while minimizing human help in complex multi-step planning settings. Experiments across a variety of simulated and real robot setups that involve tasks with different modes of ambiguity (e.g., from spatial to numeric uncertainties, from human preferences to Winograd schemas) show that KnowNo performs favorably over modern baselines (which may involve ensembles or extensive prompt tuning) in terms of improving efficiency and autonomy, while providing formal assurances. KnowNo can be used with LLMs out of the box without model-finetuning, and suggests a promising lightweight approach to modeling uncertainty that can complement and scale with the growing capabilities of foundation models. Website: https://robot-help.github.io
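
The split-conformal recipe that KnowNo builds on can be sketched in a few lines: calibrate a score threshold on held-out examples, then include every candidate option whose score clears it, and ask for help whenever the resulting prediction set is not a singleton. The scores below are synthetic stand-ins for LLM likelihoods of candidate plans; all distributions and numbers are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(2)
n_cal, alpha = 200, 0.1            # calibration size, target miscoverage

# Nonconformity score of the true option on each calibration example
# (lower model confidence -> higher nonconformity); synthetic here.
cal_scores = 1.0 - rng.beta(8, 2, size=n_cal)

# Conformal quantile with the finite-sample correction ceil((n+1)(1-a))/n
q_level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal
qhat = np.quantile(cal_scores, q_level, method="higher")

# Test time: keep every sufficiently confident option; a prediction set with
# more than one element signals uncertainty, so the robot asks for help.
option_conf = np.array([0.95, 0.03, 0.01, 0.01])   # candidate plan scores
prediction_set = np.where(1.0 - option_conf <= qhat)[0]
ask_for_help = len(prediction_set) != 1
print(prediction_set, ask_for_help)
```

The appeal of this construction is that the coverage guarantee holds regardless of how the confidence scores were produced, which is what lets it wrap an LLM planner without fine-tuning.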


Bootstrapped Representations in Reinforcement Learning

arXiv.org Artificial Intelligence

In reinforcement learning (RL), state representations are key to dealing with large or continuous state spaces. While one of the promises of deep learning algorithms is to automatically construct features well-tuned for the task they try to solve, such a representation might not emerge from end-to-end training of deep RL agents. To mitigate this issue, auxiliary objectives are often incorporated into the learning process and help shape the learnt state representation. Bootstrapping methods are today's method of choice to make these additional predictions. Yet, it is unclear which features these algorithms capture and how they relate to those from other auxiliary-task-based approaches. In this paper, we address this gap and provide a theoretical characterization of the state representation learnt by temporal difference learning (Sutton, 1988). Surprisingly, we find that this representation differs from the features learned by Monte Carlo and residual gradient algorithms for most transition structures of the environment in the policy evaluation setting. We describe the efficacy of these representations for policy evaluation, and use our theoretical analysis to design new auxiliary learning rules. We complement our theoretical results with an empirical comparison of these learning rules for different cumulant functions on classic domains such as the four-room domain (Sutton et al., 1999) and Mountain Car (Moore, 1990).
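
The bootstrapped update the paper analyzes is tabular TD(0) policy evaluation (Sutton, 1988). The sketch below shows only the update rule itself on a small, illustrative Markov reward process; the paper's subject is the representations such updates induce under function approximation, which this toy does not capture.

```python
import numpy as np

rng = np.random.default_rng(3)
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.9, 0.1],
              [0.1, 0.0, 0.9]])    # transition matrix of a 3-state chain
r = np.array([0.0, 0.0, 1.0])      # reward received on leaving each state
gamma = 0.9

# Ground-truth values solve the Bellman equation (I - gamma * P) v = r
v_true = np.linalg.solve(np.eye(3) - gamma * P, r)

v = np.zeros(3)
s = 0
for t in range(200_000):
    s_next = rng.choice(3, p=P[s])
    td_error = r[s] + gamma * v[s_next] - v[s]   # bootstrapped target
    v[s] += 0.05 * td_error                      # TD(0), constant step size
    s = s_next

print(v, v_true)   # v tracks v_true up to step-size-induced noise
```

With a constant step size the iterates hover in a noise ball around the true values; a decaying schedule would converge exactly, but the bootstrapped structure of the target is the same either way.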