Raginsky, Maxim
Rademacher Complexity of Neural ODEs via Chen-Fliess Series
Hanson, Joshua, Raginsky, Maxim
Several recent works have examined continuous-depth idealizations of deep neural nets, viewing them as continuous-time ordinary differential equation (ODE) models with either fixed or time-varying parameters. Traditional discrete-layer nets can be recovered by applying an appropriate temporal discretization scheme, e.g., the Euler or Runge-Kutta methods. In applications, this perspective has resulted in advantages concerning regularization (Kelly et al., 2020; Kobyzev et al., 2021; Pal et al., 2021), efficient parameterization (Queiruga et al., 2020), convergence speed (Chen et al., 2023), and applicability to non-uniform data (Sahin and Kozat, 2019), among others. As a theoretical tool, continuous-depth idealizations have led to better understanding of the contribution of depth to model expressiveness and generalizability (Marion, 2023; Massaroli et al., 2020), new or improved training strategies via framing as an optimal control problem (Corbett and Kangin, 2022), and novel model variations (Jia and Benson, 2019; Peluchetti and Favaro, 2020). Considered as generic control systems, continuous-depth nets can admit a number of distinct input-output configurations depending on how the roles in the control system "anatomy" are assigned. Controlled neural ODEs (Kidger et al., 2020) and continuous-time recurrent neural nets (Fermanian et al., 2021) treat the (time-varying) control signal as the input to the model; the initial condition is either fixed or treated as a trainable parameter; the (time-varying) output signal is the model output; and any free parameters of the vector fields (weights) are held constant in time.
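As a minimal sketch of the discretization correspondence mentioned above (the tanh vector field, dimensions, and step sizes here are arbitrary illustrative choices, not taken from the paper), Euler-stepping a fixed-parameter neural ODE yields residual-network-style layers, and refining the step size approaches the continuous-depth flow:

```python
import numpy as np

def vector_field(x, W, b):
    # A simple parametric vector field f(x; W, b) = tanh(W x + b).
    return np.tanh(W @ x + b)

def neural_ode_euler(x0, W, b, T=1.0, n_steps=10):
    # Euler discretization: x_{k+1} = x_k + h * f(x_k), i.e., a
    # depth-n residual net with weights shared across layers.
    h = T / n_steps
    x = x0
    for _ in range(n_steps):
        x = x + h * vector_field(x, W, b)
    return x

rng = np.random.default_rng(0)
d = 4
W = rng.standard_normal((d, d)) / np.sqrt(d)
b = rng.standard_normal(d)
x0 = rng.standard_normal(d)
x_coarse = neural_ode_euler(x0, W, b, n_steps=10)
x_fine = neural_ode_euler(x0, W, b, n_steps=1000)
# The two discretizations approximate the same continuous-time flow.
print(np.linalg.norm(x_coarse - x_fine))
```

A Runge-Kutta scheme would play the same role with a higher-order update in place of the single Euler step.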
Transformer-Based Models Are Not Yet Perfect At Learning to Emulate Structural Recursion
Zhang, Dylan, Tigges, Curt, Zhang, Zory, Biderman, Stella, Raginsky, Maxim, Ringer, Talia
This paper investigates the ability of transformer-based models to learn structural recursion from examples. Recursion is a universal concept in both natural and formal languages. Structural recursion is central to the programming language and formal mathematics tasks where symbolic tools currently excel beyond neural models, such as inferring semantic relations between datatypes and emulating program behavior. We introduce a general framework that connects the abstract concepts of structural recursion in the programming language domain to concrete sequence modeling problems and learned models' behavior. The framework includes a representation that captures the general \textit{syntax} of structural recursion, coupled with two different perspectives on its \textit{semantics} -- one that is more natural from a programming languages standpoint and one that helps bridge that perspective with a mechanistic understanding of the underlying transformer architecture. With our framework as a conceptual tool, we identify several issues under various setups. Models trained to emulate recursive computations do not fully capture the recursion but instead fit shortcut algorithms, and thus cannot solve certain edge cases that are under-represented in the training distribution. In addition, state-of-the-art large language models (LLMs) struggle to mine recursive rules from in-context demonstrations. Meanwhile, these LLMs fail in interesting ways when emulating reduction (step-wise computation) of the recursive function.
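For concreteness (an illustrative example, not necessarily one of the paper's benchmarks), a structurally recursive function is one that recurses only on the immediate substructure of an inductive datatype, as in addition on Peano numerals:

```python
# Peano naturals as an inductive datatype: a numeral is either
# zero, ("Z",), or a successor, ("S", predecessor).
Z = ("Z",)

def S(n):
    return ("S", n)

def add(m, n):
    # Structural recursion: the recursive call is made only on the
    # immediate substructure (the predecessor) of the first argument,
    # so termination is guaranteed by the datatype itself.
    if m == Z:
        return n
    _, pred = m
    return S(add(pred, n))

def to_int(n):
    count = 0
    while n != Z:
        count += 1
        n = n[1]
    return count

two = S(S(Z))
three = S(S(S(Z)))
print(to_int(add(two, three)))  # 5
```

Emulating such a function as a sequence modeling task means mapping serialized input terms like `S(S(Z)) + S(S(S(Z)))` to their serialized normal forms.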
A unified framework for information-theoretic generalization bounds
Chu, Yifeng, Raginsky, Maxim
This paper presents a general methodology for deriving information-theoretic generalization bounds for learning algorithms. The main technical tool is a probabilistic decorrelation lemma based on a change of measure and a relaxation of Young's inequality in $L_{\psi_p}$ Orlicz spaces. Using the decorrelation lemma in combination with other techniques, such as symmetrization, couplings, and chaining in the space of probability measures, we obtain new upper bounds on the generalization error, both in expectation and in high probability, and recover as special cases many of the existing generalization bounds, including the ones based on mutual information, conditional mutual information, stochastic chaining, and PAC-Bayes inequalities. In addition, the Fernique-Talagrand upper bound on the expected supremum of a subgaussian process emerges as a special case.
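The prototypical bound recovered by such arguments (here the mutual-information bound of Xu and Raginsky, 2017, stated as context rather than as this paper's result) reads:

```latex
% If the loss \ell(w, Z) is \sigma-subgaussian for every hypothesis w,
% then the expected generalization error satisfies
\bigl| \mathbb{E}[\mathrm{gen}(W, S)] \bigr|
  \le \sqrt{\frac{2\sigma^2}{n}\, I(W; S)},
% where S = (Z_1, \dots, Z_n) is the training sample, W is the output
% of the learning algorithm, and I(W; S) is their mutual information.
```

The decorrelation lemma generalizes the change-of-measure step behind this inequality from the subgaussian ($\psi_2$) case to general $L_{\psi_p}$ Orlicz norms.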
A Constructive Approach to Function Realization by Neural Stochastic Differential Equations
Veeravalli, Tanya, Raginsky, Maxim
The problem of function approximation by neural dynamical systems has typically been approached in a top-down manner: Any continuous function can be approximated to an arbitrary accuracy by a sufficiently complex model with a given architecture. This can lead to high-complexity controls which are impractical in applications. In this paper, we take the opposite, constructive approach: We impose various structural restrictions on system dynamics and consequently characterize the class of functions that can be realized by such a system. The systems are implemented as a cascade interconnection of a neural stochastic differential equation (Neural SDE), a deterministic dynamical system, and a readout map. Both probabilistic and geometric (Lie-theoretic) methods are used to characterize the classes of functions realized by such systems.
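A minimal simulation sketch of the first stage of such a cascade (the tanh drift, constant diffusion, and linear readout below are arbitrary placeholder choices, not the structural families studied in the paper): an Euler-Maruyama discretization of a neural SDE followed by a readout map, with the realized function estimated by Monte Carlo averaging over Brownian paths.

```python
import numpy as np

def drift(x, theta):
    # Placeholder parametric drift b(x; theta) = tanh(theta x).
    return np.tanh(theta @ x)

def neural_sde_em(x0, theta, sigma=0.1, T=1.0, n_steps=100, rng=None):
    # Euler-Maruyama scheme: x_{k+1} = x_k + b(x_k) h + sigma dW_k,
    # with dW_k ~ N(0, h I) the Brownian increment.
    rng = rng or np.random.default_rng(0)
    h = T / n_steps
    x = x0.copy()
    for _ in range(n_steps):
        dW = rng.standard_normal(x.shape) * np.sqrt(h)
        x = x + drift(x, theta) * h + sigma * dW
    return x

def readout(x, v):
    # Linear readout map applied to the terminal state.
    return v @ x

rng = np.random.default_rng(1)
d = 3
theta = rng.standard_normal((d, d)) / np.sqrt(d)
v = rng.standard_normal(d)
x0 = rng.standard_normal(d)
# Monte Carlo estimate of the realized function value at x0.
samples = [readout(neural_sde_em(x0, theta, rng=rng), v) for _ in range(200)]
print(np.mean(samples))
```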
Generalization Bounds: Perspectives from Information Theory and PAC-Bayes
Hellström, Fredrik, Durisi, Giuseppe, Guedj, Benjamin, Raginsky, Maxim
A fundamental question in theoretical machine learning is generalization. Over the past decades, the PAC-Bayesian approach has been established as a flexible framework to analyze the generalization capabilities of machine learning algorithms and to design new ones. Recently, it has garnered increased interest due to its potential applicability to a variety of learning algorithms, including deep neural networks. In parallel, an information-theoretic view of generalization has developed, wherein the relation between generalization and various information measures has been established. This framework is intimately connected to the PAC-Bayesian approach, and a number of results have been independently discovered in both strands. In this monograph, we highlight this strong connection and present a unified treatment of generalization. We present techniques and results that the two perspectives have in common, and discuss the approaches and interpretations that differ. In particular, we demonstrate how many proofs in the area share a modular structure, through which the underlying ideas can be intuited. We pay special attention to the conditional mutual information (CMI) framework; analytical studies of the information complexity of learning algorithms; and the application of the proposed methods to deep learning. This monograph is intended to provide a comprehensive introduction to information-theoretic generalization bounds and their connection to PAC-Bayes, serving as a foundation from which the most recent developments are accessible. It is aimed broadly at researchers with an interest in generalization and theoretical machine learning.
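A representative PAC-Bayesian bound of the kind treated in this literature (one standard McAllester-type form, stated here for orientation; constants vary across versions):

```latex
% For a data-free prior P over hypotheses, with probability at least
% 1 - \delta over the i.i.d. sample S of size n, simultaneously for
% all posteriors Q,
\mathbb{E}_{w \sim Q}\bigl[L(w)\bigr]
  \le \mathbb{E}_{w \sim Q}\bigl[\widehat{L}_S(w)\bigr]
  + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln(2\sqrt{n}/\delta)}{2n}},
% where L and \widehat{L}_S denote the population and empirical risks.
```

Replacing the KL divergence by a mutual information term and the uniform-over-$Q$ guarantee by an in-expectation one yields the information-theoretic counterpart; the monograph's modular proofs make this correspondence explicit.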
Can Transformers Learn to Solve Problems Recursively?
Zhang, Shizhuo Dylan, Tigges, Curt, Biderman, Stella, Raginsky, Maxim, Ringer, Talia
Neural networks have in recent years shown promise for helping software engineers write programs and even formally verify them. While semantic information plays a crucial part in these processes, it remains unclear to what degree popular neural architectures like transformers are capable of modeling that information. This paper examines the behavior of neural networks that learn algorithms relevant to programs and formal verification proofs, through the lens of mechanistic interpretability, focusing in particular on structural recursion. Structural recursion is at the heart of tasks on which symbolic tools currently outperform neural models, like inferring semantic relations between datatypes and emulating program behavior. We evaluate the ability of transformer models to learn to emulate the behavior of structurally recursive functions from input-output examples. Our evaluation includes empirical and conceptual analyses of the limitations and capabilities of transformer models in approximating these functions, as well as reconstructions of the ``shortcut'' algorithms the model learns. By reconstructing these algorithms, we are able to correctly predict 91 percent of failure cases for one of the approximated functions. Our work provides a new foundation for understanding the behavior of neural networks that fail to solve the very tasks they are trained for.
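To illustrate what a shortcut algorithm can look like (a hypothetical example for intuition; it is not one of the reconstructed algorithms from the paper): a truly recursive target such as parity can be mimicked by a bounded-window heuristic that matches short, in-distribution inputs exactly yet fails on longer, under-represented ones.

```python
def parity_recursive(bits):
    # Ground-truth structural recursion over a list of bits.
    if not bits:
        return 0
    return bits[0] ^ parity_recursive(bits[1:])

def parity_shortcut(bits, window=8):
    # Hypothetical shortcut: attend only to a bounded prefix. This
    # agrees with the true function on all inputs shorter than the
    # window, but breaks on longer edge cases.
    return sum(bits[:window]) % 2

short = [1, 0, 1, 1]   # in-distribution: the two agree
long_ = [1] * 9        # under-represented edge case: they disagree
print(parity_recursive(short), parity_shortcut(short))  # 1 1
print(parity_recursive(long_), parity_shortcut(long_))  # 1 0
```

Reconstructing such a heuristic is what makes failure cases predictable: any input where the heuristic and the true recursion diverge is a forecastable error.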
Majorizing Measures, Codes, and Information
Chu, Yifeng, Raginsky, Maxim
The majorizing measure theorem of Fernique and Talagrand is a fundamental result in the theory of random processes. It relates the boundedness of random processes indexed by elements of a metric space to complexity measures arising from certain multiscale combinatorial structures, such as packing and covering trees. This paper builds on the ideas first outlined in a little-noticed preprint of Andreas Maurer to present an information-theoretic perspective on the majorizing measure theorem, according to which the boundedness of random processes is phrased in terms of the existence of efficient variable-length codes for the elements of the indexing metric space.
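For reference, the theorem in question can be stated as follows (the standard generic chaining formulation, given here as background):

```latex
% For a centered Gaussian process (X_t)_{t \in T} with canonical
% metric d(s,t) = (\mathbb{E}|X_s - X_t|^2)^{1/2}, there exist
% universal constants c, C > 0 such that
c\, \gamma_2(T, d) \;\le\; \mathbb{E} \sup_{t \in T} X_t \;\le\; C\, \gamma_2(T, d),
% where \gamma_2(T, d) = \inf \sup_{t \in T} \sum_{k \ge 0} 2^{k/2}\, d(t, T_k),
% the infimum taken over admissible sequences of subsets T_k \subseteq T
% with |T_0| = 1 and |T_k| \le 2^{2^k}.
```

The information-theoretic reformulation replaces the admissible sequences $(T_k)$ by prefix-free codes for elements of $T$, with code lengths constrained by a Kraft-type inequality.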
A Chain Rule for the Expected Suprema of Bernoulli Processes
Chu, Yifeng, Raginsky, Maxim
Sharp bounds on the suprema of Bernoulli processes are ubiquitous in applications of probability theory. For example, in statistical learning theory, they arise as so-called Rademacher complexities that quantify the effective "size" of a function class in the context of a given learning task [1]. To evaluate the generalization error bound of a learning algorithm, one usually needs to obtain a good estimate on the (empirical) Rademacher complexity of a suitable function class. In many scenarios, the function class at hand is composite, and separate assumptions are imposed on the constituent classes forming the composition. Thus, it is desirable to obtain a bound that takes into account the properties of the function classes in the composition.
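For a finite function class evaluated on a fixed sample, the empirical Rademacher complexity can be estimated directly by Monte Carlo over the random signs (a generic illustration of the quantity being bounded, not the paper's method; the toy class below is arbitrary):

```python
import numpy as np

def empirical_rademacher(function_outputs, n_draws=2000, rng=None):
    # function_outputs: (m, n) array; row j holds (f_j(x_1), ..., f_j(x_n))
    # for a finite class {f_1, ..., f_m} on a fixed sample of size n.
    # Estimates R_n = E_sigma[ sup_f (1/n) sum_i sigma_i f(x_i) ]
    # by averaging over Rademacher sign vectors sigma in {-1, +1}^n.
    rng = rng or np.random.default_rng(0)
    m, n = function_outputs.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    # (n_draws, m): correlation of each sign pattern with each function.
    corr = sigma @ function_outputs.T / n
    return corr.max(axis=1).mean()

rng = np.random.default_rng(42)
# Toy class: 50 random {-1, +1}-valued functions on 100 sample points.
outputs = rng.choice([-1.0, 1.0], size=(50, 100))
print(empirical_rademacher(outputs, rng=rng))
```

For a composite class (e.g., Lipschitz functions applied to a base class), one instead wants a bound expressed through the constituent classes, which is the chain rule the paper develops.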
Variational Principles for Mirror Descent and Mirror Langevin Dynamics
Tzen, Belinda, Raj, Anant, Raginsky, Maxim, Bach, Francis
Mirror descent, introduced by Nemirovski and Yudin in the 1970s, is a primal-dual convex optimization method that can be tailored to the geometry of the optimization problem at hand through the choice of a strongly convex potential function. It arises as a basic primitive in a variety of applications, including large-scale optimization, machine learning, and control. This paper proposes a variational formulation of mirror descent and of its stochastic variant, mirror Langevin dynamics. The main idea, inspired by the classic work of Brezis and Ekeland on variational principles for gradient flows, is to show that mirror descent emerges as a closed-loop solution for a certain optimal control problem, and the Bellman value function is given by the Bregman divergence between the initial condition and the global minimizer of the objective function.
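A minimal concrete instance of mirror descent (background illustration, not the paper's variational construction): taking the negative-entropy potential on the probability simplex gives the exponentiated-gradient, or multiplicative-weights, update.

```python
import numpy as np

def mirror_descent_simplex(grad, x0, step=0.1, n_iters=500):
    # Mirror descent with the negative-entropy potential on the
    # probability simplex: the mirror map sends x to log x, so the
    # update is multiplicative followed by renormalization.
    x = x0.copy()
    for _ in range(n_iters):
        x = x * np.exp(-step * grad(x))
        x = x / x.sum()
    return x

# Minimize the linear objective <c, x> over the simplex; the minimum
# puts all mass on the smallest coordinate of c.
c = np.array([0.9, 0.2, 0.7])
x0 = np.ones(3) / 3
x_star = mirror_descent_simplex(lambda x: c, x0)
print(np.round(x_star, 3))  # mass concentrates on index 1
```

Swapping the entropy for another strongly convex potential tailors the same primal-dual scheme to a different geometry; adding Gaussian noise to the dual update yields the mirror Langevin dynamics studied in the paper.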
Nonlinear controllability and function representation by neural stochastic differential equations
Veeravalli, Tanya, Raginsky, Maxim
There has been a great deal of recent interest in learning and approximation of functions that can be expressed as expectations of a given nonlinearity with respect to its random internal parameters. Examples of such representations include "infinitely wide" neural nets, where the underlying nonlinearity is given by the activation function of an individual neuron. In this paper, we bring this perspective to function representation by neural stochastic differential equations (SDEs). A neural SDE is an It\^o diffusion process whose drift and diffusion matrix are elements of some parametric families. We show that the ability of a neural SDE to realize nonlinear functions of its initial condition can be related to the problem of optimally steering a certain deterministic dynamical system between two given points in finite time. This auxiliary system is obtained by formally replacing the Brownian motion in the SDE by a deterministic control input. We derive upper and lower bounds on the minimum control effort needed to accomplish this steering; these bounds may be of independent interest in the context of motion planning and deterministic optimal control.
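The "expectation over random internal parameters" viewpoint can be made concrete with a one-line Monte Carlo example (a generic infinite-width sketch with a ReLU nonlinearity and Gaussian parameters; these choices are illustrative, not the paper's construction):

```python
import numpy as np

def infinite_width_value(x, n_samples=20000, rng=None):
    # Monte Carlo approximation of f(x) = E_theta[ relu(theta . x) ]
    # with theta ~ N(0, I): the infinitely wide limit of a one-layer
    # net whose internal parameters are random.
    rng = rng or np.random.default_rng(0)
    theta = rng.standard_normal((n_samples, x.shape[0]))
    return np.maximum(theta @ x, 0.0).mean()

x = np.array([1.0, 0.0])
# For theta ~ N(0, I), E[relu(theta . x)] = ||x|| / sqrt(2 pi),
# so the estimate should be close to this closed form.
print(infinite_width_value(x), np.linalg.norm(x) / np.sqrt(2 * np.pi))
```

In the neural SDE setting, the role of the random internal parameters is played by the Brownian path, and the paper relates the realizable functions to a deterministic steering problem obtained by replacing that path with a control input.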