

Appendix A Broader Impact

Neural Information Processing Systems

Overconfidence in deep neural networks can easily lead to deployments where predictions are made that should have been withheld. For the validation set, on the other hand, we care about the confidence of the "top predicted class". Independent binning: training samples and validation samples are grouped independently into their respective training-bins and validation-bins (Figure 1). The binning is adaptive, with 15 equal-mass bins. Figure 10: Common binning: training samples are grouped using the bin boundaries of the validation-bins.
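The adaptive equal-mass binning mentioned above can be sketched in a few lines. The function names, the synthetic data, and the use of expected calibration error (ECE) below are illustrative assumptions, not the paper's code:

```python
import numpy as np

def equal_mass_bins(confidences, n_bins=15):
    """Adaptive binning: boundaries are quantiles of the confidences,
    so every bin holds roughly the same number of samples."""
    edges = np.quantile(confidences, np.linspace(0.0, 1.0, n_bins + 1))
    edges[0], edges[-1] = 0.0, 1.0          # cover the whole [0, 1] range
    return edges

def expected_calibration_error(confidences, correct, edges):
    """Bin-mass-weighted gap between mean confidence and accuracy."""
    ece, n = 0.0, len(confidences)
    for b in range(len(edges) - 1):
        mask = (confidences >= edges[b]) & (confidences < edges[b + 1])
        if b == len(edges) - 2:             # last bin is right-closed
            mask |= confidences == edges[-1]
        if mask.any():
            ece += mask.sum() / n * abs(confidences[mask].mean()
                                        - correct[mask].mean())
    return ece

# Roughly calibrated synthetic top-class predictions.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, 2000)
correct = (rng.uniform(size=2000) < conf).astype(float)
edges = equal_mass_bins(conf)
ece = expected_calibration_error(conf, correct, edges)
```

Equal-mass (quantile) bins avoid the empty or near-empty bins that fixed-width binning produces when confidences pile up near 1.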


The Simplicity Bias in Multi-Task RNNs: Shared Attractors, Reuse of Dynamics, and Geometric Representation

Neural Information Processing Systems

The forces shaping the joint dynamics of multiple tasks, however, remain largely unexplored. In this work, we first construct a systematic framework to study multiple tasks in RNNs, minimizing interference from input and output correlations with the hidden representation.


Local Linear Convergence of Gradient Methods for Subspace Optimization via Strict Complementarity

Neural Information Processing Systems

In this work we bridge these two approaches under a strict complementarity assumption, which in particular implies that the optimal solution to the convex relaxation is unique and is also the optimal solution to the original nonconvex problem.



Bias in Motion: Theoretical Insights into the Dynamics of Bias in SGD Training

Neural Information Processing Systems

Machine learning systems often acquire biases by leveraging undesired features in the data, impacting accuracy variably across different sub-populations of the data. However, our current understanding of bias formation mostly focuses on the initial and final stages of learning, leaving a gap in knowledge regarding the transient dynamics. To address this gap, this paper explores the evolution of bias in a teacher-student setup that models different data sub-populations with a Gaussian-mixture model. We provide an analytical description of the stochastic gradient descent dynamics of a linear classifier in this setup, which we prove to be exact in high dimension. Notably, our analysis identifies different properties of the sub-populations that drive bias at different timescales, and hence shows a shifting preference of our classifier during training. By applying our general solution to fairness and robustness, we delineate how and when heterogeneous data and spurious features can generate and amplify bias.
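A minimal simulation in the spirit of this setup, a linear classifier trained by online SGD on two Gaussian sub-populations that share a class direction but differ in size and noise, shows per-group accuracy evolving on its own timescale. All group proportions, noise scales, and hyperparameters below are illustrative assumptions, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100
mu = rng.standard_normal(d) / np.sqrt(d)            # shared class direction
groups = [{"p": 0.8, "scale": 0.5},                 # majority, low noise
          {"p": 0.2, "scale": 2.0}]                 # minority, high noise

def sample_group(g, n):
    y = rng.choice([-1.0, 1.0], size=n)
    x = y[:, None] * mu + groups[g]["scale"] * rng.standard_normal((n, d)) / np.sqrt(d)
    return x, y

w = np.zeros(d)
lr = 0.5
history = []                                         # per-group accuracy over time
for t in range(3000):
    g = int(rng.uniform() >= groups[0]["p"])         # pick a sub-population
    x, y = sample_group(g, 1)
    margin = y[0] * (w @ x[0])
    w += lr * y[0] * x[0] / (1.0 + np.exp(margin))   # SGD step on the log-loss
    if (t + 1) % 500 == 0:
        accs = []
        for gi in range(2):
            xs, ys = sample_group(gi, 2000)
            accs.append(float(np.mean(np.sign(xs @ w) == ys)))
        history.append(accs)
```

Plotting `history` reveals the shifting preference: the low-noise majority group is fit first, while the high-noise minority group catches up on a slower timescale.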


An Analytical Theory of Power Law Spectral Bias in the Learning Dynamics of Diffusion Models

Wang, Binxu

arXiv.org Machine Learning

We developed an analytical framework for understanding how the learned distribution evolves during diffusion model training. Leveraging the Gaussian equivalence principle, we derived exact solutions for the gradient-flow dynamics of weights in one- or two-layer linear denoiser settings with arbitrary data. Remarkably, these solutions allowed us to derive the generated distribution in closed form, along with its KL divergence throughout training. These analytical results expose a pronounced power-law spectral bias: for both weights and distributions, the convergence time of a mode follows an inverse power law of its variance. Empirical experiments on both Gaussian and image datasets demonstrate that the power-law spectral bias remains robust even when using deeper or convolutional architectures. Our results underscore the importance of the data covariance in dictating the order and rate at which diffusion models learn different modes of the data, providing a potential explanation for why early stopping can leave incorrect details in image generative models.
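The inverse-power-law scaling can be illustrated with a toy diagonal linear denoiser, where gradient flow on E||Wx - x||^2 with x ~ N(0, diag(lams)) gives the closed-form mode trajectory w_i(t) = 1 - exp(-lam_i * t). This is a simplified stand-in for the paper's setting, not its exact model:

```python
import numpy as np

lams = np.array([4.0, 1.0, 0.25])        # mode variances, widely spread
ts = np.linspace(0.0, 40.0, 4001)        # time grid, step 0.01

def mode(lam, t):
    # Gradient-flow solution for a diagonal mode: w(t) = 1 - exp(-lam * t)
    return 1.0 - np.exp(-lam * t)

def convergence_time(lam, frac=0.99):
    # First time at which the mode reaches `frac` of its target value.
    return ts[np.argmax(mode(lam, ts) >= frac)]

times = np.array([convergence_time(l) for l in lams])
# Inverse power law: times * lams is (up to grid resolution) the constant
# log(1/(1 - frac)) = log(100), regardless of the mode's variance.
```

High-variance modes converge first; the smallest-variance mode takes roughly lam_max / lam_min times longer, which is the mechanism behind early stopping missing low-variance details.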


Pioneer: Physics-informed Riemannian Graph ODE for Entropy-increasing Dynamics

Sun, Li, Zhang, Ziheng, Wang, Zixi, Wang, Yujie, Wan, Qiqi, Li, Hao, Peng, Hao, Yu, Philip S.

arXiv.org Artificial Intelligence

Dynamic interacting system modeling is important for understanding and simulating real-world systems. The system is typically described as a graph, in which multiple objects dynamically interact with each other and evolve over time. In recent years, graph Ordinary Differential Equations (ODEs) have received increasing research attention. While achieving encouraging results, existing solutions prioritize the traditional Euclidean space and neglect the intrinsic geometry of the system and physical laws, e.g., the principle of increasing entropy. These limitations motivate us to rethink system dynamics from a fresh perspective of Riemannian geometry and to pose the more realistic problem of physics-informed dynamic system modeling, considering the underlying geometry and physical laws for the first time. In this paper, we present a novel physics-informed Riemannian graph ODE for a wide range of entropy-increasing dynamic systems (termed Pioneer). In particular, we formulate a differential system on a Riemannian manifold, where a manifold-valued graph ODE is governed by the proposed constrained Ricci flow and a manifold-preserving Gyro-transform aware of the system geometry. Theoretically, we prove that entropy is non-decreasing under our formulation, obeying the physical laws. Empirical results show the superiority of Pioneer on real datasets.
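For intuition only, here is the simplest Euclidean graph ODE (heat diffusion dx/dt = -Lx) integrated with explicit Euler. The paper's manifold-valued, Ricci-flow-governed formulation is far richer; this toy merely shows a graph ODE with a monotone quantity (Dirichlet energy decreasing, total heat conserved), the Euclidean shadow of an entropy law:

```python
import numpy as np

A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)     # 4-cycle adjacency
L = np.diag(A.sum(1)) - A                     # graph Laplacian
x = np.array([1.0, 0.0, 0.0, 0.0])            # initial node features

dt = 0.05                                     # small enough for stability
energies = []
for _ in range(200):
    energies.append(x @ L @ x)                # Dirichlet energy
    x = x - dt * (L @ x)                      # Euler step of dx/dt = -L x
# Diffusion smooths the features: energy is non-increasing and the
# mean (total "heat") is exactly conserved.
```

Explicit Euler is stable here because dt times the largest Laplacian eigenvalue (4 for the 4-cycle) stays below 2.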


Dynamics of Transient Structure in In-Context Linear Regression Transformers

Carroll, Liam, Hoogland, Jesse, Farrugia-Roberts, Matthew, Murfet, Daniel

arXiv.org Artificial Intelligence

Modern deep neural networks display striking examples of rich internal computational structure. Uncovering principles governing the development of such structure is a priority for the science of deep learning. In this paper, we explore the transient ridge phenomenon: when transformers are trained on in-context linear regression tasks with intermediate task diversity, they initially behave like ridge regression before specializing to the tasks in their training distribution. This transition from a general solution to a specialized solution is revealed by joint trajectory principal component analysis. Further, we draw on the theory of Bayesian internal model selection to suggest a general explanation for the phenomena of transient structure in transformers, based on an evolving tradeoff between loss and complexity. We empirically validate this explanation by measuring the model complexity of our transformers as defined by the local learning coefficient.
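The "general solution" that the transformers transiently implement is ridge regression on the in-context examples. A minimal sketch of that predictor follows; the regularization value, data sizes, and variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 16                                  # task dimension, context length
w_true = rng.standard_normal(d)               # the task's regression vector
X = rng.standard_normal((n, d))               # in-context inputs
y = X @ w_true + 0.1 * rng.standard_normal(n) # in-context noisy targets

def ridge_predict(X, y, x_query, lam=0.1):
    """Predict y at x_query from the in-context pairs (X, y)."""
    d = X.shape[1]
    w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return x_query @ w_hat

x_q = rng.standard_normal(d)
pred = ridge_predict(X, y, x_q)
```

A specialized transformer instead exploits the finite set of training tasks; the transient ridge phenomenon is the drift from the former behavior to the latter.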


A Unified Perspective on the Dynamics of Deep Transformers

Castin, Valérie, Ablin, Pierre, Carrillo, José Antonio, Peyré, Gabriel

arXiv.org Artificial Intelligence

Transformers, which are state-of-the-art in most machine learning tasks, represent the data as sequences of vectors called tokens. This representation is then exploited by the attention function, which learns dependencies between tokens and is key to the success of Transformers. However, the iterative application of attention across layers induces complex dynamics that remain to be fully understood. To analyze these dynamics, we identify each input sequence with a probability measure and model its evolution as a Vlasov equation called the Transformer PDE, whose velocity field is non-linear in the probability measure. Our first set of contributions focuses on compactly supported initial data. We show the Transformer PDE is well-posed and is the mean-field limit of an interacting particle system, thus generalizing and extending previous analysis to several variants of self-attention: multi-head attention, L2 attention, Sinkhorn attention, Sigmoid attention, and masked attention, leveraging a conditional Wasserstein framework. In a second set of contributions, we are the first to study non-compactly supported initial conditions, focusing on Gaussian initial data. Again, for different types of attention, we show that the Transformer PDE preserves the space of Gaussian measures, which allows us to analyze the Gaussian case theoretically and numerically to identify typical behaviors. This Gaussian analysis captures the evolution of data anisotropy through a deep Transformer. In particular, we highlight a clustering phenomenon that parallels previous results in the non-normalized discrete case.
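A discrete interacting-particle analogue of such attention dynamics is easy to simulate: each token drifts toward a softmax-weighted mean of all tokens. The projection back to the unit sphere and the single-orthant initialization below are illustrative choices (the latter avoids long-lived multi-cluster metastable states), not the paper's exact model, but they reproduce the clustering phenomenon:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, beta, dt = 16, 3, 4.0, 0.1
x = np.abs(rng.standard_normal((n, d)))         # tokens in one orthant
x /= np.linalg.norm(x, axis=1, keepdims=True)   # project to unit sphere

def mean_cosine(x):
    sims = x @ x.T
    return (sims.sum() - n) / (n * (n - 1))     # average off-diagonal similarity

c0 = mean_cosine(x)
for _ in range(500):
    logits = beta * (x @ x.T)                   # pairwise attention scores
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)           # row-wise softmax
    x = x + dt * (w @ x)                        # drift toward attended mean
    x /= np.linalg.norm(x, axis=1, keepdims=True)
c1 = mean_cosine(x)                             # tokens cluster: c1 approaches 1
```

Iterating the attention map collapses the tokens toward a single direction, the discrete counterpart of the clustering highlighted in the abstract.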


Unified Inverse Dynamics of Modular Serial Mechanical Systems with Application to Soft Robotics

Pustina, Pietro, Della Santina, Cosimo, De Luca, Alessandro

arXiv.org Artificial Intelligence

The robotics field has been witnessing a progressive departure from classic robotic systems composed of serial/stiff links interconnected by simple rigid joints. Novel robotic concepts, e.g., soft robots, often maintain a series-like structure, but their mechanical modules exhibit complex and unconventional articulation patterns. Research into efficient recursive formulations of the dynamic models for subclasses of these systems has been extremely active in the past decade. Yet, as of today, no single recursive inverse dynamics algorithm can describe the behavior of all these systems. This paper addresses this challenge by proposing a new iterative formulation based on Kane's equations. Its computational complexity is optimal, i.e., linear in the number of modules. While the proposed formulation is not claimed to be necessarily more efficient than state-of-the-art techniques for specific subclasses of robots, we illustrate its usefulness in the modeling of different complex systems. We propose two new models of soft robots: (i) a class of pneumatically actuated soft arms that deform along their cross-sectional area, and (ii) a piecewise strain model with Gaussian functions.
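For a flavor of what linear-time recursive inverse dynamics looks like, here is a toy two-pass algorithm for a serial chain of point masses with prismatic joints along a vertical line. This is a deliberately simplified illustration, not the paper's Kane-equation formulation, but it shows the characteristic O(n) structure: a forward pass propagating kinematics, a backward pass accumulating forces:

```python
G = 9.81  # gravitational acceleration, m/s^2

def inverse_dynamics(m, qdd):
    """m[i]: mass of module i; qdd[i]: acceleration of joint i relative
    to its parent. Returns the force each prismatic joint must exert.
    Both passes are single loops, so the cost is linear in the modules."""
    n = len(m)
    # Forward pass: absolute acceleration of each module.
    a, acc = [], 0.0
    for i in range(n):
        acc += qdd[i]
        a.append(acc)
    # Backward pass: joint i supports every module outboard of it.
    f, running = [0.0] * n, 0.0
    for i in reversed(range(n)):
        running += m[i] * (a[i] + G)
        f[i] = running
    return f
```

In the static case (all `qdd` zero) the base joint simply carries the total weight of the chain, a quick sanity check on the recursion.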