Generative Modeling by Value-Driven Transport
Moreno-Muñoz, Pablo, Müller, Adrian, Neu, Gergely
We propose a new framework for generative modeling based on a discrete-time stochastic control formulation of measure transport. Adapting classic results from control theory, we formulate our problem as a linear program whose dual variables correspond to the \emph{optimal value function} of the control problem, which directly encodes the optimal control policy. Exploiting this LP formulation, we develop an efficient simulation-free primal-dual algorithm for computing approximately optimal value functions and the associated \emph{value-driven transport} (VDT) policies which approximate the true optimal policy. We show that well-trained VDT policies enjoy numerous favorable properties in comparison with other state-of-the-art methods based on flows, diffusions, or Schrödinger bridges: they lead to straight transport paths which can be simulated quickly and robustly, and can be enhanced in all the same ways as diffusion and flow-based models (e.g., conditional generation, classifier-free guidance, and unpaired data-to-data translation are all easy to incorporate). We evaluate our methodology in a range of experiments, with results that indicate strong performance and good potential for scalability.
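To make the link between LP constraints, value functions, and policies concrete, here is a minimal sketch on a toy finite problem. It is not the paper's algorithm or its continuous transport setting: the 4-state chain, its costs, the discount factor, and all function names are hypothetical. It relies only on the classical fact that, for a discounted cost-minimizing control problem, the optimal value function satisfies the Bellman inequalities V(s) <= c(s, a) + gamma * E[V(s')] for every action, with equality at an optimal action, so the optimal policy can be read off greedily from V.

```python
GAMMA = 0.9                      # hypothetical discount factor
N_STATES, GOAL = 4, 3            # hypothetical chain 0 -> 1 -> 2 -> 3
ACTIONS = ("advance", "wait")

def transitions(s, a):
    """Return [(probability, next_state)] for the toy chain."""
    if s == GOAL or a == "wait":
        return [(1.0, s)]
    return [(0.9, s + 1), (0.1, s)]

def cost(s, a):
    """Per-step cost: free at the goal, otherwise 1.0 to advance, 0.5 to wait."""
    if s == GOAL:
        return 0.0
    return 1.0 if a == "advance" else 0.5

def q_value(V, s, a):
    """Right-hand side of the Bellman inequality for the pair (s, a)."""
    return cost(s, a) + GAMMA * sum(p * V[s2] for p, s2 in transitions(s, a))

def value_iteration(tol=1e-12):
    """Iterate the Bellman operator until successive iterates agree to tol."""
    V = [0.0] * N_STATES
    while True:
        V_new = [min(q_value(V, s, a) for a in ACTIONS) for s in range(N_STATES)]
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            return V_new
        V = V_new

V = value_iteration()
# The value function encodes the policy: act greedily with respect to V.
policy = [min(ACTIONS, key=lambda a: q_value(V, s, a)) for s in range(N_STATES)]
# Slack of each LP-style constraint V(s) <= c(s, a) + gamma * E[V(s')].
slack = [[q_value(V, s, a) - V[s] for a in ACTIONS] for s in range(N_STATES)]
```

In this sketch every slack is nonnegative up to numerical tolerance, and the action with zero slack in each state is the one the greedy policy selects.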
A Theoretical Framework for LLM Fine-tuning Using Early Stopping for Non-random Initialization
Sun, Zexuan, Raskutti, Garvesh
In the era of large language models (LLMs), fine-tuning pretrained models has become ubiquitous. Yet the theoretical underpinning remains an open question. A central question is why only a few epochs of fine-tuning are typically sufficient to achieve strong performance on many different tasks. In this work, we approach this question by developing a statistical framework, combining rigorous early stopping theory with the attention-based Neural Tangent Kernel (NTK) for LLMs, offering new theoretical insights on fine-tuning practices. Specifically, we formally extend classical NTK theory [Jacot et al., 2018] to non-random (i.e., pretrained) initializations and provide a convergence guarantee for attention-based fine-tuning. One key insight provided by the theory is that the convergence rate with respect to sample size is closely linked to the eigenvalue decay rate of the empirical kernel matrix induced by the NTK. We also demonstrate how the framework can be used to explain task vectors for multiple tasks in LLMs. Finally, experiments with modern language models on real-world datasets provide empirical evidence supporting our theoretical insights.
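The interplay between early stopping and kernel eigenvalues can be illustrated with a minimal sketch of kernel gradient descent. Everything here is an assumption for illustration: an RBF kernel stands in for the attention-based NTK, the data are synthetic, and the function names are hypothetical. In this setup the training residual evolves as r_{t+1} = (I - (lr/n) K) r_t, so the component along an eigenvector of K with eigenvalue lambda shrinks by a factor (1 - lr * lambda / n) per step; directions with large eigenvalues are fitted first, and stopping early limits how much of the small-eigenvalue (noise-dominated) directions is fitted.

```python
import math
import random

def rbf_kernel(x, z, bandwidth=0.5):
    """Stand-in kernel (assumption); the paper's kernel is the attention-based NTK."""
    return math.exp(-((x - z) ** 2) / (2 * bandwidth ** 2))

def kernel_gd_path(xs, ys, steps, lr):
    """Functional gradient descent on squared loss, returning all coefficient snapshots.

    The predictor is f_t(x) = sum_i alpha_t[i] * k(x_i, x), starting from alpha_0 = 0.
    """
    n = len(xs)
    K = [[rbf_kernel(a, b) for b in xs] for a in xs]
    alpha = [0.0] * n
    path = [alpha[:]]
    for _ in range(steps):
        fitted = [sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        alpha = [alpha[i] - lr / n * (fitted[i] - ys[i]) for i in range(n)]
        path.append(alpha[:])
    return path

def mse(alpha, centers, data_x, data_y):
    """Mean squared error of the kernel predictor on (data_x, data_y)."""
    total = 0.0
    for x, y in zip(data_x, data_y):
        pred = sum(a * rbf_kernel(c, x) for a, c in zip(alpha, centers))
        total += (pred - y) ** 2
    return total / len(data_x)

def sample(n, rng):
    """Synthetic regression data: noisy sine curve (assumption)."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [math.sin(3 * x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = sample(30, rng)
val_x, val_y = sample(30, rng)

path = kernel_gd_path(train_x, train_y, steps=200, lr=1.0)
train_err = [mse(a, train_x, train_x, train_y) for a in path]
val_err = [mse(a, train_x, val_x, val_y) for a in path]
# Early stopping rule used here: keep the iterate with the lowest held-out error.
best_step = min(range(len(path)), key=val_err.__getitem__)
```

With `lr=1.0` and a kernel satisfying k(x, x) = 1, the eigenvalues of K / n lie in [0, 1], so the training error cannot increase from one step to the next; the held-out error has no such guarantee, which is why the stopping time is chosen on held-out data in this sketch.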
A Appendix
In the appendix, we present the following results. Appendix A.1 summarizes the main notation used in this paper. Appendices A.2 to A.9 contain the proofs of all our theoretical results. Appendix A.10 presents the overall training procedures (e.g., pseudocode) of our proposed DINO-INIT and DINO-TRAIN algorithms, as well as the limitations of our work.
Assume that all the parameters of $f(\cdot)$ follow a standard normal distribution. In the limit as the layer width $d \to \infty$, the output function of the distribution-informed neural network $f(x)$ in Eq. (5) at initialization is an i.i.d. centered Gaussian process, i.e., $f(\cdot) \sim \mathcal{N}(0, K)$. Using the definition of the distribution kernel in Eq. (6), we have $K$ [...]
It is shown in [4] that the key difference between the NNGP kernel and the NTK is that the NTK is generated by a fully-trained neural network, whereas the NNGP kernel is produced by a weakly-trained neural network.
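The statement that a randomly initialized network behaves like a centered Gaussian process with kernel K can be checked numerically in a simpler setting. The sketch below is an illustration under assumptions, not the distribution-informed network of Eq. (5) or the distribution kernel of Eq. (6): it uses a plain one-hidden-layer ReLU network f(x) = d^{-1/2} sum_j v_j relu(w_j . x) with all weights drawn from N(0, 1), for which the covariance E[f(x) f(z)] has the known closed form (||x|| ||z|| / (2 pi)) (sin(theta) + (pi - theta) cos(theta)), where theta is the angle between x and z. All function names and the chosen inputs are hypothetical.

```python
import math
import random

def relu_nngp_kernel(x, z):
    """Closed-form E_w[relu(w.x) * relu(w.z)] for w ~ N(0, I)."""
    nx = math.sqrt(sum(a * a for a in x))
    nz = math.sqrt(sum(a * a for a in z))
    cos_t = sum(a * b for a, b in zip(x, z)) / (nx * nz)
    cos_t = max(-1.0, min(1.0, cos_t))  # guard against rounding outside [-1, 1]
    theta = math.acos(cos_t)
    return nx * nz * (math.sin(theta) + (math.pi - theta) * cos_t) / (2 * math.pi)

def random_network_outputs(inputs, width, rng):
    """Evaluate one freshly initialized network on every input."""
    dim = len(inputs[0])
    outs = [0.0] * len(inputs)
    for _ in range(width):
        w = [rng.gauss(0, 1) for _ in range(dim)]  # hidden-layer weights
        v = rng.gauss(0, 1)                        # output weight
        for i, x in enumerate(inputs):
            pre = sum(a * b for a, b in zip(w, x))
            outs[i] += v * max(pre, 0.0)
    return [o / math.sqrt(width) for o in outs]

def empirical_covariance(inputs, width=64, n_networks=4000, seed=0):
    """Monte Carlo estimate of E[f(x_i) f(x_j)] over random initializations."""
    rng = random.Random(seed)
    n = len(inputs)
    cov = [[0.0] * n for _ in range(n)]
    for _ in range(n_networks):
        f = random_network_outputs(inputs, width, rng)
        for i in range(n):
            for j in range(n):
                cov[i][j] += f[i] * f[j] / n_networks
    return cov

inputs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0]]
empirical = empirical_covariance(inputs)
exact = [[relu_nngp_kernel(x, z) for z in inputs] for x in inputs]
```

For this toy network the covariance identity holds at any width because the output weights are independent with zero mean and unit variance; the width only matters for how close the output distribution is to Gaussian.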