parameter space
Understanding Adam Requires Better Rotation Dependent Assumptions
Despite its widespread adoption, Adam's advantage over Stochastic Gradient Descent (SGD) lacks a comprehensive theoretical explanation. This paper investigates Adam's sensitivity to rotations of the parameter space. We observe that Adam's performance in training transformers degrades under random rotations of the parameter space, indicating a crucial sensitivity to the choice of basis in practice. This reveals that conventional rotation-invariant assumptions are insufficient to capture Adam's advantages theoretically. To better understand the rotation-dependent properties that benefit Adam, we also identify structured rotations that preserve or even enhance its empirical performance. We then examine the rotation-dependent assumptions in the literature and find that they fall short in explaining Adam's behaviour across various rotation types. In contrast, we verify the orthogonality of the update as a promising indicator of Adam's basis sensitivity, suggesting it may be the key quantity for developing rotation-dependent theoretical frameworks that better explain its empirical success.
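As a toy illustration of the basis sensitivity described in this abstract (a sketch under simplifying assumptions, not the paper's experimental setup), the snippet below compares a plain SGD step with the first bias-corrected Adam step under a random orthogonal change of basis. The function names `sgd_step` and `adam_first_step` are hypothetical and introduced only for this example.

```python
import numpy as np

def sgd_step(grad, lr=0.1):
    # Plain SGD update: linear in the gradient.
    return -lr * grad

def adam_first_step(grad, lr=0.1, eps=1e-8):
    # At t = 1 with bias correction, m_hat = g and v_hat = g**2,
    # so the update is -lr * g / (|g| + eps), i.e. roughly -lr * sign(g).
    return -lr * grad / (np.sqrt(grad ** 2) + eps)

rng = np.random.default_rng(0)

# Random orthogonal matrix Q obtained from a QR decomposition (assumed basis change).
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
g = rng.standard_normal(4)

# If theta' = Q theta and L'(theta') = L(Q^T theta'), the gradient becomes Q g.
g_rot = Q @ g

# An optimizer is rotation-equivariant if stepping in the rotated basis
# gives the same result as rotating the step taken in the original basis.
sgd_equivariant = np.allclose(sgd_step(g_rot), Q @ sgd_step(g))
adam_equivariant = np.allclose(adam_first_step(g_rot), Q @ adam_first_step(g))

print("SGD equivariant:", sgd_equivariant)
print("Adam equivariant:", adam_equivariant)
```

Because the SGD step is linear in the gradient, it commutes with Q, whereas the coordinate-wise normalization in Adam generally does not. This is the basic reason a rotation-invariant assumption cannot separate the two methods.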
How to Learn a Star: Binary Classification with Starshaped Polyhedral Sets
We consider binary classification restricted to a class of continuous piecewise linear functions whose decision boundaries are (possibly nonconvex) starshaped polyhedral sets, supported on a fixed polyhedral simplicial fan. We investigate the expressivity of these function classes and describe the combinatorial and geometric structure of the loss landscape, most prominently the sublevel sets, for two loss functions: the 0/1-loss (discrete loss) and a log-likelihood loss function. In particular, we give explicit bounds on the VC dimension of this model, and concretely describe the sublevel sets of the discrete loss as chambers in a hyperplane arrangement. For the log-likelihood loss, we give sufficient conditions for the optimum to be unique, and describe the geometry of the optimum when varying the rate parameter of the underlying exponential probability distribution.
Can MLLMs Absorb Math Reasoning Abilities from LLMs as Free Lunch?
Math reasoning has been one crucial ability of large language models (LLMs), where significant advancements have been achieved in recent years. However, most efforts focus on LLMs by curating high-quality annotation data and intricate training (or inference) paradigms, while the math reasoning performance of multi-modal LLMs (MLLMs) still lags behind. Since the MLLM typically consists of an LLM and a vision block, we wonder: Can MLLMs directly absorb math reasoning abilities from off-the-shelf math LLMs without tuning? Recent model-merging approaches may offer insights into this question. However, they overlook the alignment between the MLLM and LLM, where we find that there is a large gap between their parameter spaces, resulting in lower performance. Our empirical evidence reveals two key factors behind this issue: the identification of crucial reasoning-associated layers in the model and the mitigation of the gaps in parameter space. Based on the empirical insights, we propose IP-Merging, which first Identifies the reasoning-associated parameters in both the MLLM and the Math LLM, then Projects them into the subspace of the MLLM to maintain the alignment, and finally merges parameters in this subspace. IP-Merging is a tuning-free approach since parameters are directly adjusted. Extensive experiments demonstrate that our IP-Merging method can enhance the math reasoning ability of MLLMs directly from Math LLMs without compromising their other capabilities.
Uni-LoRA: One Vector is All You Need
Low-Rank Adaptation (LoRA) has become the de facto parameter-efficient fine-tuning (PEFT) method for large language models (LLMs) by constraining weight updates to low-rank matrices. Recent works such as Tied-LoRA, VeRA, and VB-LoRA push efficiency further by introducing additional constraints to reduce the trainable parameter space. In this paper, we show that the parameter space reduction strategies employed by these LoRA variants can be formulated within a unified framework, Uni-LoRA, where the LoRA parameter space, flattened as a high-dimensional vector space R^D, can be reconstructed through a projection from a subspace R^d, with d << D. We demonstrate that the fundamental difference among various LoRA methods lies in the choice of the projection matrix, P ∈ R^{D×d}. Most existing LoRA variants rely on layer-wise or structure-specific projections that limit cross-layer parameter sharing, thereby compromising parameter efficiency. In light of this, we introduce an efficient and theoretically grounded projection matrix that is isometric, enabling global parameter sharing and reducing computation overhead. Furthermore, under the unified view of Uni-LoRA, this design requires only a single trainable vector to reconstruct LoRA parameters for the entire LLM -- making Uni-LoRA both a unified framework and a "one-vector-only" solution. Extensive experiments on GLUE, mathematical reasoning, and instruction tuning benchmarks demonstrate that Uni-LoRA achieves state-of-the-art parameter efficiency while outperforming or matching prior approaches in predictive performance.
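The reconstruction of a D-dimensional LoRA parameter vector from a d-dimensional trainable vector through an isometric projection can be sketched as follows. This is a minimal sketch under stated assumptions: the abstract does not specify how P is built, so the random one-nonzero-per-row construction with column normalization used here is just one simple way to obtain P^T P = I, and the names `make_isometric_projection` and `project` are hypothetical.

```python
import numpy as np

def make_isometric_projection(D, d, seed=0):
    """Assign each of the D coordinates to one of d slots (assumed construction).

    Returns the slot index per coordinate and a per-coordinate scale such that
    the implied D x d matrix P has orthonormal columns.
    """
    rng = np.random.default_rng(seed)
    # Guarantee that every slot is used at least once, then fill the rest at random.
    assignment = np.concatenate([np.arange(d), rng.integers(0, d, size=D - d)])
    rng.shuffle(assignment)
    counts = np.bincount(assignment, minlength=d)
    # Column j has counts[j] nonzeros, each equal to 1 / sqrt(counts[j]).
    scale = 1.0 / np.sqrt(counts[assignment])
    return assignment, scale

def project(theta_d, assignment, scale):
    # Computes P @ theta_d without materializing the D x d matrix P.
    return theta_d[assignment] * scale

D, d = 64, 8
assignment, scale = make_isometric_projection(D, d)

theta_d = np.arange(1.0, d + 1.0)   # the single trainable vector
theta_D = project(theta_d, assignment, scale)

# Dense P, built only to check the isometry property P^T P = I.
P = np.zeros((D, d))
P[np.arange(D), assignment] = scale
```

In this sketch the full vector `theta_D` would then be reshaped into the per-layer LoRA factors, and only `theta_d` would receive gradients. Since each row of P has a single nonzero, the projection costs O(D) time and stores only the index and scale arrays.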
Variational Transdimensional Inference
The expressiveness of flow-based models combined with stochastic variational inference (SVI) has expanded the application of optimization-based Bayesian inference to highly complex problems. However, despite the importance of multi-model Bayesian inference for problems defined on a transdimensional joint model and parameter space, such as Bayesian structure learning, flow-based SVI has been limited to problems defined on a fixed-dimensional parameter space. We introduce CoSMIC normalizing flows (COntextually-Specified Masking for Identity-mapped Components), an extension to neural autoregressive conditional normalizing flow architectures that enables use of a single flow-based variational density for inference over a transdimensional (multi-model) conditional target distribution. We propose a combined stochastic variational transdimensional inference (VTI) approach to training CoSMIC flows using ideas from Bayesian optimization and Monte Carlo gradient estimation. Numerical experiments show the performance of VTI on challenging problems that scale to high-cardinality model spaces.
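The idea of masking components so that they are identity-mapped can be illustrated with a single elementwise affine flow layer. This is a hedged sketch of the general masking principle only, not the CoSMIC architecture itself: the function `masked_affine` and its parameters are hypothetical, and in a real conditional flow the shift and log-scale would come from a neural network conditioned on the model indicator.

```python
import numpy as np

def masked_affine(z, shift, log_scale, mask):
    """Elementwise affine transform applied only where mask is True.

    Components with mask == False are passed through unchanged (identity map)
    and contribute nothing to the log-determinant of the Jacobian.
    """
    x = np.where(mask, z * np.exp(log_scale) + shift, z)
    log_det = np.sum(np.where(mask, log_scale, 0.0))
    return x, log_det

# A model that uses components 0 and 2 only; component 1 is inactive.
mask = np.array([True, False, True])
z = np.array([1.0, 2.0, 3.0])
shift = np.array([0.5, 9.0, -1.0])
log_scale = np.array([0.0, 5.0, np.log(2.0)])

x, log_det = masked_affine(z, shift, log_scale, mask)
```

Because inactive components are left untouched, a single fixed-width flow of this kind can represent densities over parameter vectors of different effective dimension, with the mask selecting which coordinates belong to the current model.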
Model Merging on Loss Landscape: A Geometry Perspective
Lu, Juanwu, Bhaskar, Anand, Axelrod, Brian, Tolstaya, Ekaterina, Emrich, Tristan
Model merging offers a promising avenue for knowledge integration and parallel development without retraining. Yet, existing methods either ignore the geometry of the loss landscape or rely on intractable full-space Hessian approximations. We propose EpiMer, a framework that casts model merging as solving the Fréchet mean on a Riemannian manifold and restricts the computation to a low-rank subspace spanned by the task vectors. With the expected Hessian as the metric, we reveal a connection between local curvature and epistemic uncertainty of the parameters. Our theoretical analysis decomposes the merging error bound into the subspace Fréchet variance and the residual energy, and provides a closed-form characterization of when curvature-aware merging provably outperforms flat-geometry methods. In addition, our framework unifies both curvature-aware methods and recent spectral methods as special cases of the subspace Fréchet mean with different geometric metrics. Merging fine-tuned CLIP-ViT models on eight image classification tasks, Epistemic Merging strictly outperforms the baselines on all three CLIP-ViT backbones at matched rank, improving the across-task average accuracy and worst-task accuracy on every backbone.
Don't Stop Me Yet: Sampling Loss Minima via Dissipative Riemannian Mechanics
Jacobsen, Albert Kjøller, Jakobsen, Leo Uhre, Gegenfurtner, Johanna Marie, Arvanitidis, Georgios
The minima of modern neural network loss functions are typically not isolated; rather, they form connected components of reparameterization-invariant solutions on the training data. Analytically characterizing these solutions is a hard problem, but sampling approaches are feasible. By construction, existing methods either spread over low-loss regions, and thus do not sample reparameterization-invariant solutions exactly, or are inherently local, which limits exploration of other minima valleys. We propose sampling such reparameterization-invariant models using a dynamical system based on kinetic energy, subject to a gravitational pull and a friction term that dissipates energy from the system. Our proposed sampler, DIMS, is guaranteed to sample exactly from the minimum level sets and depends on physically motivated hyperparameters, which allow control over the exploration capabilities of the sampler. We consider uncertainty quantification in Bayesian inference as the motivating problem and observe improved performance compared to previously proposed approaches.