Appendix
The literature on the geometric properties of Riemannian manifolds is immense, and we cannot hope to survey it here; for an appetizer, we refer the reader to Burago et al. [93] and Lee [94] and the references therein. On the other hand, as noted, it is only recently that the long-run, non-asymptotic behavior of optimization algorithms on Riemannian manifolds (even smooth ones) has attracted significant interest. For concision, we defer a detailed exposition of the remaining recent results to Appendix A of the paper's supplement. In Appendix B we also collect several motivating examples that can be solved by Riemannian min-max optimization. Many application problems can be formulated as the minimization or maximization of a smooth function over a Riemannian manifold, which has triggered a line of research extending classical first-order and second-order methods to the Riemannian setting, typically with asymptotic convergence to first-order stationary points [95].
Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
Adam has been shown to outperform gradient descent on large language models by a larger margin than on other tasks, but it is unclear why. We show that a key factor in this performance gap is the heavy-tailed class imbalance found in language tasks. When trained with gradient descent, the loss of infrequent words decreases more slowly than the loss of frequent ones. This leads to a slow decrease in the average loss, as most samples come from infrequent words. Adam and sign-based methods, on the other hand, are less sensitive to this problem. To establish that this behavior is caused by class imbalance, we show empirically that it can be reproduced across architectures and data types, on language transformers, vision CNNs, and linear models. For a linear model with cross-entropy loss, we show that class imbalance leads to imbalanced, correlated gradients and Hessians that have been hypothesized to benefit Adam. We also prove that, in continuous time, gradient descent converges slowly on low-frequency classes while sign descent does not.
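The gap between gradient descent and sign-based updates under heavy-tailed class imbalance can be illustrated with a toy experiment. The sketch below is a hypothetical illustration, not the paper's experimental setup: it fits a linear softmax model to synthetic data with Zipf-like class frequencies and compares per-class losses after training with gradient descent versus sign descent. All sizes, step sizes, and the data distribution are assumptions chosen for demonstration.

```python
# Minimal, hypothetical sketch (not the paper's setup): gradient descent vs.
# sign descent on a linear softmax model with Zipf-like (heavy-tailed) class
# frequencies. All sizes, step sizes, and the data distribution are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_classes, dim, n_samples = 50, 20, 5000

freqs = 1.0 / np.arange(1, n_classes + 1)      # Zipf-like class frequencies
freqs /= freqs.sum()
y = rng.choice(n_classes, size=n_samples, p=freqs)
class_means = rng.normal(size=(n_classes, dim))
X = class_means[y] + rng.normal(size=(n_samples, dim))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def grad(W):
    p = softmax(X @ W.T)                        # (n_samples, n_classes)
    p[np.arange(n_samples), y] -= 1.0
    return p.T @ X / n_samples                  # same shape as W

def per_class_loss(W):
    p = softmax(X @ W.T)
    nll = -np.log(p[np.arange(n_samples), y] + 1e-12)
    return np.array([nll[y == c].mean() for c in range(n_classes)])

for name, step in [("gradient descent", lambda g: 0.1 * g),
                   ("sign descent", lambda g: 0.01 * np.sign(g))]:
    W = np.zeros((n_classes, dim))
    for _ in range(500):
        W -= step(grad(W))
    loss = per_class_loss(W)
    print(f"{name}: frequent-class loss {loss[:5].mean():.3f}, "
          f"rare-class loss {loss[-5:].mean():.3f}")
```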
Entrywise Convergence of Iterative Methods for Eigenproblems
Several problems in machine learning, statistics, and other fields rely on computing eigenvectors. For large-scale problems, the computation of these eigenvectors is typically performed via iterative schemes such as subspace iteration or Krylov methods. While there is a classical and comprehensive analysis of subspace convergence guarantees with respect to the spectral norm, in many modern applications other notions of subspace distance are more appropriate.
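To make the distinction between spectral-norm and entrywise notions of subspace error concrete, the following sketch runs subspace iteration on a symmetric test matrix and tracks both distances against a dense eigendecomposition. The matrix, block size, and iteration count are illustrative assumptions, not the setting analyzed in the paper.

```python
# Illustrative sketch (assumed setup): subspace iteration for the top-k
# eigenvectors of a symmetric PSD matrix, tracking a spectral-norm subspace
# distance and an entrywise (max-norm) distance to a dense eigensolver.
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
U = np.linalg.qr(rng.normal(size=(n, n)))[0]
eigvals = 1.0 / np.arange(1, n + 1)             # decaying spectrum with clear gaps
A = (U * eigvals) @ U.T                         # symmetric PSD test matrix

_, V = np.linalg.eigh(A)
V_top = V[:, -k:]                               # reference top-k eigenvectors

Q = np.linalg.qr(rng.normal(size=(n, k)))[0]    # random orthonormal start
for it in range(201):
    Q, _ = np.linalg.qr(A @ Q)                  # one step of subspace iteration
    P_diff = Q @ Q.T - V_top @ V_top.T          # difference of orthogonal projectors
    spec_err = np.linalg.norm(P_diff, 2)        # spectral-norm subspace distance
    entry_err = np.abs(P_diff).max()            # entrywise (max-norm) distance
    if it % 50 == 0:
        print(f"iter {it:3d}  spectral {spec_err:.2e}  entrywise {entry_err:.2e}")
```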
Appendix: A Probabilistic State Space Model for Joint Inference from Differential Equations and Data
Nicholas Krämer, University of Tübingen, Tübingen, Germany
This section provides detailed information about the state-space model and approximate Gaussian inference therein. Appendix A.1 defines the augmented state-space model that formalizes the dynamics of the Gauss-Markov processes introduced in Section 3.1. Appendix A.2 provides the equations for the prediction and update steps of the extended Kalman filter in this setup, which is described in Section 3.4 (in particular, Algorithm 1).

A.1 Augmented state-space model

Section 3 describes the joint inference of a latent process u(t) from differential equations and data. The measurement models are given in Eq. (6) (for observed data) and in Eq. (7) (for ODE measurements). In the experiments presented in Sections 5.2 and 5.3, we model the latent contact rate β(t) as a Matérn-3 process. More details on the use of integrated Wiener processes in probabilistic ODE solvers can be found in, for instance, the work by Kersting et al. [5].
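As a generic reference point for the prediction and update steps mentioned above, the following is a textbook extended-Kalman-filter sketch for a discretized Gauss-Markov prior with a possibly nonlinear measurement model. It is not the paper's exact augmented model: the transition matrices A and Q, the measurement function h, its Jacobian H, and the noise covariance R are placeholders; for an ODE measurement, z would typically be a zero pseudo-observation of the ODE residual.

```python
# Generic extended-Kalman-filter prediction/update sketch (placeholders only,
# not the paper's exact augmented model). A, Q: discretized transition model of
# the Gauss-Markov prior; h, H, R: (possibly nonlinear) measurement model.
import numpy as np

def ekf_predict(m, P, A, Q):
    """Prediction step: propagate mean and covariance through the prior."""
    return A @ m, A @ P @ A.T + Q

def ekf_update(m, P, z, h, H, R):
    """Update step: condition on a measurement z = h(x) + Gaussian noise."""
    Hm = H(m)                                   # Jacobian of h at the predicted mean
    S = Hm @ P @ Hm.T + R                       # innovation covariance
    K = P @ Hm.T @ np.linalg.inv(S)             # Kalman gain
    m_new = m + K @ (z - h(m))
    P_new = P - K @ S @ K.T
    return m_new, P_new
```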
Re-assembling the past: The RePAIR dataset and benchmark for real world 2D and 3D puzzle solving
This paper proposes the RePAIR dataset, a challenging benchmark for testing modern computational and data-driven methods on puzzle-solving and reassembly tasks. Our dataset has unique properties that are uncommon in current benchmarks for 2D and 3D puzzle solving. The fragments and fractures are realistic, caused by the collapse of a fresco during a World War II bombing at the Pompeii archaeological park. The fragments are also eroded and have missing pieces with irregular shapes and different dimensions, further challenging the reassembly algorithms. The dataset is multi-modal, providing high-resolution images with characteristic pictorial elements, detailed 3D scans of the fragments, and metadata annotated by the archaeologists. Ground truth has been generated through several years of unceasing fieldwork, including the excavation and cleaning of each fragment, followed by manual puzzle solving by archaeologists of a subset of approx.
No-regret learning in games with noisy feedback: Faster rates and adaptivity via learning rate separation
We examine the problem of regret minimization when the learner is involved in a continuous game with other optimizing agents: in this case, if all players follow a no-regret algorithm, it is possible to achieve significantly lower regret than in fully adversarial environments. We study this problem in the context of variationally stable games (a class of continuous games that includes all convex-concave and monotone games), and when the players only have access to noisy estimates of their individual payoff gradients. If the noise is additive, the game-theoretic and purely adversarial settings enjoy similar regret guarantees; however, if the noise is multiplicative, we show that the learners can, in fact, achieve constant regret. We achieve this faster rate via an optimistic gradient scheme with learning rate separation, that is, the method's extrapolation and update steps are tuned to different schedules, depending on the noise profile. Subsequently, to eliminate the need for delicate hyperparameter tuning, we propose a fully adaptive method that attains nearly the same guarantees as its non-adapted counterpart, while operating without knowledge of either the game or the noise profile.
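To make the learning-rate-separation idea concrete, here is a minimal sketch of a single-call optimistic gradient loop on a simple strongly monotone min-max game, with distinct step sizes for the extrapolation and update steps and a crude multiplicative noise model on the gradient oracle. The game, step sizes, and noise model are assumptions for illustration; the paper's exact scheme and tuning may differ.

```python
# Minimal sketch (assumptions throughout, not the paper's exact scheme):
# single-call optimistic gradient on the strongly monotone min-max game
# f(x, y) = 0.5*||x||^2 + x^T B y - 0.5*||y||^2, with separated step sizes
# for the extrapolation (gamma_lead) and update (gamma_base) steps.
import numpy as np

rng = np.random.default_rng(0)
d = 5
B = rng.normal(size=(d, d))

def noisy_field(x, y, noise=0.5):
    """Field whose subtraction performs descent on x and ascent on y, with multiplicative noise."""
    gx = x + B @ y                              # grad_x f
    gy = y - B.T @ x                            # -grad_y f
    scale = 1.0 + noise * rng.normal()
    return gx * scale, gy * scale

x, y = np.ones(d), np.ones(d)
gamma_lead, gamma_base = 0.1, 0.02              # separated learning rates
gx_prev, gy_prev = noisy_field(x, y)
for t in range(2000):
    # Extrapolation (leading) step: reuse the previous gradient estimate.
    x_half = x - gamma_lead * gx_prev
    y_half = y - gamma_lead * gy_prev
    # Update (base) step: one fresh gradient call at the extrapolated point.
    gx_prev, gy_prev = noisy_field(x_half, y_half)
    x = x - gamma_base * gx_prev
    y = y - gamma_base * gy_prev

print("distance to the (0, 0) equilibrium:", np.linalg.norm(np.concatenate([x, y])))
```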