Ajroldi, Niccolò
When, Where and Why to Average Weights?
Ajroldi, Niccolò, Orvieto, Antonio, Geiping, Jonas
Averaging checkpoints along the training trajectory is a simple yet powerful approach to improve the generalization performance of Machine Learning models and reduce training time. Motivated by these potential gains, and in an effort to fairly and thoroughly benchmark this technique, we present an extensive evaluation of averaging techniques in modern Deep Learning, which we perform using AlgoPerf \citep{dahl_benchmarking_2023}, a large-scale benchmark for optimization algorithms. We investigate whether weight averaging can reduce training time, improve generalization, and replace learning rate decay, as suggested by recent literature. Our evaluation across seven architectures and datasets reveals that averaging significantly accelerates training and yields considerable efficiency gains, at the price of a minimal implementation and memory cost, while mildly improving generalization across all considered workloads. Finally, we explore the relationship between averaging and learning rate annealing, and show how to optimally combine the two to achieve the best performance.
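As a concrete illustration, checkpoint averaging can be implemented in a few lines. The sketch below is a minimal, PyTorch-style uniform average over saved checkpoints; the function names and the uniform (SWA-style) window are illustrative assumptions, not the exact procedure benchmarked in the paper.

    import copy
    import torch

    def average_checkpoints(state_dicts):
        # Uniformly average a list of model state_dicts saved along the
        # training trajectory (a simple SWA-style average).
        avg = copy.deepcopy(state_dicts[0])
        for key in avg:
            if torch.is_floating_point(avg[key]):
                avg[key] = torch.stack(
                    [sd[key].float() for sd in state_dicts]
                ).mean(dim=0)
        return avg

    # Usage (illustrative): collect state_dicts during training, average them,
    # and load the result into the model before evaluation.
    # model.load_state_dict(average_checkpoints(saved_state_dicts))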
Loss Landscape Characterization of Neural Networks without Over-Parametrization
Islamov, Rustem, Ajroldi, Niccolò, Orvieto, Antonio, Lucchi, Aurelien
Optimization methods play a crucial role in modern machine learning, powering the remarkable empirical achievements of deep learning models. These successes are even more remarkable given the complex non-convex nature of the loss landscape of these models. Yet, ensuring the convergence of optimization methods requires specific structural conditions on the objective function that are rarely satisfied in practice. One prominent example is the widely recognized Polyak-Łojasiewicz (PL) inequality, which has gained considerable attention in recent years. However, validating such assumptions for deep neural networks entails substantial and often impractical levels of over-parametrization. In order to address this limitation, we propose a novel class of functions that can characterize the loss landscape of modern deep models without requiring extensive over-parametrization and can also include saddle points. Crucially, we prove that gradient-based optimizers possess theoretical guarantees of convergence under this assumption. Finally, we validate the soundness of our new function class through both theoretical analysis and empirical experimentation across a diverse range of deep learning models.
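For context, the PL inequality mentioned above asks that, for some constant $\mu > 0$, an objective $f$ with minimum value $f^*$ satisfies

\[
    \tfrac{1}{2}\,\|\nabla f(x)\|^2 \;\ge\; \mu\,\bigl(f(x) - f^*\bigr) \qquad \text{for all } x,
\]

a condition under which gradient descent converges linearly to the global minimum even without convexity. The function class proposed in the paper is designed to describe deep networks without such over-parametrization requirements, while still allowing saddle points and supporting convergence guarantees for gradient-based methods.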
Conformal Prediction Bands for Two-Dimensional Functional Time Series
Ajroldi, Niccolò, Diquigiovanni, Jacopo, Fontana, Matteo, Vantini, Simone
Functional data analysis (FDA) (Ramsay and Silverman 2005) is naturally suited to represent and model surface data, as it preserves their continuous nature and provides a rigorous mathematical framework. Among others, Zhou and Pan (2014) analyzed temperature surfaces, presenting two approaches for Functional Principal Component Analysis (FPCA) of functions defined on a non-rectangular domain; Porro-Muñoz et al. (2014) focus on image processing using FDA; and Rakêt (2010) proposed a novel regularization technique for Gaussian random fields on a rectangular domain, applied to 2D electrophoresis images. Another bivariate smoothing approach in a penalized regression framework was introduced by Ivanescu and Andrada (2013), allowing for the estimation of functional parameters of two-dimensional functional data. As shown by Gervini (2010), even mortality rates can be interpreted as two-dimensional functional data. Whereas all the reviewed works assume the observed functions to be realizations of i.i.d., or at least exchangeable, random objects, to the best of our knowledge there is no literature focusing on forecasting time-dependent two-dimensional functional data. In this work, we focus on time series of surfaces, representing them as two-dimensional Functional Time Series (FTS).
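To make the setting concrete, a schematic sketch of a split-conformal band for forecasts of discretized surfaces is given below; the variable names and the sup-norm nonconformity score are illustrative assumptions, not the specific band construction developed in the paper.

    import numpy as np

    def conformal_band_half_width(residuals, alpha=0.1):
        # residuals: array of shape (n_cal, n_x, n_y) with absolute forecast
        # errors on a calibration set of surfaces observed on a common grid.
        n_cal = residuals.shape[0]
        # Sup-norm nonconformity score of each calibration surface.
        scores = residuals.reshape(n_cal, -1).max(axis=1)
        # Conformal quantile with the usual (n_cal + 1) finite-sample correction.
        k = int(np.ceil((n_cal + 1) * (1 - alpha)))
        return np.sort(scores)[min(k, n_cal) - 1]

    # Usage (illustrative): band = forecast +/- conformal_band_half_width(
    #     np.abs(calibration_surfaces - calibration_forecasts), alpha=0.1)

Under exchangeability, bands of this kind enjoy finite-sample marginal coverage; the temporal dependence of functional time series is what makes the forecasting setting considered here more delicate.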