Pinto, Andrea
On Generalization Bounds for Neural Networks with Low Rank Layers
Pinto, Andrea, Rangamani, Akshay, Poggio, Tomaso
Deep learning has achieved remarkable success across a wide range of applications, including computer vision [2, 3], natural language processing [4, 5], decision-making in novel environments [6], and code generation [7], among others. Understanding the reasons behind the effectiveness of deep learning is a multifaceted challenge that involves questions about architectural choices, optimizer selection, and the types of inductive biases that can guarantee generalization. A long-standing question in this field is how deep learning finds solutions that generalize well. Good generalization performance by overparameterized models is not unique to deep learning: in linear models and kernel machines it can be explained by the implicit bias of learning algorithms towards low-norm solutions [8, 9]. In the case of deep learning, however, identifying the right implicit bias and obtaining generalization bounds that depend on it are still open questions. In recent years, Rademacher bounds have been developed to explain the complexity control induced by an important bias in deep network training: the minimization of weight matrix norms, which occurs due to explicit or implicit regularization [10, 11, 12, 13]. For rather general network architectures, Golowich et al. [14] showed that the Rademacher complexity scales linearly with the product of the Frobenius norms of the layer weight matrices. Although the associated bounds are usually orders of magnitude larger than the generalization gap for dense networks, very recent results by Galanti et al. [15] demonstrate that for networks with structural sparsity in their weight matrices, such as convolutional networks, norm-based Rademacher bounds approach non-vacuity.
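For context, norm-based Rademacher complexity bounds of the kind referenced above typically take the following schematic form for an $L$-layer network evaluated on $m$ samples with inputs of norm at most $B$ and layer weight matrices $W_1, \dots, W_L$ (the exact constants and depth dependence vary across results; the symbols here are illustrative and not taken from the paper):

$$ \mathfrak{R}_m(\mathcal{F}) \;\lesssim\; \frac{B\,\sqrt{L}\,\prod_{i=1}^{L} \|W_i\|_F}{\sqrt{m}} . $$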
The Fair Language Model Paradox
Pinto, Andrea, Galanti, Tomer, Balestriero, Randall
Large Language Models (LLMs) are widely deployed in real-world applications, yet little is known about their training dynamics at the token level. Evaluation typically relies on the aggregated training loss, measured at the batch level, which overlooks subtle per-token biases arising from (i) varying token-level dynamics and (ii) structural biases introduced by hyperparameters. While weight decay is commonly used to stabilize training, we reveal that it silently introduces performance biases detectable only at the token level. In fact, we show empirically, across different dataset sizes and model architectures, with model sizes ranging from 270M to 3B parameters, that as weight decay increases, low-frequency tokens are disproportionately depreciated. This is particularly concerning, as these neglected low-frequency tokens represent the vast majority of the token distribution in most languages, calling for novel regularization techniques that ensure fairness across all available tokens.
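As a hedged illustration of the kind of token-level measurement described above (not the paper's code: the toy vocabulary, Zipf-like sampling, and random stand-in logits are assumptions made here for a self-contained sketch), one can keep the per-token cross-entropy losses and compare frequency buckets instead of averaging over the batch:

# Minimal sketch: per-token loss bucketed by token frequency.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, num_tokens = 1000, 50_000

# Zipf-like token stream: low ids are frequent, high ids are rare.
weights = 1.0 / torch.arange(1, vocab_size + 1, dtype=torch.float)
targets = torch.multinomial(weights, num_tokens, replacement=True)
logits = torch.randn(num_tokens, vocab_size)  # stand-in for the analyzed model's outputs

# Per-token loss (reduction="none") keeps the signal that batch means hide.
per_token_loss = F.cross_entropy(logits, targets, reduction="none")

# Bucket tokens by empirical frequency and compare average losses.
counts = torch.bincount(targets, minlength=vocab_size)
freq_of_target = counts[targets]
median = freq_of_target.float().median()
rare = freq_of_target.float() < median
print("mean loss, frequent tokens:", per_token_loss[~rare].mean().item())
print("mean loss, rare tokens:    ", per_token_loss[rare].mean().item())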
How Neural Networks Learn the Support is an Implicit Regularization Effect of SGD
Beneventano, Pierfrancesco, Pinto, Andrea, Poggio, Tomaso
We investigate the ability of deep neural networks to identify the support of the target function. Our findings reveal that mini-batch SGD effectively learns the support in the first layer of the network by shrinking to zero the weights associated with irrelevant components of the input. In contrast, we demonstrate that while vanilla GD also approximates the target function, it requires an explicit regularization term to learn the support in the first layer. We prove that this property of mini-batch SGD is due to a second-order implicit regularization effect proportional to $\eta / b$ (the ratio of step size to batch size). Our results not only provide further evidence that implicit regularization has a significant impact on training dynamics, but also shed light on the structure of the features learned by the network. Additionally, they suggest that smaller batches enhance feature interpretability and reduce dependency on initialization.
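A minimal sketch of the kind of experiment suggested by this abstract, under assumptions made here (a toy two-layer network, a target depending only on the first k input coordinates, and hyperparameters chosen for illustration, none taken from the paper):

# Train with mini-batch SGD and inspect first-layer column norms.
import torch

torch.manual_seed(0)
d, k, n, width = 20, 3, 2048, 64          # input dim, support size, samples, hidden width
X = torch.randn(n, d)
y = torch.sin(X[:, :k].sum(dim=1, keepdim=True))   # depends on coordinates 0..k-1 only

model = torch.nn.Sequential(
    torch.nn.Linear(d, width), torch.nn.ReLU(), torch.nn.Linear(width, 1)
)
opt = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = torch.nn.MSELoss()

batch_size = 8                             # small batch, i.e. large eta/b
for step in range(5000):
    idx = torch.randint(0, n, (batch_size,))
    opt.zero_grad()
    loss_fn(model(X[idx]), y[idx]).backward()
    opt.step()

# Column j of the first-layer weight matrix multiplies input coordinate j;
# its norm indicates how much the network still "listens" to that coordinate.
W1 = model[0].weight.detach()              # shape (width, d)
col_norms = W1.norm(dim=0)
print("relevant coords  :", col_norms[:k])
print("irrelevant coords:", col_norms[k:].mean())

With small batches (large $\eta / b$), the column norms of the irrelevant coordinates are expected to shrink relative to the relevant ones, which is the support-identification effect the abstract describes.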
Privacy and Efficiency of Communications in Federated Split Learning
Zhang, Zongshun, Pinto, Andrea, Turina, Valeria, Esposito, Flavio, Matta, Ibrahim
Every day, large amounts of sensitive data are distributed across mobile phones, wearable devices, and other sensors. Traditionally, these enormous datasets have been processed on a single system, with complex models being trained to make valuable predictions. Distributed machine learning techniques such as Federated and Split Learning have recently been developed to better protect user data and privacy while ensuring high performance. Both of these distributed learning architectures have advantages and disadvantages. In this paper, we examine these tradeoffs and propose a new hybrid Federated Split Learning architecture that combines the efficiency and privacy benefits of both. Our evaluation demonstrates how our hybrid Federated Split Learning approach can lower the amount of processing power required by each client running a distributed learning system and reduce training and inference time while maintaining similar accuracy. We also discuss the resilience of our approach to deep learning privacy inference attacks and compare our solution to other recently proposed benchmarks.
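To make the split-learning half of such a hybrid concrete, the following sketch (illustrative only; the layer sizes, cut point, and optimizers are assumptions, not the paper's system) shows one client-server training step in which only the cut-layer activations and their gradients cross the network:

# Split learning step: client holds the front of the model, server the back.
import torch

client_front = torch.nn.Sequential(torch.nn.Linear(784, 128), torch.nn.ReLU())
server_back = torch.nn.Sequential(torch.nn.Linear(128, 10))
opt_client = torch.optim.SGD(client_front.parameters(), lr=0.1)
opt_server = torch.optim.SGD(server_back.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

x = torch.randn(32, 784)                 # a client's local batch
y = torch.randint(0, 10, (32,))

# Client forward pass up to the cut layer; only these activations are sent.
smashed = client_front(x)
smashed_srv = smashed.detach().requires_grad_()

# Server continues the forward pass and computes the loss.
logits = server_back(smashed_srv)
loss = loss_fn(logits, y)

# Server backward pass; only the gradient at the cut layer is returned.
opt_server.zero_grad()
loss.backward()
opt_server.step()

# Client finishes backpropagation with the returned gradient.
opt_client.zero_grad()
smashed.backward(smashed_srv.grad)
opt_client.step()

Federated averaging of the client-side fronts across many such clients would supply the federated half of the hybrid; that aggregation step is omitted here.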
The effectiveness of factorization and similarity blending
Pinto, Andrea, Camposampiero, Giacomo, Houmard, Loïc, Lundwall, Marc
Collaborative Filtering (CF) is a widely used technique that leverages past user preference data to identify behavioural patterns and exploit them to predict personalized recommendations. In this work, we present our review of different CF techniques in the context of the Computational Intelligence Lab (CIL) CF project at ETH Zürich. After evaluating the performance of the individual models, we show that blending factorization-based and similarity-based approaches can lead to a significant error decrease (-9.4%) relative to the best-performing stand-alone model. Moreover, we propose a novel stochastic extension of a similarity model, SCSR, which consistently reduces the asymptotic complexity of the original algorithm.
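As a hedged illustration of blending (the stand-in validation predictions and the least-squares mixing below are assumptions for a self-contained example, not the project's actual models or weights):

# Blend two recommenders by fitting mixing weights on held-out ratings.
import numpy as np

rng = np.random.default_rng(0)
n_val = 500
true_ratings = rng.uniform(1, 5, n_val)
# Stand-ins for validation predictions from the two stand-alone models.
pred_factorization = true_ratings + rng.normal(0, 0.9, n_val)
pred_similarity = true_ratings + rng.normal(0, 1.1, n_val)

# Least-squares blend: find weights minimizing validation error.
A = np.column_stack([pred_factorization, pred_similarity, np.ones(n_val)])
w, *_ = np.linalg.lstsq(A, true_ratings, rcond=None)
blended = A @ w

rmse = lambda p: np.sqrt(np.mean((p - true_ratings) ** 2))
print("factorization RMSE:", rmse(pred_factorization))
print("similarity RMSE:   ", rmse(pred_similarity))
print("blended RMSE:      ", rmse(blended))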