warm-up
FedAPA: Federated Learning with Adaptive Prototype Aggregation Toward Heterogeneous Wi-Fi CSI-based Crowd Counting
Guo, Jingtao, Mao, Yuyi, Ho, Ivan Wang-Hei
Wi-Fi channel state information (CSI)-based sensing provides a non-invasive, device-free approach for tasks such as human activity recognition and crowd counting, but large-scale deployment is hindered by the need for extensive site-specific training data. Federated learning (FL) offers a way to avoid raw data sharing but is challenged by heterogeneous sensing data and device resources. This paper proposes FedAPA, a collaborative Wi-Fi CSI-based sensing algorithm that uses an adaptive prototype aggregation (APA) strategy to assign similarity-based weights to peer prototypes, enabling adaptive client contributions and yielding a personalized global prototype for each client instead of a fixed-weight aggregation. During local training, we adopt a hybrid objective that combines classification learning with representation contrastive learning to align local and global knowledge. We provide a convergence analysis of FedAPA and evaluate it in a real-world distributed Wi-Fi crowd counting scenario with six environments and up to 20 people. The results show that our method outperforms multiple baselines in terms of accuracy, F1 score, mean absolute error (MAE), and communication overhead, with FedAPA achieving at least a 9.65% increase in accuracy, a 9% gain in F1 score, a 0.29 reduction in MAE, and a 95.94% reduction in communication overhead.
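The similarity-weighted prototype aggregation described above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function name `aggregate_prototypes`, the use of cosine similarity, and the softmax weighting are all assumptions standing in for the unspecified "similarity-based weights".

```python
import numpy as np

def aggregate_prototypes(local_proto, peer_protos):
    """Similarity-weighted aggregation of peer prototypes (illustrative sketch).

    local_proto:  (d,) array, the client's own class prototype.
    peer_protos:  (n, d) array, prototypes received from n peers.
    Returns a personalized global prototype for this client.
    """
    # Cosine similarity between the local prototype and each peer prototype.
    sims = peer_protos @ local_proto / (
        np.linalg.norm(peer_protos, axis=1) * np.linalg.norm(local_proto) + 1e-8
    )
    # Softmax turns similarities into adaptive aggregation weights:
    # peers whose prototypes resemble the local one contribute more.
    weights = np.exp(sims) / np.exp(sims).sum()
    return weights @ peer_protos
```

Because the weights depend on the receiving client's own prototype, each client ends up with its own personalized global prototype rather than one shared fixed-weight average.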
How to shovel snow without landing in the emergency room
Avoid injury and improve efficiency with tips from a physical therapist. Don't be a snow hero. For life's most essential resource, water knows a hundred ways to kill you if you're not careful. When it's not trying to drown you in its pools and coastlines during the summer, it shape-shifts to snow in the winter, piling up emergency room visits for those forced to shovel it.
Learning in Compact Spaces with Approximately Normalized Transformer
Franke, Jörg K. H., Spiegelhalter, Urs, Nezhurina, Marianna, Jitsev, Jenia, Hutter, Frank, Hefenbrock, Michael
The successful training of deep neural networks requires addressing challenges such as overfitting, numerical instabilities leading to divergence, and increasing variance in the residual stream. A common solution is to apply regularization and normalization techniques that usually require tuning additional hyperparameters. An alternative is to force all parameters and representations to lie on a hypersphere. This removes the need for regularization and increases convergence speed, but comes with additional costs. In this work, we propose a more holistic, approximate normalization via simple scalar multiplications motivated by the tight concentration of the norms of high-dimensional random vectors. Additionally, instead of applying strict normalization for the parameters, we constrain their norms. These modifications remove the need for weight decay and learning rate warm-up as well, but do not increase the total number of normalization layers. Our experiments with transformer architectures show up to 40% faster convergence compared to GPT models with QK normalization, with only 3% additional runtime cost. When deriving scaling laws, we found that our method enables training with larger batch sizes while preserving the favorable scaling characteristics of classic GPT architectures.
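The norm-concentration argument behind the scalar-multiplication trick can be demonstrated numerically. A minimal sketch, assuming standard-normal activations (the dimension and batch size below are arbitrary illustration choices, not values from the paper): for a high-dimensional vector with i.i.d. N(0, 1) entries, the norm concentrates tightly around sqrt(d), so dividing by the fixed scalar sqrt(d) approximately normalizes the vector without computing any per-vector norms.

```python
import numpy as np

d = 4096                                   # hypothetical model dimension
rng = np.random.default_rng(0)
x = rng.standard_normal((8, d))            # a batch of random activations

# Exact normalization needs one norm per vector ...
x_exact = x / np.linalg.norm(x, axis=1, keepdims=True)

# ... but since ||x|| ≈ sqrt(d) with high probability, a single scalar
# multiplication already does almost the same job.
x_approx = x / np.sqrt(d)

# The approximate norms cluster tightly around 1.
norms = np.linalg.norm(x_approx, axis=1)
```

The relative deviation of the norm shrinks like 1/sqrt(d), so the approximation gets better precisely in the high-dimensional regime where exact normalization is most expensive.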
Why Do We Need Warm-up? A Theoretical Perspective
Alimisis, Foivos, Islamov, Rustem, Lucchi, Aurelien
Training modern machine learning models requires a careful choice of hyperparameters. A common practice for setting the learning rate (LR) is to linearly increase the LR in the beginning (warm-up stage) [Goyal et al., 2017, Vaswani et al., 2017] and gradually decrease at the end of the training (decay stage) [Loshchilov and Hutter, 2016, Vaswani et al., 2017, Hoffmann et al., 2022b, Zhang et al., 2023, Dremov et al., 2025]. Decaying the LR is a classical requirement in the theoretical analysis of SGD, ensuring convergence under broad conditions [Defazio et al., 2023, Gower et al., 2021], and it has been consistently observed to improve empirical performance [Loshchilov and Hutter, 2016, Hu et al., 2024, Hägele et al., 2024]. Recent work further demonstrates that decaying step sizes can improve theoretical guarantees by yielding tighter bounds [Schaipp et al., 2025]. By contrast, the practice of linearly increasing the LR at the start of training (warm-up phase) has become nearly ubiquitous in modern deep learning [He et al., 2016, Hu et al., 2024, Hägele et al., 2024], yet a clear theoretical understanding of why it helps optimization remains elusive. This raises the central question we address in this paper: Why does LR warm-up improve training, and under what conditions can its benefits be theoretically justified?
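The warm-up-then-decay practice described above can be sketched as a schedule function. This is a generic illustration, not the paper's analysis: the peak learning rate, warm-up length, and the cosine shape of the decay are common tuning choices assumed here for concreteness.

```python
import math

def lr_schedule(step, total_steps, peak_lr=3e-4, warmup_steps=1000):
    """Linear warm-up to peak_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        # Warm-up stage: LR increases linearly from ~0 to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Decay stage: cosine anneal over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The decay stage is the part with classical convergence guarantees; the linearly increasing warm-up stage is the part whose theoretical justification the paper investigates.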
impressively engineered combination of multiple different techniques with associated hyperparameters: a warm-up
We are grateful to the reviewers for their time and their thoughtful comments, which we believe will improve the paper. We first clarify the comparison with DivideMix and then address all individual comments below. Following the same approach as DivideMix, the accuracy for ELR+ is the same or even higher (e.g. We will explain all of this in our revision, including possible limitations as suggested by Reviewer 3. The memorization effect is not new to the community. We believe that it may be due to a reduction in confirmation bias.