warm-up
FedAPA: Federated Learning with Adaptive Prototype Aggregation Toward Heterogeneous Wi-Fi CSI-based Crowd Counting
Guo, Jingtao, Mao, Yuyi, Ho, Ivan Wang-Hei
Wi-Fi channel state information (CSI)-based sensing provides a non-invasive, device-free approach for tasks such as human activity recognition and crowd counting, but large-scale deployment is hindered by the need for extensive site-specific training data. Federated learning (FL) offers a way to avoid raw data sharing but is challenged by heterogeneous sensing data and device resources. This paper proposes FedAPA, a collaborative Wi-Fi CSI-based sensing algorithm that uses an adaptive prototype aggregation (APA) strategy to assign similarity-based weights to peer prototypes, enabling adaptive client contributions and yielding a personalized global prototype for each client instead of a fixed-weight aggregation. During local training, we adopt a hybrid objective that combines classification learning with representation contrastive learning to align local and global knowledge. We provide a convergence analysis of FedAPA and evaluate it in a real-world distributed Wi-Fi crowd counting scenario with six environments and up to 20 people. The results show that our method outperforms multiple baselines in terms of accuracy, F1 score, mean absolute error (MAE), and communication overhead, with FedAPA achieving at least a 9.65% increase in accuracy, a 9% gain in F1 score, a 0.29 reduction in MAE, and a 95.94% reduction in communication overhead.
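The similarity-weighted prototype aggregation described above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function name `aggregate_prototypes`, the use of cosine similarity, and the softmax weighting are all assumptions standing in for the unspecified "similarity-based weights".

```python
import numpy as np

def aggregate_prototypes(local_proto, peer_protos):
    """Similarity-weighted aggregation of peer prototypes (illustrative sketch).

    local_proto:  (d,) array, the client's own class prototype.
    peer_protos:  (n, d) array, prototypes received from n peers.
    Returns a personalized global prototype for this client.
    """
    # Cosine similarity between the local prototype and each peer prototype.
    sims = peer_protos @ local_proto / (
        np.linalg.norm(peer_protos, axis=1) * np.linalg.norm(local_proto) + 1e-8
    )
    # Softmax turns similarities into adaptive aggregation weights:
    # peers whose prototypes resemble the local one contribute more.
    weights = np.exp(sims) / np.exp(sims).sum()
    return weights @ peer_protos
```

Because the weights depend on the receiving client's own prototype, each client ends up with its own personalized global prototype rather than one shared fixed-weight average.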
How to shovel snow without landing in the emergency room
Avoid injury and improve efficiency with tips from a physical therapist. Don't be a snow hero. For life's most essential resource, water knows a hundred ways to kill you if you're not careful. When it's not trying to drown you in its pools and coastlines during the summer, it shape-shifts to snow in the winter, piling up emergency room visits for those forced to shovel it.
Learning in Compact Spaces with Approximately Normalized Transformer
Franke, Jörg K. H., Spiegelhalter, Urs, Nezhurina, Marianna, Jitsev, Jenia, Hutter, Frank, Hefenbrock, Michael
The successful training of deep neural networks requires addressing challenges such as overfitting, numerical instabilities leading to divergence, and increasing variance in the residual stream. A common solution is to apply regularization and normalization techniques that usually require tuning additional hyperparameters. An alternative is to force all parameters and representations to lie on a hypersphere. This removes the need for regularization and increases convergence speed, but comes with additional costs. In this work, we propose a more holistic, approximate normalization via simple scalar multiplications motivated by the tight concentration of the norms of high-dimensional random vectors. Additionally, instead of applying strict normalization for the parameters, we constrain their norms. These modifications remove the need for weight decay and learning rate warm-up as well, but do not increase the total number of normalization layers. Our experiments with transformer architectures show up to 40% faster convergence compared to GPT models with QK normalization, with only 3% additional runtime cost. When deriving scaling laws, we found that our method enables training with larger batch sizes while preserving the favorable scaling characteristics of classic GPT architectures.
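The norm-concentration argument behind the scalar-multiplication trick can be demonstrated numerically. A minimal sketch, assuming standard-normal activations (the dimension and batch size below are arbitrary illustration choices, not values from the paper): for a high-dimensional vector with i.i.d. N(0, 1) entries, the norm concentrates tightly around sqrt(d), so dividing by the fixed scalar sqrt(d) approximately normalizes the vector without computing any per-vector norms.

```python
import numpy as np

d = 4096                                   # hypothetical model dimension
rng = np.random.default_rng(0)
x = rng.standard_normal((8, d))            # a batch of random activations

# Exact normalization needs one norm per vector ...
x_exact = x / np.linalg.norm(x, axis=1, keepdims=True)

# ... but since ||x|| ≈ sqrt(d) with high probability, a single scalar
# multiplication already does almost the same job.
x_approx = x / np.sqrt(d)

# The approximate norms cluster tightly around 1.
norms = np.linalg.norm(x_approx, axis=1)
```

The relative deviation of the norm shrinks like 1/sqrt(d), so the approximation gets better precisely in the high-dimensional regime where exact normalization is most expensive.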
Why Do We Need Warm-up? A Theoretical Perspective
Alimisis, Foivos, Islamov, Rustem, Lucchi, Aurelien
Training modern machine learning models requires a careful choice of hyperparameters. A common practice for setting the learning rate (LR) is to linearly increase the LR in the beginning (warm-up stage) [Goyal et al., 2017, Vaswani et al., 2017] and gradually decrease at the end of the training (decay stage) [Loshchilov and Hutter, 2016, Vaswani et al., 2017, Hoffmann et al., 2022b, Zhang et al., 2023, Dremov et al., 2025]. Decaying the LR is a classical requirement in the theoretical analysis of SGD, ensuring convergence under broad conditions [Defazio et al., 2023, Gower et al., 2021], and it has been consistently observed to improve empirical performance [Loshchilov and Hutter, 2016, Hu et al., 2024, Hägele et al., 2024]. Recent work further demonstrates that decaying step sizes can improve theoretical guarantees by yielding tighter bounds [Schaipp et al., 2025]. By contrast, the practice of linearly increasing the LR at the start of training (warm-up phase) has become nearly ubiquitous in modern deep learning [He et al., 2016, Hu et al., 2024, Hägele et al., 2024], yet a clear theoretical understanding of why it helps optimization remains elusive. This raises the central question we address in this paper: Why does LR warm-up improve training, and under what conditions can its benefits be theoretically justified?
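The warm-up-then-decay practice described above can be sketched as a schedule function. This is a generic illustration, not the paper's analysis: the peak learning rate, warm-up length, and the cosine shape of the decay are common tuning choices assumed here for concreteness.

```python
import math

def lr_schedule(step, total_steps, peak_lr=3e-4, warmup_steps=1000):
    """Linear warm-up to peak_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        # Warm-up stage: LR increases linearly from ~0 to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Decay stage: cosine anneal over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The decay stage is the part with classical convergence guarantees; the linearly increasing warm-up stage is the part whose theoretical justification the paper investigates.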
impressively engineered combination of multiple different techniques with associated hyperparameters: a warm-up
We are grateful to the reviewers for their time and their thoughtful comments, which we believe will improve the paper. We first clarify the comparison with DivideMix and then address all individual comments below. Following the same approach as DivideMix, the accuracy for ELR+ is the same or even higher (e.g. We will explain all of this in our revision, including possible limitations as suggested by Reviewer 3. The memorization effect is not new to the community. We believe that it may be due to a reduction in confirmation bias.