ccuracy
Unveiling Memorization-Generalization Coexistence: A Case Study on Arithmetic Tasks with Label Noise
Highly over-parameterized models can simultaneously memorize noisy labels and generalize well, yet how these behaviors coexist remains poorly understood. In this work, we investigate the underlying mechanisms of this coexistence using modular arithmetic tasks under heavy label noise. Through extensive experiments on two-layer neural networks, we find that larger models tend to generalize better under appropriate optimization and model configurations, while noisy labels are memorized faster than clean data. Over-parameterized models internally form a generalization structure, but its expression in the output is suppressed by the need to fit noisy labels. Remarkably, even with 80\% label noise, near-perfect test accuracy can be achieved by extracting this internal structure using frequency-based methods. We further propose a task-agnostic method to partition networks into generalization and memorization components. Although this subnetwork improves generalization, it is limited compared with frequency-based extraction, indicating that the generalization structure is distributed across neurons and motivating the development of new tools to retrieve generalizable knowledge from over-parameterized networks.
Revisiting Out of distribution Robustness in NLP Benchmark Analysis and LLMs Evaluations
We find that the distribution shift settings in previous studies commonly lack adequate challenges, hindering the accurate evaluation of OOD robustness. To address these issues, we propose a benchmark construction protocol that ensures clear differentiation and challenging distribution shifts. Then we introduce BOSS, a Benchmark suite for Out-of-distribution robustneSS evaluation covering 5 tasks and 20 datasets. Based on BOSS, we conduct a series of experiments on pretrained language models for analysis and evaluation of OOD robustness. First, for vanilla fine-tuning, we examine the relationship between in-distribution (ID) and OOD performance. We identify three typical types that unveil the inner learning mechanism, which could potentially facilitate the forecasting of OOD robustness, correlating with the advancements on ID datasets. Then, we evaluate 5 classic methods on BOSS and find that, despite exhibiting some effectiveness in specific cases, they do not offer significant improvement compared to vanilla fine-tuning. Further, we evaluate 5 LLMs with various adaptation paradigms and find that when sufficient ID data is available, fine-tuning domain-specific models outperform LLMs on ID examples significantly.
Stimulative Training of Residual Networks: ASocial Psychology Perspective of Loafing
We further verify that stimulative training can well handle the loafing problem on different datasets and residual networks. As shown in Fig. r1, we can see that stimulative training can always improve the performance of a given residual network and all of its sub-networks by a larger margin on various residual networks and benchmark datasets. In other words, different residual networks trained on different datasets invariably suffer from the problem of network loafing, which can be well solved by the proposed stimulative training strategy. Figure r1: Stimulative training can improve the performance of a given residual network and all of its sub-networks significantly. We further verify it on various residual networks and benchmark datasets.
Leveraging Teleconnections with Physics-Informed Graph Attention Networks for Long-Range Extreme Rainfall Forecasting in Thailand
Chobtham, Kiattikun, Sarinnapakorn, Kanoksri, Torsri, Kritanai, Deeprasertkul, Prattana, Kamma, Jirawan
Accurate rainfall forecasting, particularly for extreme events, remains a significant challenge in climatology and the Earth system. This paper presents novel physics-informed Graph Neural Networks (GNNs) combined with extreme-value analysis techniques to improve gauge-station rainfall predictions across Thailand. The model leverages a graph-structured representation of gauge stations to capture complex spatiotemporal patterns, and it offers explainability through teleconnections. We preprocess relevant climate indices that potentially influence regional rainfall. The proposed Graph Attention Network with Long Short-Term Memory (Attention-LSTM) applies the attention mechanism using initial edge features derived from simple orographic-precipitation physics formulation. The embeddings are subsequently processed by LSTM layers. To address extremes, we perform Peak-Over-Threshold (POT) mapping using the novel Spatial Season-aware Generalized Pareto Distribution (GPD) method, which overcomes limitations of traditional machine-learning models. Experiments demonstrate that our method outperforms well-established baselines across most regions, including areas prone to extremes, and remains strongly competitive with the state of the art. Compared with the operational forecasting system SEAS5, our real-world application improves extreme-event prediction and offers a practical enhancement to produce high-resolution maps that support decision-making in long-term water management.
Supplementary Materials A Constraint Explanation for Problem 2 Constraint 2b checks the loss of each sample and the derivation is shown as follows
Constraint 2b checks the loss of each sample and the derivation is shown as follows. Constraint 2d to 2i are mainly adopted from Bertsimas and Dunn [2019] on Chapter 8.2. Here we briefly explain the meaning and derivations of these constraints. Constraint 2g and 2h are set to ensure that if there's no split on Constraint 2i enforces the hierarchical structure of the tree. Algorithm 1 depicts the details of the Branch-and-bound scheme for training the optimal decision tree.
Experimenting with Normalization Layers in Federated Learning on non-IID scenarios
Casella, Bruno, Esposito, Roberto, Sciarappa, Antonio, Cavazzoni, Carlo, Aldinucci, Marco
Training Deep Learning (DL) models require large, high-quality datasets, often assembled with data from different institutions. Federated Learning (FL) has been emerging as a method for privacy-preserving pooling of datasets employing collaborative training from different institutions by iteratively globally aggregating locally trained models. One critical performance challenge of FL is operating on datasets not independently and identically distributed (non-IID) among the federation participants. Even though this fragility cannot be eliminated, it can be debunked by a suitable optimization of two hyper-parameters: layer normalization methods and collaboration frequency selection. In this work, we benchmark five different normalization layers for training Neural Networks (NNs), two families of non-IID data skew, and two datasets. Results show that Batch Normalization, widely employed for centralized DL, is not the best choice for FL, whereas Group and Layer Normalization consistently outperform Batch Normalization. Similarly, frequent model aggregation decreases convergence speed and mode quality.