AITopics

2605.20716

Genre: Research Report > New Finding (0.34)

Industry: Health & Medicine > Therapeutic Area (0.49)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

arXiv.org Machine Learning, May-22-2026

Three Costs of Amortizing Gaussian Process Inference with Neural Processes

Young, Robin

Neural processes amortize Gaussian process inference, replacing the exact $O(n^3)$ posterior with a learned $O(n)$ map from context sets to predictive distributions. For a class of latent neural processes, we bound the Kullback--Leibler (KL) divergence between the GP and LNP predictives, decomposing it into three interpretable sources, namely label contamination as the neural process uses label values to estimate a quantity that is label-independent in the exact GP, an information bottleneck because the finite-dimensional representation cannot resolve the full context geometry, and amortization error from a single encoder network shared across all contexts. The bottleneck truncation term decays in the representation dimension $d$ as $O(e^{-cd^{2/d_x}})$ for squared-exponential kernels on $\mathbb{R}^{d_x}$ where $c > 0$ is a kernel-dependent constant and as $O(d^{-2ν/d_x})$ for Matérn-$ν$ kernels, directly linking architecture sizing to kernel smoothness and input dimension. The label contamination term is $O(1)$ in general, with only the observation-noise component decaying as $O(1/n)$, identifying a persistent cost of routing uncertainty estimation through a label-dependent representation. These results characterize the costs of amortization within the analyzed class and yield architectural recommendations to predict variance from context locations alone in the GP-amortization regime, and replace mean aggregation with second-order pooling to close the dominant amortization gap.
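The label-independence mentioned in the abstract can be checked directly on an exact GP. The following is a minimal NumPy sketch, not code from the paper: the squared-exponential kernel, the lengthscale, the noise level, and the function names `rbf_kernel` and `gp_posterior` are all illustrative assumptions. It shows the $O(n^3)$ Cholesky step and that the posterior variance is unchanged when the context labels change.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel with unit prior variance (assumed for illustration)."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(x_ctx, y_ctx, x_tgt, noise_std=0.1):
    """Exact GP posterior mean and latent variance at the target inputs."""
    K = rbf_kernel(x_ctx, x_ctx) + noise_std**2 * np.eye(len(x_ctx))
    L = np.linalg.cholesky(K)                      # the O(n^3) step
    K_star = rbf_kernel(x_ctx, x_tgt)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_ctx))
    mean = K_star.T @ alpha                        # depends on the labels
    v = np.linalg.solve(L, K_star)
    var = 1.0 - np.sum(v**2, axis=0)               # depends on locations only
    return mean, var

rng = np.random.default_rng(0)
x_ctx = np.sort(rng.uniform(-3.0, 3.0, size=20))
x_tgt = np.linspace(-3.0, 3.0, 7)

# Same context locations, two unrelated label vectors.
mean_a, var_a = gp_posterior(x_ctx, np.sin(x_ctx), x_tgt)
mean_b, var_b = gp_posterior(x_ctx, rng.normal(size=20), x_tgt)

print("variances identical:", np.allclose(var_a, var_b))
print("means identical:", np.allclose(mean_a, mean_b))
```

In this sketch `y_ctx` never enters the variance computation, which is the property the abstract refers to when it calls the quantity label-independent in the exact GP.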

artificial intelligence, machine learning, variance, (20 more...)

2605.21798

Country: Europe > United Kingdom > England > Cambridgeshire > Cambridge (1.00)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Modeling & Simulation (0.70)

arXiv.org Machine Learning, May-20-2026

When Individually Calibrated Models Become Collectively Miscalibrated

Wang, Zhaohui

A natural assumption is that if each model is individually calibrated, the aggregate prediction will also be well calibrated. We show that this assumption fails in multi-agent settings: individually calibrated predictors can become collectively miscalibrated when their predictions interact strategically. Here "strategically" refers to the game-theoretic sense of Brier-optimal local response, not deliberate gaming or collusion; such interaction arises naturally whenever agents are independently trained on overlapping data. This phenomenon affects multiple independent agents in federated healthcare, multi-vendor intrusion detection, and crowdsourced forecasting, where agents optimize their own objectives. Specifically, we prove that under Brier-score-based aggregation with positively correlated beliefs, each agent's individually optimal report systematically underestimates the positive-class probability, yielding a Price of Anarchy (PoA) strictly greater than one whenever Cov(b_i, b_j) > 0. At our canonical setting (n=5 agents, pairwise correlation ρ=0.5, base rate µ=0.3, threshold τ=0.3) the empirically measured PoA in false-negative rate is 7.25 (mean aggregate bias 0.375). In contrast, VCG-based aggregation, which rewards each agent's marginal contribution to aggregate accuracy, achieves dominant-strategy incentive compatibility and the lowest empirical PoA among all mechanisms studied (PoA 1.0). On three real-world datasets (NSL-KDD, UNSW-NB15, Credit Card Fraud) with feature-partitioned agents, VCG provides the strongest robustness guarantees among the aggregation methods we evaluate, while maintaining comparable accuracy. In data-sparse regimes (n 500), VCG consistently outperforms stacking and majority voting; under adversarial agents, VCG maintains substantially lower false-negative rates than robust aggregation baselines. Adaptive weight updates further reduce false negatives by 20-22% under distribution shift, with O(√T) online regret guarantees. These results establish that how probabilistic predictions are aggregated matters as much as how well individual models are calibrated.
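The simplest form of the gap between individual and collective calibration can be checked by exact enumeration. The sketch below is a toy illustration and not the paper's strategic (Brier-optimal response) model: the two-agent setup, the prior `PRIOR`, the signal accuracy `ACC`, and plain averaging as the aggregation rule are all assumptions chosen for illustration.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

PRIOR = Fraction(1, 2)   # assumed P(Y = 1)
ACC = Fraction(4, 5)     # assumed P(signal == Y), independent across agents given Y

def joint():
    """Yield (y, s1, s2, probability) for every outcome of the toy model."""
    for y, s1, s2 in product((0, 1), repeat=3):
        p = PRIOR if y == 1 else 1 - PRIOR
        for s in (s1, s2):
            p *= ACC if s == y else 1 - ACC
        yield y, s1, s2, p

def report(signal):
    """Each agent's posterior P(Y = 1 | own signal) under the symmetric prior."""
    return ACC if signal == 1 else 1 - ACC

def frequency_given_forecast(forecast_fn):
    """Map each distinct forecast value to the true frequency of Y = 1 given it."""
    mass = defaultdict(Fraction)
    hits = defaultdict(Fraction)
    for y, s1, s2, p in joint():
        f = forecast_fn(s1, s2)
        mass[f] += p
        hits[f] += p * y
    return {f: hits[f] / mass[f] for f in mass}

# A calibrated forecaster has frequency == forecast for every forecast value.
single = frequency_given_forecast(lambda s1, s2: report(s1))
average = frequency_given_forecast(lambda s1, s2: (report(s1) + report(s2)) / 2)

for name, table in (("single agent", single), ("simple average", average)):
    for forecast, freq in sorted(table.items()):
        print(f"{name}: forecast {forecast} -> frequency {freq}")
```

Each agent alone is calibrated in this toy model, while the averaged forecast is not: when both agents report 4/5, the event occurs more often than 4/5 of the time. The paper's result concerns a different, strategic mechanism, so this only illustrates why aggregate calibration does not follow from individual calibration.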

agent, artificial intelligence, machine learning, (19 more...)

2605.18858

Country: North America > United States (0.45)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Law Enforcement & Public Safety (0.87)
Health & Medicine > Therapeutic Area > Endocrinology (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.67)

Chen, Hao, Bozorgasl, Zavareh

Resource-Element Energy Difference for Noncoherent Over-the-Air Federated Learning

arXiv.org Machine Learning, May-19-2026

Over-the-air federated learning (OTA-FL) reduces uplink latency by aggregating client updates directly over the wireless multiple-access channel. Coherent analog aggregation realizes this idea by aligning the phases and amplitudes of simultaneously transmitted waveforms, which typically requires synchronization, instantaneous channel-state information (CSI), phase compensation, and power control. Noncoherent energy detection removes the need for phase-coherent combining, but a single energy measurement is nonnegative and, therefore, cannot represent signed model updates. This paper introduces resource-element energy difference (REED), a noncoherent physical-layer primitive for continuous signed aggregation. REED maps the positive and negative parts of each real-valued update to transmit energies on paired orthogonal resource elements and estimates the signed sum by subtracting the corresponding received energies. The construction uses slow-timescale calibration of average channel powers, but does not require instantaneous transmitter- or receiver-side CSI or channel inversion. For independent Rayleigh fading, we derive exact first- and second-moment expressions for single-shot REED and for a chip-diverse extension that spreads each coordinate over multiple independently faded paired chips. The resulting variance laws separate fading-induced self-noise, signal-noise interaction, and receiver-noise fluctuation, giving an explicit diversity-resource tradeoff.
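The energy-difference idea described in the abstract can be sketched as a small Monte Carlo simulation. This is a hedged illustration rather than the paper's construction: the function name `reed_round`, the per-client average gains, the noise power, scaling transmit energy by the inverse average channel power, and independent fading on the two paired resource elements are all assumptions made here for illustration.

```python
import numpy as np

def reed_round(updates, avg_gain, noise_power, rng):
    """One single-shot round for one coordinate: returns a signed-sum estimate."""
    pos = np.maximum(updates, 0.0)    # positive parts go on the first resource element
    neg = np.maximum(-updates, 0.0)   # negative parts go on the paired element
    # Slow-timescale calibration (assumed form): divide energy by average channel power.
    amp_pos = np.sqrt(pos / avg_gain)
    amp_neg = np.sqrt(neg / avg_gain)
    k = len(updates)

    def fading():
        # Independent Rayleigh fading: circular complex Gaussian with power avg_gain.
        return np.sqrt(avg_gain / 2) * (rng.standard_normal(k) + 1j * rng.standard_normal(k))

    def receiver_noise():
        return np.sqrt(noise_power / 2) * (rng.standard_normal() + 1j * rng.standard_normal())

    y_pos = np.sum(fading() * amp_pos) + receiver_noise()
    y_neg = np.sum(fading() * amp_neg) + receiver_noise()
    # No phase information is used; only the two received energies are compared.
    return abs(y_pos) ** 2 - abs(y_neg) ** 2

rng = np.random.default_rng(0)
updates = np.array([0.5, -0.2, 0.1, -0.4, 0.3])    # one coordinate from each of 5 clients
avg_gain = np.array([1.0, 0.5, 2.0, 1.5, 0.8])     # hypothetical average channel powers

estimates = np.array([reed_round(updates, avg_gain, 0.1, rng) for _ in range(20000)])
print("true signed sum:", updates.sum())
print("mean REED estimate:", estimates.mean())
print("single-shot std:", estimates.std())
```

Under these assumptions the expected receiver-noise energy is the same on both elements and cancels in the difference, so the estimate is unbiased on average, while the large single-shot spread reflects the fading-induced self-noise that the abstract's chip-diverse extension is meant to reduce.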

aggregation, artificial intelligence, machine learning, (17 more...)

2605.07263

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Vankadara, Leena Chennuru, Haas, Moritz, Hayward, Luke, Bordt, Sebastian, Breccia, Alessandro

How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

arXiv.org Machine Learning, May-15-2026

Recent frontier large language models predominantly rely on Mixture-of-Experts (MoE) architectures. Despite empirical progress, there is still no principled understanding of how hyperparameters should scale with network width $N$, expert width $N_e$, number of experts $M$, sparsity $K$, and depth $L$ to ensure both stability and optimal performance at scale. We take a principled step toward resolving this gap by analyzing three different scaling regimes: (I) co-scaling $N\asymp N_e$, (II) co-scaling $N\asymp M\asymp K$, and (III) full proportional scaling of $N, N_e, M$, and $K$. For each regime, we develop a novel Dynamical Mean Field Theory (DMFT) description of the limiting training dynamics of MoEs that provides a formal foundation for our analysis. Within this framework, we derive the unique parameterization for SGD and Adam satisfying all maximal-update ($μ$) desiderata. We then show that the resulting $μ$P prescription does not reliably induce monotonic improvement with scale or robust learning-rate transfer. We trace these pathologies to scale-dependent observables in the aggregation dynamics, which motivates a refined set of desiderata that we term maximal scale stability. Guided by this principle, we derive a Maximally Scale-Stable Parameterization (MSSP) for both SGD and Adam in all three scaling regimes, and characterize the corresponding limiting dynamics - qualitatively distinct from the $μ$P limit - through a separate DMFT analysis. Experiments verify that MSSP robustly recovers learning rate transfer and monotonic improvement with scale across regimes. Combined with existing depth-scaling theory, these results provide a complete scaling prescription for MoE architectures as a function of width, depth, expert width, and number of experts.

large language model, machine learning, natural language, (19 more...)

2605.142

Country: North America (0.45)

Genre: Research Report (0.40)

Industry: Health & Medicine (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.65)

arXiv.org Machine Learning, May-8-2026

Hedging Memory Horizons for Non-Stationary Prediction via Online Aggregation

Wang, Yutong, Goude, Yannig, Yao, Qiwei

We study online prediction under distribution shift, where inputs arrive chronologically and outcomes are revealed only after prediction. In this setting, predictors must remain stable in quiet regimes yet adapt when regimes shift, and the right adaptation memory is unknown in advance. We propose MELO (Memory-hedged Exponentially Weighted Least-Squares Online aggregation), a model-agnostic method that hedges across adaptation scales: it wraps any non-anticipating base-predictor pool with exponentially weighted least-squares (EWLS) adaptation experts at multiple forgetting factors, and aggregates raw and EWLS-adapted forecasts with MLpol, a parameter-free online aggregation rule. Under boundedness conditions, we establish deterministic oracle inequalities showing that it competes with both the best raw predictor and the best bounded, time-varying affine combinations of the base predictions, up to a path-length-dependent tracking cost and a sublinear aggregation overhead. We evaluate MELO on French national electricity-load forecasting through the COVID-19 lockdown using no regime indicators, lockdown dates, or policy covariates. MELO reduces overall RMSE by 34.7% relative to base-only MLpol and achieves lower overall RMSE than a TabICL reference supplied with an external COVID policy-response covariate. MELO requires only lightweight per-step recursive updates without model retraining.
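A single EWLS adaptation expert of the kind the abstract mentions can be sketched with the standard recursive least-squares update with exponential forgetting. This is an assumption-laden illustration, not the paper's implementation: the class name `EWLSExpert`, the affine correction of one base forecast, and the initialization are hypothetical, and the MLpol aggregation step is not shown.

```python
import numpy as np

class EWLSExpert:
    """Affine correction of one base forecast, fitted by RLS with forgetting."""

    def __init__(self, forgetting, init_scale=1e3):
        self.lam = forgetting                  # closer to 1 means longer memory
        self.theta = np.array([0.0, 1.0])      # start at the identity map
        self.P = init_scale * np.eye(2)        # inverse weighted Gram matrix

    def predict(self, base_forecast):
        return float(self.theta @ np.array([1.0, base_forecast]))

    def update(self, base_forecast, outcome):
        """Lightweight per-step recursive update; no refitting from scratch."""
        x = np.array([1.0, base_forecast])
        Px = self.P @ x
        gain = Px / (self.lam + x @ Px)
        self.theta = self.theta + gain * (outcome - self.theta @ x)
        self.P = (self.P - np.outer(gain, Px)) / self.lam

rng = np.random.default_rng(0)
experts = {lam: EWLSExpert(lam) for lam in (0.90, 0.95, 0.99)}   # several memory horizons

# Toy stream: the outcome is a biased, rescaled version of the base forecast.
for _ in range(300):
    base = rng.uniform(0.0, 10.0)
    outcome = 2.0 + 0.5 * base
    for expert in experts.values():
        _ = expert.predict(base)      # predict first, outcome revealed afterwards
        expert.update(base, outcome)

for lam, expert in experts.items():
    print(lam, expert.theta)
```

Each forgetting factor corresponds to a different effective memory length, so keeping a small bank of such experts and letting an online aggregation rule weight them is one way to hedge when the right adaptation memory is unknown.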

data mining, machine learning, prediction, (20 more...)

2605.06541

Country: Europe (0.28)

Genre: Research Report (1.00)

Industry: Energy > Power Industry (0.88)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
(2 more...)

Neural Information Processing Systems, May-1-2026, 01:51:05 GMT

Focal Modulation Networks

We propose focal modulation networks (FocalNets in short), where self-attention (SA) is completely replaced by a focal modulation module for modeling token interactions in vision. Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges, (ii) gated aggregation to selectively gather contexts for each query token based on its content, and (iii) element-wise modulation or affine transformation to fuse the aggregated context into the query. Extensive experiments show FocalNets outperform the state-of-the-art SA counterparts (e.g., Swin and Focal Transformers) with similar computational cost on the tasks of image classification, object detection, and semantic segmentation. Specifically, FocalNets with tiny and base size achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K.

arxiv preprint arxiv, machine learning, natural language, (17 more...)

Genre: Research Report (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Neural Information Processing Systems, Apr-30-2026, 19:24:51 GMT

0004d0b59e19461ff126e3a08a814c33-Supplemental.pdf

artificial intelligence, dataset, machine learning, (17 more...)

Genre: Research Report (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Neural Information Processing Systems, Apr-30-2026, 19:24:43 GMT

A Graph Similarity for Deep Learning

Graph neural networks (GNNs) have been successful in learning representations from graphs. Many popular GNNs follow the pattern of aggregate-transform: they aggregate the neighbors' attributes and then transform the results of aggregation with a learnable function. Analyses of these GNNs explain which pairs of non-identical graphs have different representations. However, we still lack an understanding of how similar these representations will be. We adopt kernel distance and propose transform-sum-cat as an alternative to aggregate-transform to reflect the continuous similarity between the node neighborhoods in the neighborhood aggregation. The idea leads to a simple and efficient graph similarity, which we name Weisfeiler-Leman similarity (WLS). In contrast to existing graph kernels, WLS is easy to implement with common deep learning frameworks. In graph classification experiments, transform-sum-cat significantly outperforms other neighborhood aggregation methods from popular GNN models. We also develop a simple and fast GNN model based on transform-sum-cat, which obtains, in comparison with widely used GNN models, (1) a higher accuracy in node classification, (2) a lower absolute error in graph regression, and (3) greater stability in adversarial training of graph generation.
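One plausible reading of the name transform-sum-cat is: transform every node's attributes first, sum the transformed attributes over each node's neighbors, then concatenate that sum with the node's own transformed attributes. The NumPy sketch below follows that reading only; the function name, the `tanh` transform, the single weight matrix, and the exact concatenation order are assumptions and may differ from the paper's definition.

```python
import numpy as np

def transform_sum_cat(features, adjacency, weight):
    """One layer under the assumed reading of transform-sum-cat.

    features:  (n_nodes, d_in) node attributes
    adjacency: (n_nodes, n_nodes) 0/1 matrix without self-loops
    weight:    (d_in, d_hidden) parameters of the assumed transform
    """
    transformed = np.tanh(features @ weight)     # transform each node first
    neighbor_sum = adjacency @ transformed       # sum over each node's neighbors
    return np.concatenate([transformed, neighbor_sum], axis=1)   # cat with own

# Path graph 0 - 1 - 2 with two-dimensional attributes.
adjacency = np.array([[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]], dtype=float)
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
rng = np.random.default_rng(0)
weight = rng.normal(size=(2, 4))

out = transform_sum_cat(features, adjacency, weight)
print(out.shape)
```

The contrast with aggregate-transform in this sketch is only the order of operations: the nonlinearity is applied before summation, so the sum runs over transformed neighbor attributes rather than raw ones.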

artificial intelligence, deep learning, machine learning, (20 more...)

Genre: Research Report (0.46)

Industry: Information Technology (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing Systems, Apr-30-2026, 19:24:32 GMT

0004d0b59e19461ff126e3a08a814c33-AuthorFeedback.pdf

We sincerely appreciate the reviewers for their careful reading, constructive questions and suggestions. We would very much like further exchanges to improve our work, but the following is our best effort within the current limits. First, we address questions that appeared at least twice. We write P1, P2 for paragraph reference, and Rx for reviewers. We discuss two main motivations here: lack of graph loss, and empirical failure of distinguishing power.

artificial intelligence, machine learning, representation, (16 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.72)