aggregation
Decision-Path Patterns as Tree Reliability Signals: Path-based Adaptive Weighting for Random Forest Classification
The global uniform aggregation of random forests leaves conditional bias along the decision boundary uncorrected. To correct this locally, we propose exploiting the structural pattern of each tree's decision path. At inference, a random forest reaches its prediction through the root-to-leaf path the sample traverses in each tree, so path-level reliability offers a finer granularity than tree-level weighting can access. We show that reliability varies meaningfully across path patterns in the boundary region identified by the forest itself, and that using this signal yields a statistically significant accuracy improvement over RF on 36 binary classification benchmarks (Wilcoxon p < 0.0001). We further devise a way to measure the sufficiency of residual information in the fitted RF's decision boundary, providing an estimate of the expected gain before the method is applied; on the qualifying group identified this way, the method delivers a mean +0.99 pp accuracy improvement with strict wins on every dataset (7/0/0). Class-recall regression -- the typical failure mode of RF correction methods -- is measured: zero minority-recall regressions and a single majority-recall regression at the 0.2 pp threshold, indicating that the correction operates in the direction of bias reduction rather than class trade-off. Our work suggests that the structural information of decision paths, previously overlooked in random forest research, can contribute to RF performance improvement.
Three Costs of Amortizing Gaussian Process Inference with Neural Processes
Neural processes amortize Gaussian process inference, replacing the exact $O(n^3)$ posterior with a learned $O(n)$ map from context sets to predictive distributions. For a class of latent neural processes, we bound the Kullback--Leibler (KL) divergence between the GP and LNP predictives, decomposing it into three interpretable sources, namely label contamination as the neural process uses label values to estimate a quantity that is label-independent in the exact GP, an information bottleneck because the finite-dimensional representation cannot resolve the full context geometry, and amortization error from a single encoder network shared across all contexts. The bottleneck truncation term decays in the representation dimension $d$ as $O(e^{-cd^{2/d_x}})$ for squared-exponential kernels on $\mathbb{R}^{d_x}$ where $c > 0$ is a kernel-dependent constant and as $O(d^{-2ฮฝ/d_x})$ for Matรฉrn-$ฮฝ$ kernels, directly linking architecture sizing to kernel smoothness and input dimension. The label contamination term is $O(1)$ in general, with only the observation-noise component decaying as $O(1/n)$, identifying a persistent cost of routing uncertainty estimation through a label-dependent representation. These results characterize the costs of amortization within the analyzed class and yield architectural recommendations to predict variance from context locations alone in the GP-amortization regime, and replace mean aggregation with second-order pooling to close the dominant amortization gap.
When Individually Calibrated Models Become Collectively Miscalibrated
A natural assumption is that if each model is individually calibrated, the aggregate prediction will also be well calibrated. We show that this assumption fails in multi-agent settings: individually calibrated predictors can become collectively miscalibrated when their predictions interact strategically--where "strategically" refers to the game-theoretic sense of Brier-optimal local response, not deliberate gaming or collusion, and arises naturally whenever agents are independently trained on overlapping data. This phenomenon affects multiple independent agents in federated healthcare, multi-vendor intrusion detection, and crowdsourced forecasting, where agents optimize their own objectives. Specifically, we prove that under Brier-score-based aggregation with positively correlated beliefs each agent's individually optimal report systematically underestimates the positive-class probability, yielding a Price of Anarchy strictly greater than one whenever Cov(bi,bj) > 0. At our canonical setting (n=5 agents, pairwise correlation ฯ=0.5, base rate ยต=0.3, threshold ฯ=0.3) the empirically measured PoA in false-negative rate is 7.25 (mean aggregate bias 0.375). In contrast, VCG-based aggregation, which rewards each agent's marginal contribution to aggregate accuracy, achieves dominant-strategy incentive compatibility and the lowest empirical PoA among all mechanisms studied (PoA 1.0). On three real-world datasets (NSL-KDD, UNSW-NB15, Credit Card Fraud) with featurepartitioned agents, VCG provides the strongest robustness guarantees among the aggregation methods we evaluate, while maintaining comparable accuracy. In data-sparse regimes (n 500), VCG consistently outperforms stacking and majority voting; under adversarial agents, VCG maintains substantially lower false-negative rates than robust aggregation baselines. Adaptive weight updates further reduce false negatives by 20-22% under distribution shift, with O( T) online regret guarantees. These results establish that how probabilistic predictions are aggregated matters as much as how well individual models are calibrated.
Resource-Element Energy Difference for Noncoherent Over-the-Air Federated Learning
Over-the-air federated learning (OTA-FL) reduces uplink latency by aggregating client updates directly over the wireless multiple-access channel. Coherent analog aggregation realizes this idea by aligning the phases and amplitudes of simultaneously transmitted waveforms, which typically requires synchronization, instantaneous channel-state information (CSI), phase compensation, and power control. Noncoherent energy detection removes the need for phase-coherent combining, but a single energy measurement is nonnegative and, therefore, cannot represent signed model updates. This paper introduces resource-element energy difference (REED), a noncoherent physical-layer primitive for continuous signed aggregation. REED maps the positive and negative parts of each real-valued update to transmit energies on paired orthogonal resource elements and estimates the signed sum by subtracting the corresponding received energies. The construction uses slow-timescale calibration of average channel powers, but does not require instantaneous transmitter- or receiver-side CSI or channel inversion. For independent Rayleigh fading, we derive exact first- and second-moment expressions for single-shot REED and for a chip-diverse extension that spreads each coordinate over multiple independently faded paired chips. The resulting variance laws separate fading-induced self-noise, signal-noise interaction, and receiver-noise fluctuation, giving an explicit diversity-resource tradeoff. More->The rest of abstract is in the paper.
How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization
Vankadara, Leena Chennuru, Haas, Moritz, Hayward, Luke, Bordt, Sebastian, Breccia, Alessandro
Recent frontier large language models predominantly rely on Mixture-of-Experts (MoE) architectures. Despite empirical progress, there is still no principled understanding of how hyperparameters should scale with network width $N$, expert width $N_e$, number of experts $M$, sparsity $K$, and depth $L$ to ensure both stability and optimal performance at scale. We take a principled step toward resolving this gap by analyzing three different scaling regimes: (I) co-scaling $N\asymp N_e$, (II) co-scaling $N\asymp M\asymp K$, and (III) full proportional scaling of $N, N_e, M$, and $K$. For each regime, we develop a novel Dynamical Mean Field Theory (DMFT) description of the limiting training dynamics of MoEs that provides a formal foundation for our analysis. Within this framework, we derive the unique parameterization for SGD and Adam satisfying all maximal-update ($ฮผ$) desiderata. We then show that the resulting $ฮผ$P prescription does not reliably induce monotonic improvement with scale or robust learning-rate transfer. We trace these pathologies to scale-dependent observables in the aggregation dynamics, which motivates a refined set of desiderata that we term maximal scale stability. Guided by this principle, we derive a Maximally Scale-Stable Parameterization (MSSP) for both SGD and Adam in all three scaling regimes, and characterize the corresponding limiting dynamics - qualitatively distinct from the $ฮผ$P limit - through a separate DMFT analysis. Experiments verify that MSSP robustly recovers learning rate transfer and monotonic improvement with scale across regimes. Combined with existing depth-scaling theory, these results provide a complete scaling prescription for MoE architectures as a function of width, depth, expert width, and number of experts.
Hedging Memory Horizons for Non-Stationary Prediction via Online Aggregation
Wang, Yutong, Goude, Yannig, Yao, Qiwei
We study online prediction under distribution shift, where inputs arrive chronologically and outcomes are revealed only after prediction. In this setting, predictors must remain stable in quiet regimes yet adapt when regimes shift, and the right adaptation memory is unknown in advance. We propose MELO (Memory-hedged Exponentially Weighted Least-Squares Online aggregation), a model-agnostic method that hedges across adaptation scales: it wraps any non-anticipating base-predictor pool with exponentially weighted least-squares (EWLS) adaptation experts at multiple forgetting factors, and aggregates raw and EWLS-adapted forecasts with MLpol which is a parameter-free online aggregation rule. Under boundedness conditions, we establish deterministic oracle inequalities showing that it competes with both the best raw predictor and the best bounded, time-varying affine combinations of the base predictions, up to a path-length-dependent tracking cost and a sublinear aggregation overhead. We evaluate MELO on French national electricity-load forecasting through the COVID-19 lockdown using no regime indicators, lockdown dates, or policy covariates. MELO reduces overall RMSE by 34.7%relative to base-only MLpol and achieves lower overall RMSE than a TabICL reference supplied with an external COVID policy-response covariate. MELO requires only lightweight per-step recursive updates without model retraining.
Focal Modulation Networks
We propose focal modulation networks (FocalNets in short), where self-attention (SA) is completely replaced by a focal modulation module for modeling token interactions in vision. Focal modulation comprises three components: (i)hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges, (ii) gated aggregation to selectively gather contexts for each query token based on its content, and (iii) element-wise modulation or affine transformation to fuse the aggregated context into the query. Extensive experiments show FocalNets outperform the state-of-the-art SA counterparts (e.g., Swin and Focal Transformers) with similar computational cost on the tasks of image classification, object detection, and semantic segmentation. Specifically, FocalNets with tiny and base size achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K.
AGraph Similarity for Deep Learning
Graph neural networks (GNNs) have been successful in learning representations from graphs. Many popular GNNs follow the pattern of aggregate-transform: they aggregate the neighbors' attributes and then transform the results of aggregation with a learnable function. Analyses of these GNNs explain which pairs of non-identical graphs have different representations. However, we still lack an understanding of how similar these representations will be. We adopt kernel distance and propose transform-sum-cat as an alternative to aggregate-transform to reflect the continuous similarity between the node neighborhoods in the neighborhood aggregation. The idea leads to a simple and efficient graph similarity, which we name Weisfeiler-Leman similarity (WLS). In contrast to existing graph kernels, WLS is easy to implement with common deep learning frameworks. In graph classification experiments, transform-sum-cat significantly outperforms other neighborhood aggregation methods from popular GNN models. We also develop a simple and fast GNN model based on transform-sum-cat, which obtains, in comparison with widely used GNN models, (1) a higher accuracy in node classification, (2) a lower absolute error in graph regression, and (3) greater stability in adversarial training of graph generation.
0004d0b59e19461ff126e3a08a814c33-AuthorFeedback.pdf
We sincerely appreciate the reviewers for their careful reading, constructive questions and suggestions. We would very1 much like further exchanges to improve our work, but the following is our best effort within the current limits.2 First, we address questions appeared at least twice. We write P1, P2 for paragraph reference, and Rx for reviewers.3 We discuss two main motivations here: lack of graph loss, and empirical failure4 of distinguishing power.