$k$-Nearest Neighbors in Gromov--Wasserstein Space

arXiv.org Machine Learning

The Gromov--Wasserstein (GW) distance provides a framework for comparing metric measure spaces, regardless of their underlying structure or geometry. For network-based data, it enables direct comparisons of graphs with different numbers of nodes, without requiring an embedding or other abstraction. Furthermore, through a variant of GW known as fused Gromov--Wasserstein (fGW), it is also possible to incorporate node features in addition to graph structure. In this work, we implement $k$-nearest neighbors ($k$-NN) classification using the GW and fGW distances. We prove the universal consistency of the GW-$k$-NN classifier on the space of equivalence classes of metric measure spaces with finite support and uniform probability measure. By viewing graphs as finitely supported metric measure spaces equipped with the pairwise distance metric and a uniform probability measure on the nodes, we obtain universal consistency of GW-$k$-NN for the space of graphs. Likewise for fGW-$k$-NN, we prove universal consistency on the space of weak isomorphism classes of structured objects consisting of metric measure spaces with finite support and uniform probability measure and feature maps into Euclidean space, thus establishing universal consistency on the space of node-attributed graphs. Our numerical experiments show that GW-$k$-NN and fGW-$k$-NN consistently perform well across multiple graph datasets, suggesting that metric classifiers such as $k$-NN work well in the GW framework.
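As a minimal sketch of the classification step described above, the following assumes the GW (or fGW) distances from a query graph to each training graph have already been computed by some solver; it does not compute them. The function name `knn_predict`, the tie-breaking rule, and the example distances are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter

def knn_predict(dist_to_train, train_labels, k):
    """Classify one query from its distances to every training object.

    dist_to_train[i] is assumed to be a precomputed GW (or fGW) distance
    between the query and training object i.
    """
    if not 1 <= k <= len(train_labels):
        raise ValueError("k must be between 1 and the number of training objects")
    # Indices of the k closest training objects (distance ties broken by index).
    order = sorted(range(len(train_labels)), key=lambda i: (dist_to_train[i], i))
    nearest = order[:k]
    votes = Counter(train_labels[i] for i in nearest)
    top = max(votes.values())
    # Majority vote; a vote tie goes to the tied label with the closest neighbor.
    for i in nearest:
        if votes[train_labels[i]] == top:
            return train_labels[i]

# Hypothetical distances from one query graph to five training graphs.
distances = [0.12, 0.40, 0.09, 0.55, 0.31]
labels = ["A", "B", "A", "B", "B"]
prediction = knn_predict(distances, labels, k=3)
```

Because the classifier only needs pairwise distances, the same routine applies unchanged whether the distances come from GW or fGW.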


Using Probabilistic Programs to Train Inductive Reasoning in Large Language Models

arXiv.org Machine Learning

Post-training Large Language Models (LLMs) for reasoning typically focuses on deductive tasks such as mathematics and coding where correctness is verifiable. Yet, many real-world reasoning problems are inductive: agents must infer uncertain beliefs from sparse, ambiguous observations. There are challenges to using standard fine-tuning methods for inductive reasoning, including difficulties in curating large-scale, high-quality labeled datasets and in handling targets that are inherently distributional. In this work, we introduce a novel approach, called Program-based Posterior Training (PPT), to address these limitations: we use an LLM to generate diverse open-world scenarios as probabilistic programs, run probabilistic inference to produce distributional target responses to queries, and then fine-tune on these probabilistic soft labels. Using this approach, we fine-tune LLMs on 10,000 programmatically generated scenarios and evaluate on held-out motifs, human-labeled judgments, and external benchmarks. Overall, PPT substantially improves estimation accuracy on held-out inductive tasks, increases alignment with human judgments, and transfers to external benchmarks for estimation and calibration. Additionally, the gains in raw calibration are not subsumed by post-hoc temperature scaling, showing that the models have more deeply internalized uncertainty compared to output rescaling. Together, these results suggest that probabilistic-program-mediated fine-tuning is a promising approach for post-training LLMs to reliably perform approximate inductive inference.
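To make the idea of distributional targets concrete, here is a small sketch under stated assumptions: a toy scenario (a coin that is either fair or biased) stands in for a generated probabilistic program, exact enumeration stands in for the inference step, and a soft-label cross-entropy stands in for the fine-tuning loss. The scenario, function names, and numbers are hypothetical and are not drawn from the paper.

```python
import math

def posterior_by_enumeration(prior, likelihood, observations):
    """Exact posterior over a small discrete hypothesis set."""
    unnormalized = {}
    for hypothesis, p in prior.items():
        weight = p
        for obs in observations:
            weight *= likelihood(hypothesis, obs)
        unnormalized[hypothesis] = weight
    z = sum(unnormalized.values())
    return {h: w / z for h, w in unnormalized.items()}

def soft_cross_entropy(target, predicted):
    """Cross-entropy of a predicted distribution against a soft target."""
    return -sum(q * math.log(predicted[h]) for h, q in target.items() if q > 0)

# Toy scenario: which coin produced three heads in a row?
HEADS_PROB = {"fair": 0.5, "biased": 0.9}

def coin_likelihood(hypothesis, obs):
    p = HEADS_PROB[hypothesis]
    return p if obs == "H" else 1.0 - p

prior = {"fair": 0.5, "biased": 0.5}
target = posterior_by_enumeration(prior, coin_likelihood, ["H", "H", "H"])

# A model that answers 50/50 is penalized more than one matching the posterior.
uniform_guess = {"fair": 0.5, "biased": 0.5}
loss_uniform = soft_cross_entropy(target, uniform_guess)
loss_matched = soft_cross_entropy(target, target)
```

The soft target keeps the residual uncertainty (the fair coin remains possible) instead of collapsing to a single hard label.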


Range Penalization: Theoretical Insights with Applications in Federated Learning

arXiv.org Machine Learning

This paper introduces range regularization for federated learning with linear systematic components to enhance statistical accuracy and induce cross-client regularity conducive to quantization, coding, and resource efficiency. Our approach identifies features with shared weights across different clients and adaptively clusters the weights of personalized features at extreme values, a process we refer to as polar clustering. Theoretical analysis of the associated estimators poses significant challenges due to the seminorm nature and non-decomposability of the regularizer. We develop new proof techniques for the nonasymptotic analysis of statistical accuracy and faithful pattern recovery. Moreover, a fast optimization algorithm that leverages varying degrees of local strong convexity is proposed to reduce iteration complexity. Experiments support the efficacy and efficiency of the proposed approach.
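One plausible reading of a range penalty, offered here as an assumption rather than the paper's exact definition, is the sum over features of the gap between the largest and smallest weight that feature receives across clients. The sketch below illustrates why such a penalty is a seminorm (it is zero whenever all clients share a weight) and why it favors weights that sit at the two extremes. The name `range_penalty` and the example weights are hypothetical.

```python
def range_penalty(weights_by_client):
    """Sum over features of (max - min) of that feature's weight across clients.

    weights_by_client[k][j] is the weight of feature j on client k.
    """
    n_features = len(weights_by_client[0])
    total = 0.0
    for j in range(n_features):
        column = [w[j] for w in weights_by_client]
        total += max(column) - min(column)
    return total

# Three clients, three features.
clients = [
    [0.5, -1.0, 2.0],
    [0.5,  1.0, 2.0],
    [0.5,  1.0, 0.0],
]
penalty = range_penalty(clients)

# Shifting a shared feature by a constant on every client leaves the penalty
# unchanged, which is the seminorm behavior: shared weights cost nothing.
shifted = [[w[0] + 3.0, w[1], w[2]] for w in clients]
penalty_shifted = range_penalty(shifted)
```

In this example feature 0 is shared across clients and contributes nothing, while features 1 and 2 are personalized with every client sitting at either the minimum or the maximum.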


Edge of Stability Selectively Shapes Learning Across the Data Distribution

arXiv.org Machine Learning

Existing analyses of the edge of stability (EoS) treat it as a global property of optimization. We show that it is also selective: the stability constraint redistributes learning across subsets of the training distribution, amplifying progress on some groups while suppressing progress on others. Using a branching intervention that enters or exits the EoS regime from the same training state, we causally demonstrate this trade-off and identify two necessary conditions for a group to benefit. First, its aggregate gradient must align with the top Hessian eigenvector. We isolate this mechanism with a controlled perturbation that preserves distance but randomizes direction, destroying alignment and eliminating the advantage. Second, the group must sustain non-vanishing gradient magnitude over time. Under cross-entropy loss, gradient saturation decouples confidently classified groups, shifting the advantage to output-outliers, whose gradients persist. Together, these results show that EoS functions not only as a stability boundary, but as a mechanism governing the allocation of learning across the data distribution.
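The first condition, alignment between a group's aggregate gradient and the top Hessian eigenvector, can be illustrated with a small sketch. It assumes a tiny symmetric Hessian given explicitly and uses plain power iteration; the matrix, the group gradients, and the function names are made up for illustration and do not reproduce the paper's experiments.

```python
import math

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def norm(vector):
    return math.sqrt(sum(x * x for x in vector))

def top_eigenpair(hessian, iters=200):
    """Power iteration for a symmetric positive semidefinite matrix.

    Assumes the start vector is not orthogonal to the top eigenvector.
    """
    v = [1.0] + [0.0] * (len(hessian) - 1)
    for _ in range(iters):
        w = matvec(hessian, v)
        n = norm(w)
        v = [x / n for x in w]
    # Rayleigh quotient of the unit vector v gives the top eigenvalue (sharpness).
    sharpness = sum(a * b for a, b in zip(v, matvec(hessian, v)))
    return sharpness, v

def alignment(gradient, direction):
    """Absolute cosine between a group gradient and a direction."""
    dot = sum(a * b for a, b in zip(gradient, direction))
    return abs(dot) / (norm(gradient) * norm(direction))

# Toy Hessian with eigenvalues 4 and 2; the top eigenvector is (1, 1) / sqrt(2).
hessian = [[3.0, 1.0], [1.0, 3.0]]
sharpness, top_vec = top_eigenpair(hessian)

group_a_grad = [1.0, 0.9]    # nearly parallel to the top eigenvector
group_b_grad = [1.0, -1.0]   # orthogonal to it
align_a = alignment(group_a_grad, top_vec)
align_b = alignment(group_b_grad, top_vec)
```

In this toy setting group A is the one whose gradient lines up with the sharpest direction, while group B has no component along it.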


Conformal Risk Prediction for Non-Alcoholic Fatty Liver Disease Using Gradient Boosting with Distribution-Free Coverages

arXiv.org Machine Learning

Non-alcoholic fatty liver disease (NAFLD) affects roughly 25% of global adults, posing substantial hepatic and cardiovascular risks. Yet, population-level screening tools remain inadequate. We present Method, a machine-learning framework for NAFLD risk prediction coupling gradient-boosted decision trees with conformal prediction to yield calibrated, distribution-free coverage guarantees on individual risk estimates. It integrates a mutual-information-based stability selection procedure to identify a compact, clinically interpretable feature subset via bootstrap resampling, constructing prediction sets whose marginal coverage provably exceeds a user-specified confidence level. We evaluated Method on a multicenter cohort from Guangzhou, China (primary n=2,187; external validation n=412) using 78 candidate features across demographics, metabolic biomarkers, and lifestyle factors. Method achieves an AUROC of 0.912 internally and 0.891 externally, outperforming deep neural networks, TabNet, support vector machines, and logistic regression. Conformal prediction sets achieve 91.3% empirical coverage at the 90% nominal level. A three-tier risk stratification derived from these scores separates the population into distinct groups, with the high-risk subgroup showing a 12-month progression rate 4.7 times that of the low-risk tier. The selected features -- notably waist circumference, ALT, GGT, triglycerides, fasting glucose, and BMI -- align with established metabolic risk factors, providing biological plausibility.
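The coverage guarantee mentioned above can be illustrated with a minimal split conformal sketch for binary classification. It assumes a held-out calibration set, uses one minus the predicted probability of the true label as the nonconformity score, and treats the classifier as a black box that outputs a probability of disease. The score choice, function names, and calibration values are assumptions for illustration; the abstract does not specify them.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Finite-sample quantile of calibration nonconformity scores.

    Uses rank ceil((n + 1) * (1 - alpha)); under exchangeability the resulting
    prediction sets have marginal coverage of at least 1 - alpha.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: include every label
    return sorted(cal_scores)[rank - 1]

def prediction_set(prob_positive, threshold):
    """Labels whose nonconformity score (1 - probability) is within the threshold."""
    probs = {0: 1.0 - prob_positive, 1: prob_positive}
    return {label for label, p in probs.items() if 1.0 - p <= threshold}

# Hypothetical calibration scores: 1 - predicted probability of the true label.
cal_scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.35, 0.45, 0.60]
threshold = conformal_threshold(cal_scores, alpha=0.1)

confident_case = prediction_set(0.9, threshold)   # only the positive label
ambiguous_case = prediction_set(0.5, threshold)   # both labels are retained
```

A two-label set signals that the model cannot separate the classes at the requested confidence level, which is the behavior a screening tool would surface for borderline patients.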


Integrating Local and Global Entropy for Uncertainty Quantification in LLMs

arXiv.org Machine Learning

Existing uncertainty quantification methods for LLMs rely predominantly on token-level signals, leaving the geometric structure of intermediate hidden states underused. In this paper, we take the geometric complexity of hidden-state matrices as a measure of the global uncertainty of LLMs, while treating token-level uncertainty estimation as a local metric. We show that hidden-state geometric entropy (global uncertainty) and token-level entropy (local uncertainty) are statistically near-orthogonal, capturing distinct failure regimes for reliability prediction. In particular, global geometry recovers the confident-but-wrong failure mode that local signals systematically miss. Building on this, we propose Global-Local Uncertainty (GLU), an unsupervised, single-pass score that fuses the two signals via a multiplicative gate. Across three model families and six benchmarks, GLU matches or outperforms all unsupervised baselines while requiring only a single forward pass and remaining length-normalized and architecture-agnostic. Code is available at https://github.com/qcri/GLU.git.


Robust Active Learning for Few-Shot Example Selection in Text-to-SQL

arXiv.org Machine Learning

Few-shot example retrieval is the dominant paradigm for grounding large language models (LLMs) in domain-specific text-to-SQL systems. However, the quality of the annotated example bank directly governs system accuracy, and expert annotation is prohibitively expensive. We formalize the active selection of these examples as a constrained experimental design problem over the intrinsic, low-dimensional manifold of semantic query embeddings. Unlike standard active learning frameworks, our setting introduces three critical challenges: varying, query-dependent annotation reliability (heteroscedasticity), strict requirements for spatial diversity across semantic topics (partition matroid constraints), and the inherent reality that the true covariance structure of the embedding space is unknown (misspecification). To address these, we propose a stratified greedy algorithm that maximizes a heteroscedastic mutual information objective. We prove that this objective remains submodular and approximately monotonic on the intrinsic manifold, yielding a theoretical constant-factor approximation guarantee. We establish a spectral bound demonstrating that this approximation guarantee degrades gracefully, rather than catastrophically, when the assumed surrogate kernel diverges from the true underlying data-generating process. Empirical results demonstrate that the proposed strategy significantly reduces labeling effort while maintaining high text-to-SQL retrieval accuracy.


Disjoint or Overlapping? Inference Windowing for Reconstruction-Based Time Series Anomaly Detection

arXiv.org Machine Learning

Reconstruction-based methods are widely used for time series anomaly detection, where models are trained to reconstruct subsequences, and anomalies are identified through reconstruction errors. However, reported results are often hard to compare due to heterogeneous evaluation practices and underspecified inference procedures. In this paper, we revisit reconstruction-based anomaly detection in the univariate offline setting and study the role of the inference stride, which controls whether subsequences are processed as disjoint windows or with overlap. We propose a unified training, tuning, and multi-seed evaluation protocol on the curated TSB-AD benchmark, and study how overlapping inference affects anomaly detection performance for a range of reconstruction models, including PCA-based baselines, DLinear, an AutoEncoder, TimesNet, and Transformer variants. The results show that across all models, overlapping windows yield consistent improvements, with average relative gains of up to 28%, and can alter method rankings. We further analyze variability across datasets, random seeds, and hyperparameter configurations. Finally, we complement the benchmark study with an evaluation on the full UCR archive using localization criteria aligned with sliding-window reconstruction. Overall, our results highlight that reconstruction-based anomaly detection performance depends not only on model architecture and training, but also on inference choices, motivating a clear and reproducible protocol. Our results show that reconstruction-based baselines achieve strong performance on both TSB-AD and UCR benchmarks, supporting them as competitive and practical approaches for univariate time series anomaly detection.
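The role of the inference stride can be sketched as follows. This example assumes that per-point anomaly scores are obtained by averaging squared reconstruction errors over every window covering a point, and it uses a trivial stand-in model that reconstructs each window by its mean. Both the aggregation rule and the stand-in model are assumptions for illustration; the abstract does not state which aggregation the authors use.

```python
def window_starts(n, window, stride):
    """Start indices of inference windows over a series of length n.

    A final right-aligned window is added when needed so every point is covered.
    """
    if window > n:
        raise ValueError("window must not exceed the series length")
    starts = list(range(0, n - window + 1, stride))
    if starts[-1] != n - window:
        starts.append(n - window)
    return starts

def pointwise_scores(series, window, stride, reconstruct):
    """Average squared reconstruction error per point across covering windows."""
    n = len(series)
    total = [0.0] * n
    count = [0] * n
    for start in window_starts(n, window, stride):
        segment = series[start:start + window]
        recon = reconstruct(segment)
        for offset, (x, r) in enumerate(zip(segment, recon)):
            total[start + offset] += (x - r) ** 2
            count[start + offset] += 1
    return [t / c for t, c in zip(total, count)]

def mean_reconstruction(segment):
    """Stand-in model: reconstruct every point by the window mean."""
    m = sum(segment) / len(segment)
    return [m] * len(segment)

series = [0.0] * 5 + [10.0] + [0.0] * 6   # single spike at index 5
disjoint = pointwise_scores(series, window=4, stride=4, reconstruct=mean_reconstruction)
overlapping = pointwise_scores(series, window=4, stride=1, reconstruct=mean_reconstruction)
```

With `stride` equal to `window` each point is scored once; with `stride=1` interior points are scored by up to `window` different windows, so a point's score depends less on where it happens to fall inside a window.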


Conservation Laws from Data Symmetry in Neural Networks

arXiv.org Machine Learning

We explore whether intrinsic symmetries of the training data lead to conserved quantities during gradient-flow training of neural networks. Under the assumption that the loss function is analytic and non-polynomial, we prove that data symmetries generically do not induce any additional integrals of motion. For mean squared error (MSE) loss, on the other hand, there are situations in which data augmentation yields extra conserved quantities. We build a framework utilizing tensorizable networks to describe this phenomenon. Tensorizable networks are a family of architectures whose dependence on parameters and inputs can be separated using an intermediate representation. They include linear and polynomial networks, as well as Lightning Attention.

[Figure 1: A display of how data symmetry can give rise to conservation laws.]


Deterministic Denominator Design for Localized Tamed Stochastic-Gradient Langevin Dynamics

arXiv.org Machine Learning

Tamed stochastic-gradient Langevin dynamics (SGLD) stabilizes large drifts by adding a denominator to the update. If this denominator uses the same stochastic-gradient sample as the update step, it can also change the conditional mean drift. We study deterministic denominators: the state-dependent envelope is fixed before the current oracle sample is drawn. The main question is how to design this envelope in practice. The design starts from an oracle score, builds a low-cost proxy score on pilot states, chooses activation thresholds by empirical quantiles, and then applies a small calibration layer. The analysis tracks three steps: proxy and threshold errors become envelope errors; envelope errors perturb one SGLD step; and the local residuals give stationary errors through a conditional perturbation bridge. Experiments show that the proxy-quantile denominators are close to oracle-score behavior, avoid the random-denominator mean-shift channel, and improve on simple deterministic taming choices.
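The following one-dimensional sketch shows what a deterministic denominator can look like in code. It is an illustration under several assumptions that are not taken from the paper: the update form `theta - h * g / (1 + h * envelope(theta)) + sqrt(2h) * noise`, an envelope that switches on only above an empirical quantile of a proxy score computed on pilot states, and the toy potential `theta**4 / 4`. The calibration layer is omitted. The key property shown is that the denominator is computed from the state alone, before the stochastic gradient is sampled.

```python
import math
import random

def empirical_quantile(values, q):
    """Order-statistic quantile: the ceil(q * n)-th smallest value."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[idx]

def make_envelope(proxy_score, pilot_states, q):
    """Hypothetical localized envelope: zero below the pilot quantile tau."""
    tau = empirical_quantile([proxy_score(s) for s in pilot_states], q)
    def envelope(theta):
        return max(0.0, proxy_score(theta) - tau)
    return envelope, tau

def tamed_sgld_step(theta, stochastic_grad, envelope, step, rng):
    # The denominator depends only on the current state, so it is fixed
    # before the stochastic-gradient sample is drawn.
    denom = 1.0 + step * envelope(theta)
    grad_sample = stochastic_grad(theta, rng)
    noise = rng.gauss(0.0, 1.0)
    return theta - step * grad_sample / denom + math.sqrt(2.0 * step) * noise

def proxy_score(theta):
    return abs(theta) ** 3          # cheap proxy for the drift magnitude

def noisy_grad(theta, rng):
    return theta ** 3 + rng.gauss(0.0, 0.5)   # gradient of theta**4 / 4 plus noise

pilot_states = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
envelope, tau = make_envelope(proxy_score, pilot_states, q=0.7)

rng = random.Random(0)
theta = 10.0                         # start far out, where untamed steps would diverge
for _ in range(500):
    theta = tamed_sgld_step(theta, noisy_grad, envelope, step=0.1, rng=rng)
```

Because `denom` does not depend on `grad_sample`, the conditional mean of the drift is simply the mean gradient divided by a known constant, which is the mean-shift issue the abstract says random denominators can introduce.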