Goto

Collaborating Authors

 contamination


693e00827fd44bdfca210801fe1e6439-Paper-Position_Paper_Track.pdf

Neural Information Processing Systems

The meteoric rise of Artificial Intelligence (AI), with its rapidly expanding market capitalization, presents both transformative opportunities and critical challenges. Chief among these is the urgent need for a new, unified paradigm for trustworthy evaluation, as current benchmarks increasingly reveal critical vulnerabilities. Issues like data contamination and selective reporting by model developers fuel hype, while inadequate data quality control can lead to biased evaluations that, even if unintentionally, may favor specific approaches. As a flood of participants enters the AI space, this "Wild West" of assessment makes distinguishing genuine progress from exaggerated claims exceptionally difficult. Such ambiguity blurs scientific signals and erodes public confidence, much as unchecked claims would destabilize financial markets reliant on credible oversight from agencies like Moody's. In high-stakes human examinations (e.g., SAT, GRE), substantial effort is devoted to ensuring fairness and credibility; why settle for less in evaluating AI, especially given its profound societal impact? This position paper argues that a laissezfaire approach is untenable. For true and sustainable AI advancement, we call for a paradigm shift to a unified, live, and quality-controlled benchmarking framework--robust by construction rather than reliant on courtesy or goodwill.


Refining Norms: APost-hoc Framework for OOD Detection in Graph Neural Networks

Neural Information Processing Systems

Graph Neural Networks (GNNs) are increasingly deployed in mission-critical tasks, yet they often encounter inputs that lie outside their training distribution, leading to unreliable or overconfident predictions. To address this limitation, we present RAGNOR (Robust Aggregation Graph Norm for Outlier Recognition), a post-hoc approach that leverages embedding norms for robust out-of-distribution (OOD) detection on both node-level and graph-level tasks. Unlike previous methods designed primarily for image domains, RAGNOR directly tackles the relational challenges intrinsic to graphs: local contamination by anomalous neighbors, disparate norm scales across classes or roles, and insufficient references for boundary or low-degree nodes.


MathArena: Evaluating LLMs on Uncontaminated Math Competitions

Neural Information Processing Systems

The rapid advancement of reasoning capabilities in large language models (LLMs) has led to notable improvements on mathematical benchmarks. However, many of the most commonly used evaluation datasets (e.g., AIME 2024) are widely available online, making it difficult to disentangle genuine reasoning from potential memorization. Furthermore, these benchmarks do not evaluate proof-writing capabilities, which are crucial for many mathematical tasks. To address this, we introduce MATHARENA, a new benchmark based on the following key insight: recurring math competitions provide a stream of high-quality, challenging problems that can be used for real-time evaluation of LLMs. By evaluating models as soon as new problems are released, we effectively eliminate the risk of contamination.


MathArena: Evaluating LLMs on Uncontaminated Math Competitions

Neural Information Processing Systems

The rapid advancement of reasoning capabilities in large language models (LLMs) has led to notable improvements on mathematical benchmarks. However, many of the most commonly used evaluation datasets (e.g., AIME 2024) are widely available online, making it difficult to disentangle genuine reasoning from potential memorization. Furthermore, these benchmarks do not evaluate proof-writing capabilities, which are crucial for many mathematical tasks.To address this, we introduce MathArena, a new benchmark based on the following key insight: recurring math competitions provide a stream of high-quality, challenging problems that can be used for real-time evaluation of LLMs. By evaluating models as soon as new problems are released, we effectively eliminate the risk of contamination.Using this framework, we find strong signs of contamination in AIME 2024. Nonetheless, evaluations on harder competitions, such as CMIMC 2025, demonstrate impressive reasoning capabilities in top-performing models.MathArena is also the first benchmark for proof-writing capabilities. On IMO 2025, top models achieve slightly less than 40\%, demonstrating both notable progress and significant room for improvement.So far, we have evaluated over $50$ models across seven competitions, totaling $162$ problems. As an evolving benchmark, MathArena will continue to track the progress of LLMs on newly released competitions, ensuring rigorous and up-to-date evaluation of mathematical reasoning.


Joint Model and Data Sparsification via the Marginal Likelihood

arXiv.org Machine Learning

Sparse recovery in linear systems underpins applications from signal processing to high-dimensional regression. Sparse Bayesian Learning, grounded in the principle of automatic relevance determination (ARD), offers a practical Bayesian mechanism for feature sparsity via marginal likelihood optimization. Yet, its reliance on a homoscedastic noise model renders it sensitive to data contaminations such as outliers or misspecified noise, harming model fit and predictions. Instead, we propose jointly learning individual feature and sample relevancies, enabling simultaneous model and data sparsification via a single Bayesian objective. This symmetric pruning of model and data offers a natural extension that preserves conjugacy, admits closed-form updates for standard optimization procedures, and aligns with perspectives from robust regression and influence functions. Empirical results across diverse regression tasks affirm that a joint ARD approach consistently yields both sparse and robust prediction models.


On the Sample Complexity of Robust Binary Hypothesis Testing

arXiv.org Machine Learning

We study the sample complexity of robust binary hypothesis testing under three standard contamination models: $\varepsilon$-additive (Huber), $\varepsilon$-subtractive, and $\varepsilon$-total variation (TV), denoted by $n^*_{\mathrm{Hub}}(\varepsilon)$, $n^*_{\mathrm{Sub}}(\varepsilon)$, and $n^*_{\mathrm{TV}}(\varepsilon)$, respectively. For subtractive contamination, we show that least favourable distributions exist and provide explicit formulas for the same, bringing this model in line with the classical Huber and TV models. Next we show that in all three models, sample complexity may be highly unstable in the contamination parameter $\varepsilon$, increasing by polynomial factors even for $o(\varepsilon)$ perturbations. Similarly, there may be polynomial factor gaps between the sample complexities when $\varepsilon$ is known exactly versus when it is known up to $o(\varepsilon)$ error. Despite the instability of the sample complexity in all models, we show that the sample complexities across models are comparable up to constant-factor rescaling of $\varepsilon$. Specifically, for any fixed $δ_0>0$, the following hold for all distributions $p$ and $q$: (i) $n^*_{\mathrm{Hub}}(\varepsilon) \lesssim n^*_{\mathrm{TV}}(\varepsilon) \lesssim n^*_{\mathrm{Hub}}(2\varepsilon)$, (ii) $n^*_{\mathrm{Sub}}(\varepsilon) \lesssim n^*_{\mathrm{TV}}(\varepsilon) \lesssim n^*_{\mathrm{Sub}}((2+δ_0)\varepsilon)$, and (iii) $n^*_{\mathrm{Sub}}(\varepsilon) \lesssim n^*_{\mathrm{Hub}}(\varepsilon) \lesssim n^*_{\mathrm{Sub}}((1+δ_0)\varepsilon)$, and the scaling constants are tight. Finally, we extend our results to adaptive versions of the contamination models.


Mean-Shift PCA by Knockoff Mean

arXiv.org Machine Learning

Removing noise is difficult, but adding noise is easy. In this work, we show how to eliminate mean-shift noisy components from PCA by deliberately introducing knockoff mean-shift perturbation. Standard PCA is highly sensitive to shifts in the sample mean: a small fraction of samples from a shifted distribution can cause large deviations in the leading principal components. In high-dimensional regimes, existing Robust PCA approaches cannot handle the mean-shift contamination structure inherent in the mixture model. Using tools from Random Matrix Theory, we prove that the mean-shift spikes are spectrally separable from the stable eigenvalues of the original covariance. Furthermore, the original eigenspace remains asymptotically invariant to the contamination, independent of the mixture weight. Exploiting this spectral stability, we propose a simple, two-stage PCA algorithm by adding knockoff mean that identifies and removes the mean-shift component using only standard PCA operations.


Three Costs of Amortizing Gaussian Process Inference with Neural Processes

arXiv.org Machine Learning

Neural processes amortize Gaussian process inference, replacing the exact $O(n^3)$ posterior with a learned $O(n)$ map from context sets to predictive distributions. For a class of latent neural processes, we bound the Kullback--Leibler (KL) divergence between the GP and LNP predictives, decomposing it into three interpretable sources, namely label contamination as the neural process uses label values to estimate a quantity that is label-independent in the exact GP, an information bottleneck because the finite-dimensional representation cannot resolve the full context geometry, and amortization error from a single encoder network shared across all contexts. The bottleneck truncation term decays in the representation dimension $d$ as $O(e^{-cd^{2/d_x}})$ for squared-exponential kernels on $\mathbb{R}^{d_x}$ where $c > 0$ is a kernel-dependent constant and as $O(d^{-2ν/d_x})$ for Matérn-$ν$ kernels, directly linking architecture sizing to kernel smoothness and input dimension. The label contamination term is $O(1)$ in general, with only the observation-noise component decaying as $O(1/n)$, identifying a persistent cost of routing uncertainty estimation through a label-dependent representation. These results characterize the costs of amortization within the analyzed class and yield architectural recommendations to predict variance from context locations alone in the GP-amortization regime, and replace mean aggregation with second-order pooling to close the dominant amortization gap.


When Does Trimming Help Conformal Prediction? A Retained-Law Diagnostic under Calibration Contamination

arXiv.org Machine Learning

Trimming suspicious calibration points is a common response to contamination in conformal prediction. Its effect on clean-target coverage, however, is governed by the retained law induced by trimming, not by the contamination level alone. We analyse fixed-threshold trimming as conditioning rather than purification. It replaces the contaminated calibration law with a retained law, reducing clean-target coverage to a one-dimensional score-CDF transfer problem with an exact finite-sample identity. A componentwise bound on the transfer gap gives a population-level diagnostic. This separates a clean-side covariance cost from a retained-contamination cost, governed by the dirty-to-clean retention ratio. Trimming helps when the anomaly score separates retention probabilities while remaining score-neutral on the clean population. Otherwise, it cannot substantially reduce contamination through the retained mixture coefficient. We also give finite-sample certificate templates that provide numerical guarantees under independent audit.


Smart Ensemble Learning Framework for Predicting Groundwater Heavy Metal Pollution

arXiv.org Machine Learning

Groundwater in the Densu Basin is increasingly threatened by heavy metal contamination, but conventional methods fail to capture the statistical complexity and spatial heterogeneity of pollution indicators. A key challenge is modelling the Heavy Metal Pollution Index (HPI), which is typically skewed and affected by correlated contaminants, leading to biased predictions without transformation. This study develops a predictive framework integrating response transformations with nested cross-validated ensemble machine learning. Three transformations (raw, log, and Gaussian copula) were applied to HPI and evaluated across six learners: support vector regression (SVM), $k$-nearest neighbours (k-NN), CART, Elastic Net, kernel ridge regression, and a stacked Lasso ensemble. Raw-scale models produced deceptively high fits (Elastic Net and stacked ensemble $R^2 \approx 1.0$), suggesting over-optimism. The log transformation stabilised variance (SVM: $R^2 = 0.93$, RMSE $= 0.18$; k-NN: $R^2 = 0.92$, RMSE $= 0.20$). The Gaussian copula gave the most reliable results: stacked ensemble $R^2 = 0.96$ (RMSE $= 0.19$), with other learners maintaining high accuracy. Copula-based models improved residuals and produced spatially plausible maps. DBSCAN clustering revealed Fe and Mn as primary HPI contributors, consistent with regional hydrogeochemistry. Limitations include reliance on random (not spatial) cross-validation and basin-specific scope. Future work should explore spatial validation and other geological settings. Overall, distribution-aware ensembles with clustering diagnostics offer robust, interpretable assessments of groundwater contamination.