 corruption


The Fundamental Limits of Fraud Detection in Card Payment Networks

arXiv.org Machine Learning

Card payment fraud detection is usually framed as a supervised classification problem. Although this approach has generated practical progress, improvement has remained incremental despite major advances in model architecture. We argue that this is not mainly a failure of function approximation or optimization, but a consequence of structural information impairments inherent to the payment ecosystem. We formalize card authorization as a sequential decision problem with delayed, censored, corrupted, and counterfactually missing feedback. We derive a minimax regret lower bound showing that these impairments enter multiplicatively in the denominator of the achievable learning rate. The bound implies that improving issuer reporting quality or reducing censorship can yield larger reductions in the regret floor than increasing model complexity. We also show that heterogeneity across issuers worsens learnability beyond what average impairment rates suggest. The paper contributes a theory of why fraud detection in payment networks is fundamentally harder than in standard online learning settings, identifies ecosystem information quality as the key bottleneck, and provides a theoretical basis for prioritizing investments in reporting infrastructure, dispute process quality, and selective exploration. The paper is theory-first and does not rely on proprietary transaction data.


Robust Statistical Estimators with Bounded Empirical Sensitivity

arXiv.org Machine Learning

We introduce a new measure of robustness for statistical estimators, which we call \emph{empirical sensitivity}. An estimator $\hat{\theta}$ has bounded empirical sensitivity if, with high probability over a dataset $X = (X_1, \dots, X_n) \sim \mathcal{D}^{\otimes n}$, for any dataset $Y$ obtained by modifying at most $\eta n$ points in $X$, we have that $\hat{\theta}(Y)$ is close to $\hat{\theta}(X)$. We study bounds on this quantity for the prototypical problem of Gaussian mean estimation. We prove new lower bounds, showing that for any estimator $\hat{\mu}$ which achieves an optimal $\ell_2$-error bound of $O\left(\sqrt{d/n}\right)$, the empirical sensitivity is at least $\Omega\left(\eta + \sqrt{\eta d/n}\right)$. The two terms arise due to obstructions on the mean and variance (via an Efron-Stein argument) of such an estimator. We show that this bound is tight up to logarithmic factors, by employing recent results for robust empirical mean estimation.
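To make the definition concrete, the following is a minimal one-dimensional sketch, not the paper's construction. It measures how far an estimator moves under one particular modification of an $\eta$ fraction of the points, which only lower-bounds the sensitivity (the definition takes the worst case over all such modifications). The helper name `shift_after_corruption` and the choice of replacing points with a single outlier value are assumptions made for illustration.

```python
import statistics


def shift_after_corruption(estimator, data, eta, outlier):
    """Shift of `estimator` when the first floor(eta * n) points are replaced.

    Hypothetical helper: it tries ONE modification of at most eta * n points,
    so the returned value is only a lower bound on the worst-case shift.
    """
    n_bad = int(eta * len(data))
    corrupted = [outlier] * n_bad + list(data[n_bad:])
    return abs(estimator(corrupted) - estimator(data))


if __name__ == "__main__":
    data = [0.0] * 10
    # The sample mean moves in proportion to the outlier value,
    # while the median ignores a single replaced point here.
    print(shift_after_corruption(statistics.fmean, data, 0.1, 100.0))
    print(shift_after_corruption(statistics.median, data, 0.1, 100.0))
```

The sample mean is shown only as a familiar estimator whose shift grows with the outlier magnitude; the estimators analysed in the paper are not reproduced here.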


Tippett-minimum Fusion of Representation-space Diffusion Models for Multi-Encoder Out-of-Distribution Detection

arXiv.org Machine Learning

We address out-of-distribution (OOD) detection across the full spectrum of distribution shifts -- global domain changes, semantic divergence, texture differences, and covariate corruptions -- through a multi-encoder fusion of per-encoder representation-space diffusion models (RDMs). We statistically identify each encoder's sensitivity to specific shift types from ID data alone and introduce EncMin2L -- an encoder-agnostic two-level $\min(\cdot)$-gate that combines and calibrates per-encoder diffusion-based likelihood detectors without OOD labels, outperforming monolithic multi-encoder baselines at $2.3\times$ lower parameter cost. Two ID-data diagnostics -- $\eta^2$ (class-conditional F-test) and $\Delta\mu$ (log-likelihood shift under synthetic corruptions) -- quantify encoder specialization, while a Tippett minimum $p$-value combination aggregates per-encoder scores into a single, calibration-stable OOD signal. EncMin2L achieves $\geq 0.94$ AUROC across all four shift types simultaneously, outperforming the state-of-the-art representation-space diffusion OOD detectors across overlapping benchmarks.
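The Tippett combination mentioned above can be sketched as follows. This shows only the classical rule (take the smallest of $k$ $p$-values and, assuming independence under the null, report $1 - (1 - p_{\min})^k$), not EncMin2L itself. The helper `empirical_p_value`, and the convention that a lower log-likelihood means "more OOD", are assumptions made for illustration.

```python
def empirical_p_value(score, calibration_scores):
    """Rank-based p-value of `score` against held-out ID scores.

    Assumption: lower scores (e.g. log-likelihoods) are more anomalous,
    so the p-value is the smoothed fraction of ID scores at or below it.
    """
    at_or_below = sum(1 for s in calibration_scores if s <= score)
    return (1 + at_or_below) / (len(calibration_scores) + 1)


def tippett_combine(p_values):
    """Tippett's minimum-p combination for k independent p-values."""
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(p < 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in [0, 1]")
    k = len(p_values)
    p_min = min(p_values)
    # Under the null, min of k independent uniforms has CDF 1 - (1 - x)^k.
    return 1.0 - (1.0 - p_min) ** k


if __name__ == "__main__":
    # One p-value per (hypothetical) encoder; a single small one dominates.
    print(tippett_combine([0.01, 0.8, 0.9]))
```

Because the combined value depends only on the smallest $p$-value, one encoder that is sensitive to a given shift type can flag an input even when the others are uninformative.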


The Venture-Capital Populist

The Atlantic - Technology

This story appears in the June 2026 print edition. The courtship between Silicon Valley and MAGA was consummated on June 6, 2024, in San Francisco's Pacific Heights neighborhood, on a street known as "Billionaires' Row," at the 22,000-square-foot, $45 million French-limestone mansion of a venture capitalist named David Sacks. Along with Chamath Palihapitiya, a fellow venture capitalist and a colleague on the podcast, Sacks hosted a fundraiser for Donald Trump. He knew that other technology titans were coming around to the ex-president but remained in the closet. "And I think that this event is going to break the ice on that," Sacks said on the podcast the week before the fundraiser. "And maybe it'll create a preference cascade, where all of a sudden it becomes acceptable to acknowledge the truth." A few years earlier, Sacks had described the January 6, 2021, riot at the U.S. Capitol as an "insurrection" and pronounced Trump "disqualified" from ever again holding national office. "What Trump did was absolutely outrageous, and I think it brought him to an ignominious end in American politics," he said on the podcast a few days after the event. "He will pay for it in the history books, if not in a court of law." Palihapitiya was more colloquial, calling Trump "a complete piece-of-shit fucking scumbag." These might seem like tricky positions to climb down from--but the path that leads from scathing denunciation through gradual accommodation to sycophantic embrace of Trump is a well-worn pilgrimage trail. The journey is less wearisome for self-mortifiers who never considered democracy (a word seldom spoken on the podcast) all that important in the first place.


Ambient Diffusion: Learning Clean Distributions from Corrupted Data

Neural Information Processing Systems

We present the first diffusion-based framework that can learn an unknown distribution using only highly corrupted samples. This problem arises in scientific applications where access to uncorrupted samples is impossible or expensive to acquire. Another benefit of our approach is the ability to train generative models that are less likely to memorize any individual training sample, since they never observe clean training data. Our main idea is to introduce additional measurement distortion during the diffusion process and require the model to predict the original corrupted image from the further corrupted image. We prove that our method leads to models that learn the conditional expectation of the full uncorrupted image given this additional measurement corruption. This holds for any corruption process that satisfies some technical conditions (and in particular includes inpainting and compressed sensing). We train models on standard benchmarks (CelebA, CIFAR-10 and AFHQ) and show that we can learn the distribution even when all the training samples have 90% of their pixels missing. We also show that we can finetune foundation models on small corrupted datasets (e.g. MRI scans with block corruptions) and learn the clean distribution without memorizing the training set.
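For the random-inpainting case, the "further corruption" idea can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' implementation: images are flat lists, corruption is a 0/1 pixel mask, the extra drop probability `delta` and both function names are hypothetical, and the network and the diffusion noise schedule are omitted entirely.

```python
import random


def further_corrupt(mask, delta, rng):
    """Drop each still-observed pixel with probability `delta`.

    The result is always a subset of the original mask: pixels that were
    already missing stay missing (assumed inpainting-style corruption).
    """
    return [m if (m == 0 or rng.random() >= delta) else 0 for m in mask]


def masked_mse(prediction, target, mask):
    """Mean squared error restricted to pixels observed under `mask`.

    The model would see the input under the further-corrupted mask but be
    scored on every pixel the ORIGINAL mask observed, so it has to fill in
    pixels it was not shown.
    """
    errors = [(p - t) ** 2 for p, t, m in zip(prediction, target, mask) if m]
    return sum(errors) / len(errors) if errors else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    mask = [1, 1, 0, 1, 0, 1, 1, 1]
    print(further_corrupt(mask, 0.25, rng))
```

In training, the loss would be evaluated with the original mask while the model input uses the output of `further_corrupt`; the clean image never appears in either place.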


00482b9bed15a272730fcb590ffebddd-Supplemental.pdf

Neural Information Processing Systems

A.1 Training dataset pre-processing. We used 40,000 publicly available videos from YouTube which were available in a spatial resolution of at least 1920 × 1080 pixels. In an attempt not to skew the distribution of content too far from what may inform biological representation learning, we excluded most artificial content such as screenshots and videos of computer games. To reduce video compression artifacts and prevent systematic downsampling artifacts, each segment was then spatially downsampled to a randomized height between 128 and 160. Each segment was then separated into 15 pairs of neighboring frames, and a randomly placed, but spatially colocated patch of 64 × 64 pixels was cropped out of each frame pair. The order of the frame pairs was then randomized in a running buffer, and all RGB pixel values were normalized to the range between 0 and 1 before being fed into the model.
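The cropping, normalization and shuffling steps described above could be sketched as follows. Several details are assumptions rather than facts from the supplement: frames are nested lists of 8-bit RGB pixels, a segment supplies 30 frames that are split into non-overlapping pairs, the randomized downsampling is assumed to have happened already, and the function names and buffer size are hypothetical.

```python
import random

PATCH = 64               # patch side length in pixels (from the text)
PAIRS_PER_SEGMENT = 15   # frame pairs per segment (from the text)


def crop_and_normalize(frame, top, left, size=PATCH):
    """Cut a size x size patch and scale 8-bit RGB values into [0, 1]."""
    return [[[c / 255.0 for c in px] for px in row[left:left + size]]
            for row in frame[top:top + size]]


def segment_to_pairs(frames, rng):
    """Split a segment into pairs of neighboring frames with colocated crops.

    Assumption: pairs are non-overlapping, i.e. frames (0, 1), (2, 3), ...
    """
    if len(frames) < 2 * PAIRS_PER_SEGMENT:
        raise ValueError("segment too short for 15 frame pairs")
    height, width = len(frames[0]), len(frames[0][0])
    pairs = []
    for i in range(PAIRS_PER_SEGMENT):
        # One random position per pair, shared by both frames of the pair.
        top = rng.randint(0, height - PATCH)
        left = rng.randint(0, width - PATCH)
        pairs.append((crop_and_normalize(frames[2 * i], top, left),
                      crop_and_normalize(frames[2 * i + 1], top, left)))
    return pairs


def shuffle_buffer(items, buffer_size, rng):
    """Yield items in randomized order using a bounded running buffer."""
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))
```

A pipeline along these lines would chain `segment_to_pairs` over all segments and pass the resulting stream through `shuffle_buffer`, so that pairs from the same video are spread out without holding the whole dataset in memory.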