Goto

Collaborating Authors

 coupling


Saddle Networks: Structure-Preserving Architectures for Convex-Concave Functions

arXiv.org Machine Learning

Saddle-point models arise throughout optimization, optimal transport, robust learning, and control. In many applications, the relevant function f(x,y) is convex in x and concave in y, and preserving this geometry is essential for obtaining tractable min--max formulations and reliable certificates. We introduce a structured separable decomposition that preserves the convex-concave geometry and prove a complete one-dimensional approximation theorem under a mixed Monge-type convexity condition. We then describe practical saddle network architectures that preserve convexity in x and concavity in y by construction. The proposed architectures require only convexity-preserving neural networks, together with simple output transformations enforcing sign and concavity constraints. Finally, we report numerical benchmarks in dimension 1 and 5, showing that the proposed saddle networks achieve high accuracy on smooth, nonsmooth, and high-rank convex--concave test functions.


Counterfactually Fair Regression via Optimal Transport

arXiv.org Machine Learning

We consider the problem of learning a counterfactually fair regressor. We adopt a causal uncertainty view in which counterfactual fairness is defined with resampled noise. We focus on obtaining theoretical fairness guarantees for a new post-processing estimator. We begin by showing that counterfactual fairness is equivalent to satisfying demographic parity conditional on the latent variable. This allows us to provide a closed-form expression of the optimal fair regressor via a barycentric quantile map. In order to handle continuous latent variables, we propose a discretized post-processing method. Then, under mild regularity assumptions, we prove high-probability finite-sample fairness guarantees for our estimator, providing an unfairness decay at rate $\tilde O(n^{-1/3})$, and establishing a matching risk bound of order $\tilde O(n^{-1/3})$. We provide a matching lower bound on the excess risk of almost fair predictions. Finally, we extend our results to the setting of relaxed counterfactual fairness. We validate our approach on real-world and synthetic data.


Quadratically Regularized Optimal Transport: Localization Bounds and Affine Case Analysis

arXiv.org Machine Learning

Quadratic regularization has emerged as a potential alternative to the popular entropic regularization in computational optimal transport, offering the theoretical advantage of producing sparse couplings through its hinge density structure. Despite recent progress in one-dimensional settings and general upper bounds, fundamental questions about the localization rate of QOT optimizers around the Monge coupling have remained open. In this work, we establish a general lower bound showing that the support of the QOT optimizer cannot concentrate around the Monge graph faster than order $\varepsilon^{\frac{1}{d+2}}$ in the directed Hausdorff distance, matching the conjectured optimal exponent under standard regularity assumptions in \citet{wiesel2025sparsity}. We also show that the QOT value gap controls the mean-squared deviation $\mathbb E_{ฯ€_\varepsilon}\|y-T(x)\|^2$ by the scale of $\varepsilon^{\frac{2}{d+2}}$. As a corollary, in the affine Brenier regime, which includes Gaussian-to-Gaussian transport, we derive a sharp pointwise tube bound of order $\varepsilon^{\frac{1}{d+2}}$ by reducing the problem to self-transport and applying recent self-transport sparsity results. Finally, we validate our theoretical bound with a synthetic experiment in high-dimensional settings.


Sliced-Regularized Optimal Transport

arXiv.org Machine Learning

We propose a new regularized optimal transport (OT) formulation, termed sliced-regularized optimal transport (SROT). Unlike entropic OT (EOT), which regularizes the transport plan toward an independent coupling, SROT regularizes it toward a smoothened sliced OT (SOT) plan. To the best of our knowledge, SROT is the first approach to leverage a version of SOT plan as a reference to improve classical OT. We provide a formal definition of SROT, derive its dual formulation, and provide a post-Bayesian interpretation of SROT. We then develop a Sinkhorn-style algorithm for efficient computation, retaining the same scalability advantages as EOT. By incorporating a scalable SOT plan as a prior, SROT yields more accurate approximations of the exact OT plan than EOT under the same level of regularization. Moreover, the resulting transport plan improves upon the reference SOT plan itself. We further introduce the corresponding OT divergence induced by SROT, named SROT divergence, and analyze its topological and computational properties. Finally, we validate our approach through experiments on synthetic datasets and color transfer tasks, demonstrating that SROT is better than both EOT and SOT in approximating exact OT. Additional experiments on gradient flows further highlight the advantages of SROT divergence.


Optimizing Computational-Statistical Runtime for Wasserstein Distance Estimation

arXiv.org Machine Learning

Squared Wasserstein distance is a frequently used tool to measure discrepancy between probability distributions. This distance is typically computed between empirical measures of size $n$ from two underlying random samples. Unfortunately, even in lower dimensional Euclidean space problems $\left( d \in \{2,3\} \right)$, algorithms for Wasserstein distance computation with approximate or exact precision guarantees scale poorly in the runtime as a function of $n$ and the desired precision. In response, we consider the computational-statistical runtime, where the goal is to estimate from samples the Wasserstein distance between potentially smooth measures up to $ฮต$-additive error in expectation with respect to the sampling; we allow $O(1)$ computational cost for collecting a sample. Towards this, we develop a Sample-Sketch-Solve paradigm where we introduce a regular cartesian grid sketch of the samples. We show that (especially under $ฮฑ$-Hรถlder smooth distributions) this can compress the data without increasing asymptotic error, and also regularizes the structure which enables faster exact algorithms. Ultimately, we approximate $W_2^2(P,Q)$ within $ฮต$ error in $ฮต^{-\max(2,\frac{d+1+o(1)}{1+ฮฑ})}$ time for $0 < ฮฑ< 1$ Hรถlder smooth distributions $P,Q$ on $(0,1)^{d}$; an optimal $ฮ˜(ฮต^{-2})$ for $ฮฑ> 1/2$ when $d=2$ and nearly optimal as $ฮฑ\to 1$ when $d = 3$.


Reasoning Models Don't Just Think Longer, They Move Differently

arXiv.org Machine Learning

Reasoning-trained language models often spend more tokens on harder problems, but longer chains of thought do not show whether a model is merely computing for more steps or following a different internal trajectory. We study this distinction through hidden-state trajectories during chain-of-thought generation across competitive programming, mathematics, and Boolean satisfiability. Raw trajectory geometry is strongly shaped by generation length: longer generations mechanically alter path statistics, so difficulty-dependent comparisons are misleading without adjustment. After residualizing trajectory statistics on length, difficulty remains systematically coupled to corrected trajectory geometry across all domains studied. The clearest reasoning-specific separation appears in the code domain, where harder problems show more direct corrected trajectories and less heterogeneous local curvature in reasoning-trained models than in matched instruction-tuned baselines. Corrected difficulty-geometry coupling is weaker, but still present, in mathematics and Boolean satisfiability. Prompt-stage linear probes do not mirror the code-domain separation, and behavioral annotations show that stronger corrected coupling co-occurs with strategy shifts and uncertainty monitoring. Together, these findings establish length correction as a prerequisite for generation-time trajectory analysis and show that reasoning training can be associated with distinct corrected trajectory geometry, with the strength of the effect depending on the domain.


Breaking the Finite-Sample Barrier in Entropy Coupling

arXiv.org Machine Learning

Dependence among marginally constrained observations can break a finite-sample barrier. To formalize this phenomenon, we introduce the \emph{minimum list entropy coupling} $H(P\|Q_1,\dots,Q_m)$, the minimum conditional entropy $H(X|Y_1,\dots,Y_m)$ over all joint distributions with prescribed discrete marginals $X\sim P$ and $Y_i\sim Q_i$. Unlike classical formulations based on independent observations, our model allows $Y_1,\dots,Y_m$ to be arbitrarily dependent while keeping each marginal fixed. This enlarged coupling space reveals a sharp dichotomy: independent observations reduce residual uncertainty exponentially, whereas dependent observations can eliminate it exactly after finitely many samples. We characterize this zero-entropy regime through necessary and sufficient conditions and give concrete structural criteria under which it occurs. In particular, under mild support assumptions, zero entropy is achieved with $O(\log(1/P_{\min}))$ observations, where $P_{\min}$ is the minimum nonzero mass of $P$. We also develop a greedy algorithm with monotone approximation guarantees for computing $H(P\|Q_1,\dots,Q_m)$. Finally, we show that the same framework formalizes finite-sample limits in distribution-matching representation learning and randomness extraction, where zero entropy corresponds to exact recovery and exact extraction.


Finite-size scaling of hetero-associative retrieval in continuous-signal-driven Ising spin systems

arXiv.org Machine Learning

Kosko's Bidirectional Associative Memory [17] first formalised this idea for two layers, showing that stable recallContent-addressable memory--the recovery of a complete stored record from a partial or degraded cue--is aarises from the same energy-descent principle as in Hopcornerstone of neural computation and a paradigmaticfield networks but across two distinct pattern spaces: a problem in the statistical mechanics of disordered sys-cue presented to one layer drives the other toward the tems. The Hopfield model [1] demonstrated that binarymatching stored pattern, enabling cross-modal compleNtion. Multi-species spin-glass analyses [18] subsequentlypatterns in { 1,+1} can be stored as fixed-point attractors of an energy landscape shaped by Hebbian couplings, provided a rigorous thermodynamic foundation for arwhile Little's earlier stochastic formulation [2] cast thechitectures with an arbitrary number of interacting popsame architecture in the language of equilibrium statisti-ulations, generalising the classical single-species phase cal mechanics through parallel probabilistic updates.


Coupling-Informed Transport Maps for Bayesian Filtering in Nonlinear Dynamical Systems

arXiv.org Machine Learning

A likelihood-free transport filtering method is proposed based on the couplings between state and observation variables. By exploiting a block-triangular structure in the transport map, the analysis step of filtering is reformulated as the minimization of the maximum mean discrepancy (MMD) between the true joint measure and its transport-based approximation. To circumvent the non-convexity in the MMD optimization, we introduce a training-free transport filter method via gradient flows, which leads to an analytic computation for the transport map that implies the steepest descent direction of the MMD. The proposed approach accurately approximates non-Gaussian filtering posteriors and avoids particle collapse. We provide a convergence analysis for the expectation of the MMD between the approximated posterior and the truth posterior. Finally, we extend the method to high-dimensional problems through domain localization. Numerical examples demonstrate the superior performance of our approach over conventional filtering methods in nonlinear, non-Gaussian scenarios.


Couple to Control: Joint Initial Noise Design in Diffusion Models

arXiv.org Machine Learning

Diffusion models typically generate image batches from independent Gaussian initial noises. We argue that this independence assumption is only one choice within a broader class of valid joint noise designs. Instead, one can specify a coupling of the initial noises: each noise remains marginally standard Gaussian, so the pretrained diffusion model receives the same single-sample input distribution, while the dependence across samples is chosen by design. This reframes initial-noise control from selecting or optimizing individual seeds to designing the dependence structure of a multi-sample gallery. This view gives a general framework for initial-noise design, covering several existing methods as special cases and leading naturally to new coupled-noise constructions. Coupled noise can improve generation on its own without adding sampling cost, and it is flexible enough to serve as a structured initialization for optimization-based pipelines when additional computation is available. Empirically, repulsive Gaussian coupling improves gallery diversity on SD1.5, SDXL, and SD3 while largely preserving prompt alignment and image quality. It matches or outperforms recent test-time noise-optimization baselines on several diversity metrics at the same sampling cost as independent generation. Subspace couplings also support fixed-object background generation, producing diverse, natural backgrounds compared with specialized inpainting baselines, with a tunable trade-off in foreground fidelity.