Fisher Flow Matching for Generative Modeling over Discrete Data
Generative modeling over discrete data has recently seen numerous success stories, with applications spanning language modeling, biological sequence design, and graph-structured molecular data. The predominant generative modeling paradigm for discrete data is still autoregressive, with more recent alternatives based on diffusion or flow-matching falling short of their impressive performance in continuous data settings, such as image or video generation. In this work, we introduce Fisher-Flow, a novel flow-matching model for discrete data. Fisher-Flow takes a manifestly geometric perspective by considering categorical distributions over discrete data as points residing on a statistical manifold equipped with its natural Riemannian metric: the \emph{Fisher-Rao metric}. As a result, we demonstrate that discrete data itself can be continuously reparameterised to points on the positive orthant of the $d$-hypersphere $\mathbb{S}^d_+$, which allows us to define flows that map any source distribution to a target in a principled manner by transporting mass along (closed-form) geodesics of $\mathbb{S}^d_+$. Furthermore, the learned flows in Fisher-Flow can be further bootstrapped by leveraging Riemannian optimal transport, leading to improved training dynamics. We prove that the gradient flow induced by Fisher-Flow is optimal in reducing the forward KL divergence. We evaluate Fisher-Flow on an array of synthetic and diverse real-world benchmarks, including designing DNA Promoter and DNA Enhancer sequences. Empirically, we find that Fisher-Flow improves over prior diffusion and flow-matching models on these benchmarks.
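The reparameterisation and geodesic transport described in this abstract can be illustrated with a minimal NumPy sketch. It assumes the standard square-root map from the probability simplex to the sphere and the usual great-circle (slerp) formula for sphere geodesics. The function names `to_sphere` and `geodesic` are hypothetical, and the sketch does not reproduce the training procedure or the optimal transport coupling of Fisher-Flow.

```python
import numpy as np

def to_sphere(p):
    """Map a categorical distribution p (non-negative, sums to 1) to the
    positive orthant of the unit sphere via the square-root map (assumed)."""
    return np.sqrt(p)

def geodesic(x0, x1, t):
    """Point at time t in [0, 1] on the great-circle arc from x0 to x1.
    Both inputs are assumed to be unit vectors."""
    cos_theta = np.clip(np.dot(x0, x1), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-8:
        # The endpoints coincide, so the geodesic is constant.
        return x0.copy()
    return (np.sin((1 - t) * theta) * x0 + np.sin(t * theta) * x1) / np.sin(theta)

# Illustrative source (uniform) and target (one-hot) over 4 categories.
p0 = np.array([0.25, 0.25, 0.25, 0.25])
p1 = np.array([0.0, 0.0, 1.0, 0.0])
x0, x1 = to_sphere(p0), to_sphere(p1)

x_half = geodesic(x0, x1, 0.5)  # midpoint on the sphere
p_half = x_half ** 2            # squaring maps back to the simplex
```

Because both endpoints lie in the positive orthant, the angle between them is at most $\pi/2$, so both interpolation weights are non-negative and the whole path stays in $\mathbb{S}^d_+$.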
PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models
To parameter-efficiently fine-tune (PEFT) large language models (LLMs), the low-rank adaptation (LoRA) method approximates the model changes $\Delta W \in \mathbb{R}^{m \times n}$ through the product of two matrices $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{r \times n}$, where $r \ll \min(m, n)$, $A$ is initialized with Gaussian noise, and $B$ with zeros. LoRA **freezes the original model $W$** and **updates the Noise \& Zero adapter**, which may lead to slow convergence. To overcome this limitation, we introduce **P**r**i**ncipal **S**ingular values and **S**ingular vectors **A**daptation (PiSSA). PiSSA shares the same architecture as LoRA, but initializes the adapter matrices $A$ and $B$ with the principal components of the original matrix $W$, and puts the remaining components into a residual matrix $W^{res} \in \mathbb{R}^{m \times n}$, which is frozen during fine-tuning. Compared to LoRA, PiSSA **updates the principal components** while **freezing the residual parts**, allowing faster convergence and enhanced performance. Comparative experiments of PiSSA and LoRA across 11 different models, ranging from 184M to 70B, encompassing 5 NLG and 8 NLU tasks, reveal that PiSSA consistently outperforms LoRA under identical experimental setups.
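The initialization described in this abstract can be sketched with a truncated SVD. The sketch below is illustrative only: the function name `pissa_init` is hypothetical, and splitting each singular value evenly (as a square root) between $A$ and $B$ is an assumption, since the abstract only says the adapter holds the principal components and the rest goes into $W^{res}$.

```python
import numpy as np

def pissa_init(W, r):
    """Split W into a rank-r adapter (A, B) built from its top-r singular
    triplets and a residual W_res holding the remaining components."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_S = np.sqrt(S[:r])
    # Assumption: share each singular value equally between the two factors.
    A = U[:, :r] * sqrt_S          # shape (m, r)
    B = sqrt_S[:, None] * Vt[:r]   # shape (r, n)
    W_res = W - A @ B              # frozen during fine-tuning
    return A, B, W_res

# Illustrative weight matrix; real use would take a pretrained layer.
rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
A, B, W_res = pissa_init(W, r=2)
```

At initialization $W^{res} + AB$ equals $W$, so the adapted model starts from the same function as the pretrained one, and only the principal part is trainable.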
The Minimax Rate of HSIC Estimation for Translation-Invariant Kernels
Kernel techniques are among the most influential approaches in data science and statistics. Under mild conditions, the reproducing kernel Hilbert space associated with a kernel is capable of encoding the independence of $M\ge2$ random variables. Probably the most widespread independence measure relying on kernels is the so-called Hilbert-Schmidt independence criterion (HSIC; also referred to as distance covariance in the statistics literature). Despite various existing HSIC estimators designed since its introduction close to two decades ago, the fundamental question of the rate at which HSIC can be estimated is still open. In this work, we prove that the minimax optimal rate of HSIC estimation on $\mathbb{R}^d$ for Borel measures containing the Gaussians with continuous bounded translation-invariant characteristic kernels is $\mathcal{O}\left(n^{-1/2}\right)$. Specifically, our result implies the optimality in the minimax sense of many of the most frequently used estimators (including the U-statistic, the V-statistic, and the Nyström-based one) on $\mathbb{R}^d$.
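For concreteness, the V-statistic estimator mentioned above can be sketched for the two-variable case ($M=2$) with Gaussian kernels, which are continuous, bounded, translation-invariant, and characteristic. This is a minimal sketch under assumptions: the function names and the bandwidth are hypothetical, and it uses the common trace form $\frac{1}{n^2}\mathrm{tr}(KHLH)$, which estimates the squared HSIC when HSIC is defined as a Hilbert-space norm (conventions differ across papers).

```python
import numpy as np

def gaussian_gram(X, bandwidth=1.0):
    """Gram matrix of the Gaussian kernel on the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(d2, 0.0) / (2 * bandwidth ** 2))

def hsic_v_statistic(X, Y, bandwidth=1.0):
    """V-statistic (1/n^2) * tr(K H L H) with centering matrix H."""
    n = X.shape[0]
    K = gaussian_gram(X, bandwidth)
    L = gaussian_gram(Y, bandwidth)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2

# Synthetic data: one dependent pair and one independent pair.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1))
Y_dep = X + 0.1 * rng.standard_normal((200, 1))
Y_ind = rng.standard_normal((200, 1))

hsic_dep = hsic_v_statistic(X, Y_dep)
hsic_ind = hsic_v_statistic(X, Y_ind)
```

On such data the estimate for the dependent pair should be clearly larger than for the independent pair, whose value is small but positive because of the V-statistic's bias.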
Trading off Consistency and Dimensionality of Convex Surrogates for Multiclass Classification
In this paradigm, outcomes must be embedded into the reals with dimension $d \approx n$ in order to design a consistent surrogate loss. Consistent losses are well-motivated theoretically, yet for large $n$, such as in information retrieval and structured prediction tasks, their optimization may be computationally infeasible. In practice, outcomes are typically embedded into some $\mathbb{R}^d$ for $d \ll n$, with little known about their suitability for multiclass classification. We investigate two approaches for trading off consistency and dimensionality in multiclass classification while using a convex surrogate loss. We first formalize partial consistency when the optimized surrogate has dimension $d \ll n$. We then check if partial consistency holds under a given embedding and low-noise assumption, providing insight into when to use a particular embedding into $\mathbb{R}^d$. Finally, we present a new method to construct (fully) consistent losses with $d \ll n$ out of multiple problem instances. Our practical approach leverages parallelism to sidestep lower bounds on $d$.
Instance-Optimal Private Density Estimation in the Wasserstein Distance
Estimating the density of a distribution from samples is a fundamental problem in statistics. In many practical settings, the Wasserstein distance is an appropriate error metric for density estimation. For example, when estimating population densities in a geographic region, a small Wasserstein distance means that the estimate is able to capture roughly where the population mass is. In this work we study differentially private density estimation in the Wasserstein distance. We design and analyze instance-optimal algorithms for this problem that can adapt to easy instances. For distributions $P$ over $\mathbb{R}$, we consider a strong notion of instance-optimality: an algorithm that uniformly achieves the instance-optimal estimation rate is competitive with an algorithm that is told that the distribution is either $P$ or $Q_P$ for some distribution $Q_P$ whose probability density function (pdf) is within a factor of 2 of the pdf of $P$.
Replicability in Learning: Geometric Partitions and KKM-Sperner Lemma
Recent works have revealed the role of geometric partitions and Sperner's lemma (and its variations) in designing replicable learning algorithms and in establishing impossibility results. A partition $\mathcal{P}$ of $\mathbb{R}^d$ is called a $(k,\varepsilon)$-secluded partition if for every $\vec{p}\in\mathbb{R}^d$, an $\varepsilon$-radius ball (with respect to the $\ell_{\infty}$ norm) centered at $\vec{p}$ intersects at most $k$ members of $\mathcal{P}$. In relation to replicable learning, the parameter $k$ is closely related to the $\textit{list complexity}$, and the parameter $\varepsilon$ is related to the sample complexity of the replicable learner. Construction of secluded partitions with better parameters (small $k$ and large $\varepsilon$) will lead to replicable learning algorithms with small list and sample complexities. Motivated by this connection, we undertake a comprehensive study of secluded partitions and establish near-optimal relationships between $k$ and $\varepsilon$. 1. We show that for any $(k,\varepsilon)$-secluded partition where each member has at most unit measure, it must be that $k \geq(1+2\varepsilon)^d$, and consequently, for the interesting regime $k\in[2^d]$ it must be that $\varepsilon\leq\frac{\log_4(k)}{d}$. 2. To complement this upper bound on $\varepsilon$, we give, for each $d\in\mathbb{N}$ and each viable $k\in[2^d]$, a construction of a $(k,\varepsilon)$-secluded (unit cube) partition with $\varepsilon\geq\frac{\log_4(k)}{d}\cdot\frac{1}{8\log_4(d+1)}$. This establishes the optimality of $\varepsilon$ within a logarithmic factor. 3. Finally, we adapt our proof techniques to obtain a new ``neighborhood'' variant of the cubical KKM lemma (or cubical Sperner's lemma): For any coloring of $[0,1]^d$ in which no color is used on opposing faces, it holds for each $\varepsilon\in(0,\frac12]$ that there is a point where the open $\varepsilon$-radius $\ell_\infty$-ball intersects at least $(1+\frac23\varepsilon)^d$ colors. While the classical Sperner/KKM lemma guarantees the existence of a point that is adjacent to points with $(d+1)$ distinct colors, the neighborhood version guarantees the existence of a small neighborhood with exponentially many points with distinct colors.
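One way to see how the second bound in item 1 follows from the first is sketched below. This is a reconstruction from the two stated inequalities, not necessarily the argument used in the paper. It relies on the elementary fact that $2^x \le 1+x$ for $x\in[0,1]$, which holds because $2^x$ is convex and agrees with $1+x$ at $x=0$ and $x=1$.

$$
\begin{aligned}
k \le 2^d \ \text{and}\ k \ge (1+2\varepsilon)^d
  &\;\Longrightarrow\; 1+2\varepsilon \le 2
  \;\Longrightarrow\; 2\varepsilon \in [0,1], \\
1 + 2\varepsilon \ge 2^{2\varepsilon} = 4^{\varepsilon}
  &\;\Longrightarrow\; k \ge (1+2\varepsilon)^d \ge 4^{\varepsilon d}
  \;\Longrightarrow\; \varepsilon \le \frac{\log_4(k)}{d}.
\end{aligned}
$$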
Learning a Single Neuron Robustly to Distributional Shifts and Adversarial Label Noise
We study the problem of learning a single neuron with respect to the $L_2^2$-loss in the presence of adversarial distribution shifts, where the labels can be arbitrary, and the goal is to find a best-fit function. More precisely, given training samples from a reference distribution $p_0$, the goal is to approximate the vector $\mathbf{w}^*$ which minimizes the squared loss with respect to the worst-case distribution that is close in $\chi^2$-divergence to $p_{0}$. We design a computationally efficient algorithm that recovers a vector $\hat{\mathbf{w}}$ satisfying $\mathbb{E}_{p^*} (\sigma(\hat{\mathbf{w}} \cdot \mathbf{x}) - y)^2 \leq C \hspace{0.2em}
Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit
Prior works showed that gradient-based training of neural networks can learn this target with $n\gtrsim d^{\Theta(p)}$ samples, and such complexity is predicted to be necessary by the correlational statistical query lower bound. Surprisingly, we prove that a two-layer neural network optimized by an SGD-based algorithm (on the squared loss) learns $f_*$ with a complexity that is not governed by the information exponent. Specifically, for arbitrary polynomial single-index models, we establish a sample and runtime complexity of $n \simeq T = \Theta(d\cdot\mathrm{polylog}\, d)$, where $\Theta(\cdot)$ hides a constant only depending on the degree of $\sigma_*$; this dimension dependence matches the information-theoretic limit up to polylogarithmic factors. More generally, we show that $n\gtrsim d^{(p_*-1)\vee 1}$ samples are sufficient to achieve low generalization error, where $p_* \le p$ is the \textit{generative exponent} of the link function. Core to our analysis is the reuse of minibatch in the gradient computation, which gives rise to higher-order information beyond correlational queries.
This Too Shall Pass: Removing Stale Observations in Dynamic Bayesian Optimization
Bayesian Optimization (BO) has proven to be very successful at optimizing a static, noisy, costly-to-evaluate black-box function $f: \mathcal{S} \to \mathbb{R}$. However, optimizing a black box that is also a function of time (*i.e.*, a *dynamic* function) $f: \mathcal{S} \times \mathcal{T} \to \mathbb{R}$ remains a challenge, since a dynamic Bayesian Optimization (DBO) algorithm has to keep track of the optimum over time. This changes the nature of the optimization problem in at least three aspects: (i) querying an arbitrary point in $\mathcal{S} \times \mathcal{T}$ is impossible, (ii) past observations become less and less relevant for keeping track of the optimum as time goes by, and (iii) the DBO algorithm must have a high sampling frequency so it can collect enough relevant observations to keep track of the optimum through time. In this paper, we design a Wasserstein distance-based criterion able to quantify the relevance of an observation with respect to future predictions. Then, we leverage this criterion to build W-DBO, a DBO algorithm able to remove irrelevant observations from its dataset on the fly, thus simultaneously maintaining good predictive performance and a high sampling frequency, even in continuous-time optimization tasks with unknown horizon. Numerical experiments establish the superiority of W-DBO, which outperforms state-of-the-art methods by a comfortable margin.