Goto

Collaborating Authors

 pyq


6bb56208f672af0dd65451f869fedfd9-Supplemental.pdf

Neural Information Processing Systems

In most applications,E " Y to begin with (ally are potential maximizer for some vector of costs, otherwise theyare not included in the set), and all points inY havepositive mass. Ittherefore also satisfies this property. We recall that we assume thatθ yields a unique maximum to the linear program onC. As a consequence, all convergent subsequences ofyn converge to the same limity pθq: it is the unique accumulation point of this sequence.Itfollowsdirectlythat yn convergesto y pθq,asitlivesinacompactset,whichyieldsthe desired result. Using different reference vectorsv yield different perturbed operations, andv " p1,2,...,dq is commonlyused.


SubgaussianandDifferentiableImportanceSampling forOff-PolicyEvaluationandLearning

Neural Information Processing Systems

Inthispaper,weanalyze the theoretical properties of the IS estimator by deriving a novel anticoncentration bound that formalizes the intuition behind itsundesired behavior. Then, we propose anew class of IS transformations, based on the notion of power mean. To the best of our knowledge, the resulting estimator is the first to achieve, under certainconditions, twokeyproperties: (i)itdisplays asubgaussianconcentration rate; (ii) it preserves the differentiability in the target distribution.


Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO

arXiv.org Artificial Intelligence

We present a fine-grained theoretical analysis of the performance gap between reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under a representation gap. Our study decomposes this gap into two sources: an explicit representation gap under exact optimization and an implicit representation gap under finite samples. In the exact optimization setting, we characterize how the relative capacities of the reward and policy model classes influence the final policy qualities. We show that RLHF, DPO, or online DPO can outperform one another depending on type of model mis-specifications. Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified. In the approximate optimization setting, we provide a concrete construction where the ground-truth reward is implicitly sparse and show that RLHF requires significantly fewer samples than DPO to recover an effective reward model -- highlighting a statistical advantage of two-stage learning. Together, these results provide a comprehensive understanding of the performance gap between RLHF and DPO under various settings, and offer practical insights into when each method is preferred.


On the contraction properties of Sinkhorn semigroups

arXiv.org Machine Learning

We develop a novel semigroup contraction analysis based on Lyapunov techniques to prove the exponential convergence of Sinkhorn equations on weighted Banach spaces. This operator-theoretic framework yields exponential decays of Sinkhorn iterates towards Schr\"odinger bridges with respect to general classes of $\phi$-divergences as well as in weighted Banach spaces. To the best of our knowledge, these are the first results of this type in the literature on entropic transport and the Sinkhorn algorithm. We also illustrate the impact of these results in the context of multivariate linear Gaussian models as well as statistical finite mixture models including Gaussian-kernel density estimation of complex data distributions arising in generative models.


Lower Bounds on the Expressivity of Recurrent Neural Language Models

arXiv.org Artificial Intelligence

The recent successes and spread of large neural language models (LMs) call for a thorough understanding of their computational ability. Describing their computational abilities through LMs' \emph{representational capacity} is a lively area of research. However, investigation into the representational capacity of neural LMs has predominantly focused on their ability to \emph{recognize} formal languages. For example, recurrent neural networks (RNNs) with Heaviside activations are tightly linked to regular languages, i.e., languages defined by finite-state automata (FSAs). Such results, however, fall short of describing the capabilities of RNN \emph{language models} (LMs), which are definitionally \emph{distributions} over strings. We take a fresh look at the representational capacity of RNN LMs by connecting them to \emph{probabilistic} FSAs and demonstrate that RNN LMs with linearly bounded precision can express arbitrary regular LMs.


On Efficiently Representing Regular Languages as RNNs

arXiv.org Artificial Intelligence

Recent work by Hewitt et al. (2020) provides an interpretation of the empirical success of recurrent neural networks (RNNs) as language models (LMs). It shows that RNNs can efficiently represent bounded hierarchical structures that are prevalent in human language. This suggests that RNNs' success might be linked to their ability to model hierarchy. However, a closer inspection of Hewitt et al.'s (2020) construction shows that it is not inherently limited to hierarchical structures. This poses a natural question: What other classes of LMs can RNNs efficiently represent? To this end, we generalize Hewitt et al.'s (2020) construction and show that RNNs can efficiently represent a larger class of LMs than previously claimed -- specifically, those that can be represented by a pushdown automaton with a bounded stack and a specific stack update function. Altogether, the efficiency of representing this diverse class of LMs with RNN LMs suggests novel interpretations of their inductive bias.


Safe Reinforcement Learning with Instantaneous Constraints: The Role of Aggressive Exploration

arXiv.org Artificial Intelligence

This paper studies safe Reinforcement Learning (safe RL) with linear function approximation and under hard instantaneous constraints where unsafe actions must be avoided at each step. Existing studies have considered safe RL with hard instantaneous constraints, but their approaches rely on several key assumptions: $(i)$ the RL agent knows a safe action set for {\it every} state or knows a {\it safe graph} in which all the state-action-state triples are safe, and $(ii)$ the constraint/cost functions are {\it linear}. In this paper, we consider safe RL with instantaneous hard constraints without assumption $(i)$ and generalize $(ii)$ to Reproducing Kernel Hilbert Space (RKHS). Our proposed algorithm, LSVI-AE, achieves $\tilde{\cO}(\sqrt{d^3H^4K})$ regret and $\tilde{\cO}(H \sqrt{dK})$ hard constraint violation when the cost function is linear and $\cO(H\gamma_K \sqrt{K})$ hard constraint violation when the cost function belongs to RKHS. Here $K$ is the learning horizon, $H$ is the length of each episode, and $\gamma_K$ is the information gain w.r.t the kernel used to approximate cost functions. Our results achieve the optimal dependency on the learning horizon $K$, matching the lower bound we provide in this paper and demonstrating the efficiency of LSVI-AE. Notably, the design of our approach encourages aggressive policy exploration, providing a unique perspective on safe RL with general cost functions and no prior knowledge of safe actions, which may be of independent interest.


Precise Asymptotics for Spectral Methods in Mixed Generalized Linear Models

arXiv.org Artificial Intelligence

In a mixed generalized linear model, the objective is to learn multiple signals from unlabeled observations: each sample comes from exactly one signal, but it is not known which one. We consider the prototypical problem of estimating two statistically independent signals in a mixed generalized linear model with Gaussian covariates. Spectral methods are a popular class of estimators which output the top two eigenvectors of a suitable data-dependent matrix. However, despite the wide applicability, their design is still obtained via heuristic considerations, and the number of samples $n$ needed to guarantee recovery is super-linear in the signal dimension $d$. In this paper, we develop exact asymptotics on spectral methods in the challenging proportional regime in which $n, d$ grow large and their ratio converges to a finite constant. By doing so, we are able to optimize the design of the spectral method, and combine it with a simple linear estimator, in order to minimize the estimation error. Our characterization exploits a mix of tools from random matrices, free probability and the theory of approximate message passing algorithms. Numerical simulations for mixed linear regression and phase retrieval display the advantage enabled by our analysis over existing designs of spectral methods.


Compositional Active Inference II: Polynomial Dynamics. Approximate Inference Doctrines

arXiv.org Artificial Intelligence

We develop the compositional theory of active inference by introducing activity, functorially relating statistical games to the dynamical systems which play them, using the new notion of approximate inference doctrine. In order to exhibit such functors, we first develop the necessary theory of dynamical systems, using a generalization of the language of polynomial functors to supply compositional interfaces of the required types: with the resulting polynomially indexed categories of coalgebras, we construct monoidal bicategories of differential and dynamical ``hierarchical inference systems'', in which approximate inference doctrines have semantics. We then describe ``externally parameterized'' statistical games, and use them to construct two approximate inference doctrines found in the computational neuroscience literature, which we call the `Laplace' and the `Hebb-Laplace' doctrines: the former produces dynamical systems which optimize the posteriors of Gaussian models; and the latter produces systems which additionally optimize the parameters (or `weights') which determine their predictions.


Post-processing of Differentially Private Data: A Fairness Perspective

arXiv.org Artificial Intelligence

Post-processing immunity is a fundamental property of differential privacy: it enables arbitrary data-independent transformations to differentially private outputs without affecting their privacy guarantees. Post-processing is routinely applied in data-release applications, including census data, which are then used to make allocations with substantial societal impacts. This paper shows that post-processing causes disparate impacts on individuals or groups and analyzes two critical settings: the release of differentially private datasets and the use of such private datasets for downstream decisions, such as the allocation of funds informed by US Census data. In the first setting, the paper proposes tight bounds on the unfairness of traditional post-processing mechanisms, giving a unique tool to decision-makers to quantify the disparate impacts introduced by their release. In the second setting, this paper proposes a novel post-processing mechanism that is (approximately) optimal under different fairness metrics, either reducing fairness issues substantially or reducing the cost of privacy. The theoretical analysis is complemented with numerical simulations on Census data.