b98d5883da07b3b3b8afd3fd654dc57a-Paper-Conference.pdf

Neural Information Processing Systems

[...] linear functions on R. [...] the best known lower bound in general is still 2. We focus on neural networks that are compatible with certain polyhedral complexes, more precisely with the [...]. For such neural networks, we prove a non-constant lower bound of [...].


Learning Juntas under Markov Random Fields

Neural Information Processing Systems

We give an algorithm for learning O(log n)-juntas in polynomial time with respect to Markov Random Fields (MRFs) in a smoothed analysis framework where only the external field has been randomly perturbed. This is a broad generalization of the work of Kalai and Teng, who gave an algorithm that succeeded with respect to smoothed product distributions (i.e., MRFs whose dependency graph has no edges). Our algorithm has two phases: (1) an unsupervised structure learning phase and (2) a greedy supervised learning algorithm. This is the first example where algorithms for learning the structure of undirected graphical models have downstream applications to supervised learning.
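For readers unfamiliar with the term, a k-junta is a Boolean function on n bits whose value depends on at most k of them. The sketch below is only a brute-force illustration of that definition, not the learning algorithm described in the abstract; the function names and the example function are hypothetical.

```python
from itertools import product


def relevant_variables(f, n):
    """Return the sorted indices i for which flipping bit i changes f on some input."""
    relevant = set()
    for x in product((0, 1), repeat=n):
        fx = f(x)
        for i in range(n):
            if i in relevant:
                continue
            # Flip only coordinate i and compare the outputs.
            y = x[:i] + (1 - x[i],) + x[i + 1:]
            if f(y) != fx:
                relevant.add(i)
    return sorted(relevant)


def example(x):
    # Hypothetical 2-junta on 5 bits: XOR of bits 1 and 3.
    return x[1] ^ x[3]


print(relevant_variables(example, 5))  # [1, 3]
```

The exhaustive search takes time exponential in n, which is why efficient junta learning under structured distributions is a nontrivial problem.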


Transferring Causal Effects using Proxies

Neural Information Processing Systems

We consider the problem of estimating a causal effect in a multi-domain setting. The causal effect of interest is confounded by an unobserved confounder and can change between the different domains. We assume that we have access to a proxy of the hidden confounder and that all variables are discrete or categorical. We propose methodology to estimate the causal effect in the target domain, where we assume to observe only the proxy variable. Under these conditions, we prove identifiability (even when treatment and response variables are continuous). We introduce two estimation techniques, prove consistency, and derive confidence intervals. The theoretical results are supported by simulation studies and a real-world example studying the causal effect of website rankings on consumer choices.


8c2e2925e75e501088004dd685f0ae81-Paper-Conference.pdf

Neural Information Processing Systems

We study the sample complexity of Bayesian recovery for solving inverse problems with general prior, forward operator and noise distributions. We consider posterior sampling according to an approximate prior P, and establish sufficient conditions for stable and accurate recovery with high probability. Our main result is a non-asymptotic bound that shows that the sample complexity depends on (i) the intrinsic complexity of P, quantified by its approximate covering number, and (ii) concentration bounds for the forward operator and noise distributions. As a key application, we specialize to generative priors, where P is the pushforward of a latent distribution via a Deep Neural Network (DNN). We show that the sample complexity scales log-linearly with the latent dimension k, thus establishing the efficacy of DNN-based priors. Generalizing existing results on deterministic (i.e., non-Bayesian) recovery for the important problem of random sampling with an orthogonal matrix U, we show how the sample complexity is determined by the coherence of U with respect to the support of P. Hence, we establish that coherence plays a fundamental role in Bayesian recovery as well. Overall, our framework unifies and extends prior work, providing rigorous guarantees for the sample complexity of solving Bayesian inverse problems with arbitrary distributions.


Multi-Objective One-Shot Pruning for Large Language Models

Neural Information Processing Systems

Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks but require substantial computational resources, limiting their deployment in resource-constrained environments. While one-shot pruning methods can reduce model size without expensive retraining, they typically optimize for single objectives, ignoring LLMs' multi-faceted applications. We introduce Multi-Objective One-Shot Pruning (MOSP), which formulates LLM pruning as a multi-objective optimization problem. MOSP efficiently generates a Pareto set of pruned models representing different capability trade-offs, allowing users to select solutions aligned with their preferences. The proposed approach identifies a shared core support while enabling specialized support. Experiments across various LLMs and sparsity levels demonstrate MOSP's superior performance in navigating multi-objective trade-offs compared to baseline methods.
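As a reminder of what a Pareto set is, the sketch below filters a list of candidates down to the non-dominated ones. It is a generic illustration and not MOSP itself; the objective values are made up, and lower is assumed to be better in every objective.

```python
def pareto_front(points):
    """Return the points not dominated by any other point (lower is better)."""

    def dominates(a, b):
        # a dominates b if it is no worse everywhere and strictly better somewhere.
        no_worse = all(x <= y for x, y in zip(a, b))
        strictly_better = any(x < y for x, y in zip(a, b))
        return no_worse and strictly_better

    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (loss on task A, loss on task B) pairs for five pruned models.
candidates = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (2.5, 3.0)]
front = pareto_front(candidates)
print(front)  # [(1.0, 5.0), (2.0, 3.0), (4.0, 1.0)]
```

Every point on the front represents a different trade-off; choosing among them is left to the user's preferences.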


Instance-Dependent Regret Bounds for Nonstochastic Linear Partial Monitoring

Neural Information Processing Systems

In contrast to the classic formulation of partial monitoring, linear partial monitoring can model infinite outcome spaces, while imposing a linear structure on both the losses and the observations. This setting can be viewed as a generalization of linear bandits where loss and feedback are decoupled in a flexible manner. In this work, we address a nonstochastic (adversarial), finite-actions version of the problem through a simple instance of the exploration-by-optimization method that is amenable to efficient implementation. We derive regret bounds that depend on the game structure in a more transparent manner than previous theoretical guarantees for this paradigm. Our bounds feature instance-specific quantities that reflect the degree of alignment between observations and losses, and resemble known guarantees in the stochastic setting. Notably, they achieve the standard $\sqrt{T}$ rate in easy (locally observable) games and $T^{2/3}$ in hard (globally observable) games, where $T$ is the time horizon. We instantiate these bounds in a selection of old and new partial information settings subsumed by this model, and illustrate that the achieved dependence on the game structure can be tight in interesting cases.


Flexible inference for animal learning rules using neural networks

Neural Information Processing Systems

Understanding how animals learn is a central challenge in neuroscience, with growing relevance to the development of animal- or human-aligned artificial intelligence. However, existing approaches tend to assume fixed parametric forms for the learning rule (e.g., Q-learning, policy gradient), which may not accurately describe the complex forms of learning employed by animals in realistic settings. Here we address this gap by developing a framework to infer learning rules directly from behavioral data collected during de novo task learning. We assume that animals follow a decision policy parameterized by a generalized linear model (GLM), and we model their learning rule, the mapping from task covariates to per-trial weight updates, using a deep neural network (DNN). This formulation allows flexible, data-driven inference of learning rules while maintaining an interpretable form of the decision policy itself.


Eigen-Spike Emergence and Quadratic Equivalents for Conjugate Kernels on Nonlinearly Separable Data

arXiv.org Machine Learning

Recent work in random matrix theory (RMT) has developed the notion of deterministic equivalents: typically linear surrogate models that approximate the spectral behavior of large nonlinear random matrices, such as nonlinear feature maps in neural networks (NNs). On the one hand, these deterministic equivalents make theoretical predictions tractable by reducing a complex model to a simpler model with properties that fall under the umbrella of classical RMT tools. However, this leaves open the question of whether this idealized linear equivalence remains meaningful when dealing with high-dimensional nonlinearly separable data, such as when performing classification on such data. Motivated by this, we consider the conjugate kernel (CK), which is the nonlinear feature map of a feedforward NN, under a canonical nonlinearly separable dataset, the XOR problem; and we use the study of informative outlier eigenvalues in the CK and whether their corresponding eigenvectors asymptotically align with XOR labels as a proxy for nonlinear learnability. We develop a robust quadratic equivalent to the spiked CK matrix that enables a precise analysis of emergent informative spikes, as one modifies various knobs common in ML practice: sample complexity, signal-to-noise ratio (SNR), nonlinear activation choice, and pretrained features. In each of these scenarios, we derive a precise BBP-type phase transition in which linear classification via the CK eigenvectors becomes possible. Our analysis helps translate the power of deterministic equivalence tools in RMT to study problems of practical relevance in ML.


Generative Modeling by Value-Driven Transport

arXiv.org Machine Learning

We propose a new framework for generative modeling based on a discrete-time stochastic control formulation of measure transport. Adapting classic results from control theory, we formulate our problem as a linear program whose dual variables correspond to the optimal value function of the control problem, which directly encodes the optimal control policy. Exploiting this LP formulation, we develop an efficient simulation-free primal-dual algorithm for computing approximately optimal value functions and the associated value-driven transport (VDT) policies which approximate the true optimal policy. We show that well-trained VDT policies enjoy numerous favorable properties in comparison with other state-of-the-art methods based on flows, diffusions, or Schrödinger bridges: they lead to straight transport paths which can be simulated quickly and robustly, and can be enhanced in all the same ways as diffusion and flow-based models (e.g., conditional generation, classifier-free guidance, unpaired data-to-data translation are all easy to incorporate). We evaluate our methodology in a range of experiments, with results that indicate strong performance and good potential for scalability.


Breaking the Finite-Sample Barrier in Entropy Coupling

arXiv.org Machine Learning

Dependence among marginally constrained observations can break a finite-sample barrier. To formalize this phenomenon, we introduce the minimum list entropy coupling $H(P\|Q_1,\dots,Q_m)$, the minimum conditional entropy $H(X|Y_1,\dots,Y_m)$ over all joint distributions with prescribed discrete marginals $X\sim P$ and $Y_i\sim Q_i$. Unlike classical formulations based on independent observations, our model allows $Y_1,\dots,Y_m$ to be arbitrarily dependent while keeping each marginal fixed. This enlarged coupling space reveals a sharp dichotomy: independent observations reduce residual uncertainty exponentially, whereas dependent observations can eliminate it exactly after finitely many samples. We characterize this zero-entropy regime through necessary and sufficient conditions and give concrete structural criteria under which it occurs. In particular, under mild support assumptions, zero entropy is achieved with $O(\log(1/P_{\min}))$ observations, where $P_{\min}$ is the minimum nonzero mass of $P$. We also develop a greedy algorithm with monotone approximation guarantees for computing $H(P\|Q_1,\dots,Q_m)$. Finally, we show that the same framework formalizes finite-sample limits in distribution-matching representation learning and randomness extraction, where zero entropy corresponds to exact recovery and exact extraction.
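A toy calculation may help make the zero-entropy regime concrete. The sketch below is not the paper's greedy algorithm; it only evaluates $H(X|Y_1,Y_2)$ for two hand-built couplings that share the same marginals, under the assumption that $P$ is uniform on four outcomes and $Q_1$, $Q_2$ are uniform bits. The helper name and both couplings are illustrative choices.

```python
from collections import defaultdict
from math import log2


def conditional_entropy(joint):
    """H(X | Y) in bits for a joint pmf given as {(x, y): probability}."""
    p_y = defaultdict(float)
    for (_, y), p in joint.items():
        p_y[y] += p
    return -sum(p * log2(p / p_y[y]) for (_, y), p in joint.items() if p > 0)


xs = range(4)

# Product coupling: (Y1, Y2) carries no information about X.
product_coupling = {
    (x, (y1, y2)): 1 / 16 for x in xs for y1 in (0, 1) for y2 in (0, 1)
}

# Dependent coupling: Y1 and Y2 are the two bits of X.
# Each Y_i is still a uniform bit, so the marginals are unchanged.
bit_coupling = {(x, (x >> 1, x & 1)): 1 / 4 for x in xs}

h_product = conditional_entropy(product_coupling)
h_bits = conditional_entropy(bit_coupling)
print(h_product, h_bits)
```

With the product coupling all two bits of uncertainty about $X$ remain, while the dependent coupling determines $X$ exactly, so the minimum over couplings is zero in this toy case.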