Conditional independence testing under misspecified inductive biases

Neural Information Processing Systems

Conditional independence (CI) testing is a fundamental and challenging task in modern statistics and machine learning. Many modern methods for CI testing rely on powerful supervised learning methods to learn regression functions or Bayes predictors as an intermediate step; we refer to this class of tests as regression-based tests. Although these methods are guaranteed to control Type-I error when the supervised learning methods accurately estimate the regression functions or Bayes predictors of interest, their behavior is less understood when they fail due to misspecified inductive biases, that is, when the employed models are not flexible enough or when the training algorithm does not induce the desired predictors. In this work, we study the performance of regression-based CI tests under misspecified inductive biases. Namely, we propose new approximations or upper bounds for the testing errors of three regression-based tests that depend on misspecification errors. Moreover, we introduce the Rao-Blackwellized Predictor Test (RBPT), a regression-based CI test robust against misspecified inductive biases. Finally, we conduct experiments with artificial and real data, showcasing the usefulness of our theory and methods.
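A regression-based CI test of the kind the abstract describes can be sketched as follows (a minimal illustration in the spirit of residual-based tests such as the generalized covariance measure, not the paper's RBPT; `fit_predict` and all variable names are our own placeholders): regress X on Z and Y on Z with any supervised learner, then test whether the residuals are uncorrelated.

```python
import numpy as np

def regression_ci_test(x, y, z, fit_predict):
    """Residual-based CI test sketch for H0: X independent of Y given Z.

    `fit_predict(features, target)` is a user-supplied regression routine
    returning in-sample predictions; misspecified inductive biases in this
    step are exactly what can break Type-I error control.
    """
    rx = x - fit_predict(z, x)          # residual of X after regressing on Z
    ry = y - fit_predict(z, y)          # residual of Y after regressing on Z
    prod = rx * ry
    n = len(prod)
    # Normalized statistic; approximately N(0, 1) under H0 when the
    # regression functions are estimated accurately.
    return np.sqrt(n) * prod.mean() / (prod.std() + 1e-12)

def linear_fit_predict(features, target):
    """Ordinary least squares with an intercept (one example learner)."""
    A = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return A @ coef
```

Large absolute values of the statistic lead to rejecting conditional independence; the choice of `fit_predict` encodes the inductive bias whose misspecification the paper analyzes.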


In-Context Learning Is Provably Bayesian Inference: A Generalization Theory for Meta-Learning

Wakayama, Tomoya, Suzuki, Taiji

arXiv.org Machine Learning

This paper develops a finite-sample statistical theory for in-context learning (ICL), analyzed within a meta-learning framework that accommodates mixtures of diverse task types. We introduce a principled risk decomposition that separates the total ICL risk into two orthogonal components: Bayes Gap and Posterior Variance. The Bayes Gap quantifies how well the trained model approximates the Bayes-optimal in-context predictor. For a uniform-attention Transformer, we derive a non-asymptotic upper bound on this gap, which explicitly clarifies the dependence on the number of pretraining prompts and their context length. The Posterior Variance is a model-independent risk representing the intrinsic task uncertainty. Our key finding is that this term is determined solely by the difficulty of the true underlying task, while the uncertainty arising from the task mixture vanishes exponentially fast with only a few in-context examples. Together, these results provide a unified view of ICL: the Transformer selects the optimal meta-algorithm during pretraining and rapidly converges to the optimal algorithm for the true task at test time.
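The risk decomposition described above can be written schematically as follows (notation is ours, chosen for illustration rather than taken verbatim from the paper): with \(\mathcal{R}(\cdot)\) the ICL risk, \(\hat{f}\) the trained Transformer's in-context predictor, and \(f^{\mathrm{Bayes}}\) the Bayes-optimal in-context predictor,

```latex
\mathcal{R}(\hat{f})
  = \underbrace{\mathcal{R}(\hat{f}) - \mathcal{R}(f^{\mathrm{Bayes}})}_{\text{Bayes Gap}}
  \;+\; \underbrace{\mathcal{R}(f^{\mathrm{Bayes}})}_{\text{Posterior Variance}}
```

The first term is model-dependent and bounded non-asymptotically in the paper; the second is the model-independent, intrinsic task uncertainty.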


Supplementary materials - NeuMiss networks: differentiable programming for supervised learning with missing values A Proofs

Neural Information Processing Systems

Proof of Lemma 2. Identifying the second- and first-order terms in X, we get: The last equality allows us to conclude the proof. Additionally, assume that either Assumption 2 or Assumption 3 holds. This concludes the proof according to Lemma 1. Here we establish an auxiliary result, controlling the convergence of Neumann iterates to the matrix inverse. Note that Proposition A.1 can easily be extended to the general case by working with M (61), i.e., when a nonlinearity is applied to the activations.
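The auxiliary convergence result can be illustrated numerically (a minimal sketch with our own variable names, not code from the paper): when the spectral radius of I - M is strictly below one, the truncated Neumann series sum of (I - M)^j converges to the inverse of M.

```python
import numpy as np

def neumann_inverse(M, n_iter):
    """Approximate M^{-1} by the truncated Neumann series
    sum_{j=0}^{n_iter} (I - M)^j, valid when the spectral radius
    of (I - M) is strictly below 1."""
    I = np.eye(M.shape[0])
    R = I - M
    S = I.copy()       # partial sum, starts at the j = 0 term
    term = I.copy()    # current power of R
    for _ in range(n_iter):
        term = term @ R
        S = S + term
    return S
```

The approximation error decays geometrically in the number of iterates, which is what makes these iterates a natural building block for the unrolled NeuMiss architecture.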



On Learning Fairness and Accuracy on Multiple Subgroups

Neural Information Processing Systems

In the upper level, the fair predictor is updated to stay close to all subgroup-specific predictors. We further prove that such a bilevel objective can effectively control the group sufficiency and generalization error. We evaluate the proposed framework on real-world datasets.




Review for NeurIPS paper: NeuMiss networks: differentiable programming for supervised learning with missing values.

Neural Information Processing Systems

The paper attacks the classical problem of linear regression with missing values. It computes the Bayes predictor in several missing-value settings and then uses a Neumann series to approximate it. This approximation is then used to design neural networks with ReLU activations. The propositions describing self-masking missingness, which appears to be a novel concept, are interesting but can be considered slightly restrictive because of the linear Gaussian assumptions. However, both the results and the methods should be of interest to the NeurIPS 2020 community.