characterization
Computable universal online learning
Understanding when learning is possible is a fundamental task in the theory of machine learning. However, many characterizations known from the literature deal with abstract learning as a mathematical object and ignore the crucial question: when can learning be implemented as a computer program? We address this question for universal online learning, a generalist theoretical model of online binary classification, recently characterized by Bousquet et al. (STOC 21). In this model, there is no hypothesis fixed in advance; instead, Adversary--playing the role of Nature--can change their mind as long as local consistency with the given class of hypotheses is maintained. We require Learner to achieve a finite number of mistakes while using a strategy that can be implemented as a computer program. We show that universal online learning does not imply computable universal online learning, even if the class of hypotheses is relatively easy from a computabilitytheoretic perspective. We then study the agnostic variant of computable universal online learning and provide an exact characterization of classes that are learnable in this sense. We also consider a variant of proper universal online learning and show exactly when it is possible. Together, our results give a more realistic perspective on the existing theory of online binary classification and the related problem of inductive inference.
Greedy Algorithm for Structured Bandits: ASharp Characterization of Asymptotic Success / Failure
We study the greedy (exploitation-only) algorithm in bandit problems with a known reward structure. We allow arbitrary finite reward structures, while prior work focused on a few specific ones. We fully characterize when the greedy algorithm asymptotically succeeds or fails, in the sense of sublinear vs. linear regret as a function of time. Our characterization identifies a partial identifiability property of the problem instance as the necessary and sufficient condition for the asymptotic success. Notably, once this property holds, the problem becomes easy--any algorithm will succeed (in the same sense as above), provided it satisfies a mild non-degeneracy condition. Our characterization extends to contextual bandits and interactive decision-making with arbitrary feedback. Examples demonstrating broad applicability and extensions to infinite reward structures are provided.
Structural Causal Bandits under Markov Equivalence
In decision-making processes, an intelligent agent with causal knowledge can optimize action spaces to avoid unnecessary exploration. A structural causal bandit framework provides guidance on how to prune actions that are unable to maximize reward by leveraging prior knowledge of the underlying causal structure among actions. A key assumption of this framework is that the agent has access to a fully-specified causal diagram representing the target system. In this paper, we extend the structural causal bandits to scenarios where the agent leverages a Markov equivalence class. In such cases, the causal structure is provided to the agent in the form of a partial ancestral graph (PAG). We propose a generalized framework for identifying potentially optimal actions within this graph structure, thereby broadening the applicability of structural causal bandits.
A solvable model of learning generative diffusion: theory and insights
In this manuscript, we analyze a solvable model of flow or diffusion-based generative model. We consider the problem of learning a model parametrized by a two-layer auto-encoder, trained with online stochastic gradient descent, on a highdimensional target density with an underlying low-dimensional manifold structure. We derive a tight asymptotic characterization of low-dimensional projections of the distribution of samples generated by the learned model, ascertaining in particular its dependence on the number of training samples. Building on this analysis, we discuss how mode collapse can arise, and lead to model collapse when the generative model is re-trained on generated synthetic data.
PAC Learning with Bandit Feedback: Sharp Sample Complexity in the Realizable Setting
Hanneke, Steve, Meng, Qinglin, Moran, Shay, Shaeiri, Amirreza
We study the problem of multiclass PAC learning with bandit feedback in the realizable setting. In this framework, there is an unknown data distribution over an instance space $\mathcal{X}$ and a label space $\mathcal{Y}$, as in classical multiclass PAC learning, but the learner does not observe the labels of the i.i.d. training examples. Instead, in each round, it receives an unlabeled instance, predicts its label, and receives bandit feedback indicating only whether the prediction is correct. Despite this restriction, the goal remains the same as in classical PAC learning. We provide a general characterization of the optimal sample complexity of this problem, sharp for every concept class up to logarithmic factors. Our characterization is based on a new combinatorial dimension, termed the bandit $\mathrm{DS}$ dimension, defined via generalized combinatorial structures we call pseudo-boxes. These extend the pseudo-cubes underlying the $\mathrm{DS}$ dimension by allowing a different number of neighbors in each coordinate. In contrast to the $\mathrm{DS}$ dimension, which governs the full-information setting by counting the number of coordinates in the pseudo-cube, the bandit $\mathrm{DS}$ dimension aggregates the number of neighbors across coordinates, leading to a characterization in which the sample complexity scales with the total number of neighbors. We also propose a general learning algorithm achieving the upper bound, based on an algorithmic principle called ListCascade, which connects bandit learning to list learning and may be of independent interest.
Debiasing Random Oblique Projections for Subsampled OLS and Fast CUR in High Dimensions
Niu, Chengmei, Garg, Sachin, Dereziński, Michał, Liao, Zhenyu
Random sampling is a fundamental tool in modern machine learning and numerical linear algebra for reducing the computational cost of large-scale matrix problems. Existing analyses, however, rely primarily on subspace embedding guarantees, which do not precisely characterize the statistical bias of nonlinear random oblique projections induced by sampling, which arises ubiquitously in subsampled least squares and fast low-rank approximation methods. Because (pseudo)inversion is nonlinear, these random oblique projections can be systematically biased even when the underlying sketch is unbiased, thereby introducing hidden bias into downstream least squares and low-rank approximation solutions. In this work, we develop a unified non-asymptotic theory for random oblique projections in high dimensions. We show that standard random sampling schemes generally induce a systematic statistical bias overlooked by classical subspace embedding-style analyses, and we propose a principled debiasing framework to correct it. We illustrate the power of the theory through two canonical applications. For subsampled least squares, we obtain sharp bias--variance characterizations, reveal previously unrecognized statistical suboptimality in widely used sampling schemes, and identify when debiasing yields provable improvements. For fast CUR decomposition, we develop a debiased approach with improved approximation accuracy. Numerical experiments further validate our theoretical findings.
Concomitant DAG Learning: On the Roles of Noise Adaptivity, Sparsity, and Non-negativity
Mateos, Gonzalo, Rey, Samuel, Ajorlou, Hamed, Tepper, Mariano
Directed acyclic graphs (DAGs) constitute a central modeling tool to enable principled reasoning about cause-effect interactions in complex systems. However, since the causal structure underlying a group of variables is often unknown and interventions may be infeasible or ethically challenging to implement, there is a need to address the task of inferring DAGs from observational data. However, most classical structure identification approaches face two key obstacles: the combinatorial challenge of enforcing acyclicity, which severely limits scalability, and identifiability challenges arising from latent confounding or heterogeneous noise. This tutorial offers an overview of recent signal processing and optimization advances that address these issues by recasting DAG structure learning as a continuous, score-based estimation problem over adjacency matrices. We begin with a didactic introduction to structural equation models and the formulation of causal graph recovery, followed by a historical survey of score-based methods ranging from early combinatorial search schemes and greedy heuristics to modern continuous frameworks that leverage smooth characterizations of acyclicity. Building on this foundation, we describe concomitant DAG estimation methods that jointly infer sparse causal structure and exogenous noise levels, improving robustness under heteroscedasticity and distribution shifts by rendering the estimator noise adaptive. All in all, the tutorial introduces readers to challenges and opportunities for signal processing research at the crossroads of causal inference, high-dimensional statistics, and scalable graph learning, while outlining emerging directions including online, nonlinear, and neural causal discovery.
Contradiction Graphs Determine VC Dimension
Campbell, Jesse, Ibaibarriaga, Daniel, Reyzin, Lev
The Vapnik-Chervonenkis dimension is the fundamental combinatorial parameter of distribution-free binary classification. Introduced by Vapnik and Chervonenkis in their work on uniform convergence [VC71], and closely connected to the Sauer-Shelah lemma [Sau72, She72], it characterizes classical PAC learnability [Val84, BEHW89, EHKV89]. In particular, finite VC dimension is equivalent to distribution-free learnability. This paper asks whether that finite-versus-infinite VC dichotomy is still visible after replacing a concept class by its contradiction graphs. For a binary class H {0,1}X, the order-m contradiction graph Gm(H) has as vertices the H-realizable labeled samples of length m, with an edge between two samples if they assign opposite labels to some common domain point. Throughout, samples are ordered sequences, and repetitions of domain points are allowed, subject to consistent labeling. We use the contradiction-graph framework introduced by Alon et al. in their graph-theoretic characterization of private learnability [AMSY24]. They ask whether two binary classes can have isomorphic contradiction graphs at every level while one has finite VC dimension and the other has infinite VC dimension.
What is Learnable in Valiant's Theory of the Learnable?
Hanneke, Steve, Mehrotra, Anay, Velegkas, Grigoris, Zampetakis, Manolis
Valiant's 1984 paper is widely credited with introducing the PAC learning model, but it, in fact, introduced a different model: unlike PAC learning, the learner receives only positives, may issue membership queries, and must output a hypothesis with no false positives. Prior work characterized variants, including the case without queries. We revisit Valiant's original model and ask: *Which classes are learnable in it?* For every finite domain, including Valiant's Boolean-hypercube setting, we show that a class is learnable if and only if every realizable positive sample can be certified by a poly-size adaptive query-compression scheme. This is a new variant of sample compression where the learner certifies samples via a short interaction with the membership oracle. Our characterization shows that learnability in Valiant's model is strictly sandwiched between learnability in the PAC model and the variant of Valiant's model without membership queries. This is one of the rare cases where introducing membership queries changes the set of learnable classes, and not just the sample or computational complexity. Next, we study the natural extension of the model to arbitrary domains. While we do not obtain an exact characterization, our techniques readily generalize and show that the same strict sandwiching persists. Finally, we show that $d$-dimensional halfspaces, which are not learnable without queries, are learnable with queries: we give a $\mathrm{poly}(d) \tilde{O}(1/ε)$ sample and $\mathrm{poly}(d) \mathrm{polylog}(1/ε)$ query algorithm, and prove that at least $Ω(d)$ samples or queries are necessary. To our knowledge, this is the first algorithm for halfspaces in Valiant's model. Together, these results uncover a surprisingly rich theory behind Valiant's original notion of learnability and introduce ideas that may be of independent interest in learning theory.
Optimal Policy Learning under Budget and Coverage Constraints
We study optimal policy learning under combined budget and minimum coverage constraints. We show that the problem admits a knapsack-type structure and that the optimal policy can be characterized by an affine threshold rule involving both budget and coverage shadow prices. We establish that the linear programming relaxation of the combinatorial solution has an O(1) integrality gap, implying asymptotic equivalence with the optimal discrete allocation. Building on this result, we analyze two implementable approaches: a Greedy-Lagrangian (GLC) and a rank-and-cut (RC) algorithm. We show that the GLC closely approximates the optimal solution and achieves near-optimal performance in finite samples. By contrast, RC is approximately optimal whenever the coverage constraint is slack or costs are homogeneous, while misallocation arises only when cost heterogeneity interacts with a binding coverage constraint. Monte Carlo evidence supports these findings.