Computational Learning Theory
Efficient Algorithms for Learning and Compressing Monophonic Halfspaces in Graphs
Bressan, Marco, Chepoi, Victor, Esposito, Emmanuel, Thiessen, Maximilian
Abstract notions of convexity over the vertices of a graph, and corresponding notions of halfspaces, have recently gained attention from the machine learning community. In this work we study monophonic halfspaces, a notion of graph halfspaces defined through closure under induced paths. Our main result is a $2$-satisfiability based decomposition theorem, which allows one to represent monophonic halfspaces as a disjoint union of certain vertex subsets. Using this decomposition, we achieve efficient and (nearly) optimal algorithms for various learning problems, such as teaching, active, and online learning. Most notably, we obtain a polynomial-time algorithm for empirical risk minimization. Independently of the decomposition theorem, we obtain an efficient, stable, and proper sample compression scheme. This makes monophonic halfspaces efficiently learnable with proper learners and linear error rate $1/\varepsilon$ in the realizable PAC setting. Our results answer open questions from the literature, and show a stark contrast with geodesic halfspaces, for which most of the said learning problems are NP-hard.
Less Greedy Equivalence Search
Ejaz, Adiba, Bareinboim, Elias
Greedy Equivalence Search (GES) is a classic score-based algorithm for causal discovery from observational data. In the sample limit, it recovers the Markov equivalence class of graphs that describe the data. Still, it faces two challenges in practice: computational cost and finite-sample accuracy. In this paper, we develop Less Greedy Equivalence Search (LGES), a variant of GES that retains its theoretical guarantees while partially addressing these limitations. LGES modifies the greedy step: rather than always applying the highest-scoring insertion, it avoids edge insertions between variables for which the score implies some conditional independence. This more targeted search yields up to a \(10\)-fold speed-up and a substantial reduction in structural error relative to GES. Moreover, LGES can guide the search using prior assumptions, while correcting these assumptions when contradicted by the data. Finally, LGES can exploit interventional data to refine the learned observational equivalence class. We prove that LGES recovers the true equivalence class in the sample limit from observational and interventional data, even with misspecified prior assumptions. Experiments demonstrate that LGES outperforms GES and other baselines in speed, accuracy, and robustness to misspecified assumptions. Our code is available at https://github.com/CausalAILab/lges.
Towards Fundamental Limits for Active Multi-distribution Learning
Multi-distribution learning extends agnostic Probably Approximately Correct (PAC) learning to the setting in which a family of $k$ distributions, $\{D_i\}_{i\in[k]}$, is considered and a classifier's performance is measured by its error under the worst distribution. This problem has attracted a lot of recent interests due to its applications in collaborative learning, fairness, and robustness. Despite a rather complete picture of sample complexity of passive multi-distribution learning, research on active multi-distribution learning remains scarce, with algorithms whose optimality remaining unknown. In this paper, we develop new algorithms for active multi-distribution learning and establish improved label complexity upper and lower bounds, in distribution-dependent and distribution-free settings. Specifically, in the near-realizable setting we prove an upper bound of $\widetilde{O}\Bigl(θ_{\max}(d+k)\ln\frac{1}{\varepsilon}\Bigr)$ and $\widetilde{O}\Bigl(θ_{\max}(d+k)\Bigl(\ln\frac{1}{\varepsilon}+\frac{ν^2}{\varepsilon^2}\Bigr)+\frac{kν}{\varepsilon^2}\Bigr)$ in the realizable and agnostic settings respectively, where $θ_{\max}$ is the maximum disagreement coefficient among the $k$ distributions, $d$ is the VC dimension of the hypothesis class, $ν$ is the multi-distribution error of the best hypothesis, and $\varepsilon$ is the target excess error. Moreover, we show that the bound in the realizable setting is information-theoretically optimal and that the $kν/\varepsilon^2$ term in the agnostic setting is fundamental for proper learners. We also establish instance-dependent sample complexity bound for passive multidistribution learning that smoothly interpolates between realizable and agnostic regimes~\citep{blum2017collaborative,zhang2024optimal}, which may be of independent interest.
Beyond Prediction -- Structuring Epistemic Integrity in Artificial Reasoning Systems
This paper outlines a comprehensive theoretical and architectural framework for constructing epistemically grounded artificial intelligence systems capable of propositional commitment, metacognitive reasoning, contradiction detection, and normative truth maintenance. Moving beyond the constraints of stochastic language generation, we propose a model in which artificial agents engage in structured, rule-governed reasoning that adheres to explicit epistemic norms. The approach integrates insights from epistemology, formal logic, inferential semantics, knowledge graph structuring, probabilistic justification, and immutable blockchain evidence to create systems that do not merely simulate knowledge, but operate under explicit, verifiable constraints on belief, justification, and truth. We begin with an analysis of epistemic norms in artificial reasoning, contrasting evi-dentialist, Bayesian, and logical foundations, and establishing a requirement for internal consistency and constraint against falsehood. Central to the proposed system is a prohibition against internal deception: no model component may assert what it internally contradicts.
How Many Domains Suffice for Domain Generalization? A Tight Characterization via the Domain Shattering Dimension
Dwork, Cynthia, Hu, Lunjia, Shao, Han
The ability to generalize across domains is an important component of human intelligence and a crucial milestone in the development of increasingly powerful artificial intelligence. It is not surprising that an experienced driver in one country can reasonably, albeit imperfectly, drive in another country even without further training. It is also not surprising that a skilled chess player can outperform an average person on a totally different board game. An expert doctor who has studied a disease using data from a few large hospitals can potentially provide reasonable treatment for patients in a geographically remote and biologically different population that does not have access to those large hospitals and has not been the subject of prior study. Such domain generalization abilities are empowered by the capability of the learner (i.e., the driver, chess player, or doctor) to distill and master universal laws (about driving, game playing, or medical treatment) that hold even on unseen domains, while separating them from idiosyncratic patterns that are domain-specific. In this work, we study the theoretical foundations of domain generalization with a focus on capturing this ability of learning "universal laws" while not overfitting to the "idiosyncratic patterns". Consider a family G of domains, where every domain D is a distribution on examples ( x, y) each consisting of an instance x X and a binary label y { 0, 1 }. On a specific domain D, there may be a hypothesis (i.e., classifier) h: X { 0, 1 } with low classification error err
A Distributional-Lifting Theorem for PAC Learning
Blanc, Guy, Lange, Jane, Strassle, Carmen, Tan, Li-Yang
The apparent difficulty of efficient distribution-free PAC learning has led to a large body of work on distribution-specific learning. Distributional assumptions facilitate the design of efficient algorithms but also limit their reach and relevance. Towards addressing this, we prove a distributional-lifting theorem: This upgrades a learner that succeeds with respect to a limited distribution family $\mathcal{D}$ to one that succeeds with respect to any distribution $D^\star$, with an efficiency overhead that scales with the complexity of expressing $D^\star$ as a mixture of distributions in $\mathcal{D}$. Recent work of Blanc, Lange, Malik, and Tan considered the special case of lifting uniform-distribution learners and designed a lifter that uses a conditional sample oracle for $D^\star$, a strong form of access not afforded by the standard PAC model. Their approach, which draws on ideas from semi-supervised learning, first learns $D^\star$ and then uses this information to lift. We show that their approach is information-theoretically intractable with access only to random examples, thereby giving formal justification for their use of the conditional sample oracle. We then take a different approach that sidesteps the need to learn $D^\star$, yielding a lifter that works in the standard PAC model and enjoys additional advantages: it works for all base distribution families, preserves the noise tolerance of learners, has better sample complexity, and is simpler.
Bridging Pattern-Aware Complexity with NP-Hard Optimization: A Unifying Framework and Empirical Study
NP hard optimization problems like the Traveling Salesman Problem (TSP) defy efficient solutions in the worst case, yet real-world instances often exhibit exploitable patterns. We propose a novel patternaware complexity framework that quantifies and leverages structural regularities e.g., clustering, symmetry to reduce effective computational complexity across domains, including financial forecasting and LLM optimization. With rigorous definitions, theorems, and a meta learning driven solver pipeline, we introduce metrics like Pattern Utilization Efficiency (PUE) and achieve up to 79 percent solution quality gains in TSP benchmarks (22 to 2392 cities). Distinct from theoretical NP hardness, our approach offers a unified, practical lens for pattern-driven efficiency.
On the Hardness of Bandit Learning
Brukhim, Nataly, Pacchiano, Aldo, Dudik, Miroslav, Schapire, Robert
We study the task of bandit learning, also known as best-arm identification, under the assumption that the true reward function f belongs to a known, but arbitrary, function class F. We seek a general theory of bandit learnability, akin to the PAC framework for classification. Our investigation is guided by the following two questions: (1) which classes F are learnable, and (2) how they are learnable. For example, in the case of binary PAC classification, learnability is fully determined by a combinatorial dimension - the VC dimension- and can be attained via a simple algorithmic principle, namely, empirical risk minimization (ERM). In contrast to classical learning-theoretic results, our findings reveal limitations of learning in structured bandits, offering insights into the boundaries of bandit learnability. First, for the question of "which", we show that the paradigm of identifying the learnable classes via a dimension-like quantity fails for bandit learning. We give a simple proof demonstrating that no combinatorial dimension can characterize bandit learnability, even in finite classes, following a standard definition of dimension introduced by Ben-David et al. (2019). For the question of "how", we prove a computational hardness result: we construct a reward function class for which at most two queries are needed to find the optimal action, yet no algorithm can do so in polynomial time unless RP=NP. We also prove that this class admits efficient algorithms for standard algorithmic operations often considered in learning theory, such as an ERM. This implies that computational hardness is in this case inherent to the task of bandit learning. Beyond these results, we investigate additional themes such as learning under noise, trade-offs between noise models, and the relationship between query complexity and regret minimization.
The Space Complexity of Learning-Unlearning Algorithms
Cherapanamjeri, Yeshwanth, Garg, Sumegha, Rajaraman, Nived, Sekhari, Ayush, Shetty, Abhishek
We study the memory complexity of machine unlearning algorithms that provide strong data deletion guarantees to the users. Formally, consider an algorithm for a particular learning task that initially receives a training dataset. Then, after learning, it receives data deletion requests from a subset of users (of arbitrary size), and the goal of unlearning is to perform the task as if the learner never received the data of deleted users. In this paper, we ask how many bits of storage are needed to be able to delete certain training samples at a later time. We focus on the task of realizability testing, where the goal is to check whether the remaining training samples are realizable within a given hypothesis class \(\mathcal{H}\). Toward that end, we first provide a negative result showing that the VC dimension is not a characterization of the space complexity of unlearning. In particular, we provide a hypothesis class with constant VC dimension (and Littlestone dimension), but for which any unlearning algorithm for realizability testing needs to store \(Ω(n)\)-bits, where \(n\) denotes the size of the initial training dataset. In fact, we provide a stronger separation by showing that for any hypothesis class \(\mathcal{H}\), the amount of information that the learner needs to store, so as to perform unlearning later, is lower bounded by the \textit{eluder dimension} of \(\mathcal{H}\), a combinatorial notion always larger than the VC dimension. We complement the lower bound with an upper bound in terms of the star number of the underlying hypothesis class, albeit in a stronger ticketed-memory model proposed by Ghazi et al. (2023). Since the star number for a hypothesis class is never larger than its Eluder dimension, our work highlights a fundamental separation between central and ticketed memory models for machine unlearning.
Quantum AIXI: Universal Intelligence via Quantum Information
AIXI is a widely studied model of artificial general intelligence (AGI) based upon principles of induction and reinforcement learning. However, AIXI is fundamentally classical in nature - as are the environments in which it is modelled. Given the universe is quantum mechanical in nature and the exponential overhead required to simulate quantum mechanical systems classically, the question arises as to whether there are quantum mechanical analogues of AIXI. To address this question, we extend the framework to quantum information and present Quantum AIXI (QAIXI). We introduce a model of quantum agent/environment interaction based upon quantum and classical registers and channels, showing how quantum AIXI agents may take both classical and quantum actions. We formulate the key components of AIXI in quantum information terms, extending previous research on quantum Kolmogorov complexity and a QAIXI value function. We discuss conditions and limitations upon quantum Solomonoff induction and show how contextuality fundamentally affects QAIXI models.