Computational Learning Theory
On characterizing optimal learning trajectories in a class of learning problems
In this brief paper, we provide a mathematical framework that exploits the relationship between the maximum principle and dynamic programming for characterizing optimal learning trajectories in a class of learning problem, which is related to point estimations for modeling of high-dimensional nonlinear functions. Here, such characterization for the optimal learning trajectories is associated with the solution of an optimal control problem for a weakly-controlled gradient system with small parameters, whose time-evolution is guided by a model training dataset and its perturbed version, while the optimization problem consists of a cost functional that summarizes how to gauge the quality/performance of the estimated model parameters at a certain fixed final time w.r.t. a model validating dataset. Moreover, using a successive Galerkin approximation method, we provide an algorithmic recipe how to construct the corresponding optimal learning trajectories leading to the optimal estimated model parameters for such a class of learning problem.
Blackwell's Approachability with Approximation Algorithms
We revisit Blackwell's celebrated approachability problem which considers a repeated vector-valued game between a player and an adversary. Motivated by settings in which the action set of the player or adversary (or both) is difficult to optimize over, for instance when it corresponds to the set of all possible solutions to some NP-Hard optimization problem, we ask what can the player guarantee \textit{efficiently}, when only having access to these sets via approximation algorithms with ratios $\alpha_{\mX} \geq 1$ and $ 1 \geq \alpha_{\mY} > 0$, respectively. Assuming the player has monotone preferences, in the sense that he does not prefer a vector-valued loss $\ell_1$ over $\ell_2$ if $\ell_2 \leq \ell_1$, we establish that given a Blackwell instance with an approachable target set $S$, the downward closure of the appropriately-scaled set $\alpha_{\mX}\alpha_{\mY}^{-1}S$ is \textit{efficiently} approachable with optimal rate. In case only the player's or adversary's set is equipped with an approximation algorithm, we give simpler and more efficient algorithms.
Sample Complexity of Bias Detection with Subsampled Point-to-Subspace Distances
Matilla, German Martinez, Marecek, Jakub
Sample complexity of bias estimation is a lower bound on the runtime of any bias detection method. Many regulatory frameworks require the bias to be tested for all subgroups, whose number grows exponentially with the number of protected attributes. Unless one wishes to run a bias detection with a doubly-exponential run-time, one should like to have polynomial complexity of bias detection for a single subgroup. At the same time, the reference data may be based on surveys, and thus come with non-trivial uncertainty. Here, we reformulate bias detection as a point-to-subspace problem on the space of measures and show that, for supremum norm, it can be subsampled efficiently. In particular, our probabilistically approximately correct (PAC) results are corroborated by tests on well-known instances.
Networks with Finite VC Dimension: Pro and Contra
Kurkova, Vera, Sanguineti, Marcello
Approximation and learning of classifiers of large data sets by neural networks in terms of high-dimensional geometry and statistical learning theory are investigated. The influence of the VC dimension of sets of input-output functions of networks on approximation capabilities is compared with its influence on consistency in learning from samples of data. It is shown that, whereas finite VC dimension is desirable for uniform convergence of empirical errors, it may not be desirable for approximation of functions drawn from a probability distribution modeling the likelihood that they occur in a given type of application. Based on the concentration-of-measure properties of high dimensional geometry, it is proven that both errors in approximation and empirical errors behave almost deterministically for networks implementing sets of input-output functions with finite VC dimensions in processing large data sets. Practical limitations of the universal approximation property, the trade-offs between the accuracy of approximation and consistency in learning from data, and the influence of depth of networks with ReLU units on their accuracy and consistency are discussed.
Deeply Optimizing the SAT Solver for the IC3 Algorithm
Su, Yuheng, Yang, Qiusong, Ci, Yiwei, Li, Yingcheng, Bu, Tianjun, Huang, Ziyu
The IC3 algorithm, also known as PDR, is a SAT-based model checking algorithm that has significantly influenced the field in recent years due to its efficiency, scalability, and completeness. It utilizes SAT solvers to solve a series of SAT queries associated with relative induction. In this paper, we introduce several optimizations for the SAT solver in IC3 based on our observations of the unique characteristics of these SAT queries. By observing that SAT queries do not necessarily require decisions on all variables, we compute a subset of variables that need to be decided before each solving process while ensuring that the result remains unaffected. Additionally, noting that the overhead of binary heap operations in VSIDS is non-negligible, we replace the binary heap with buckets to achieve constant-time operations. Furthermore, we support temporary clauses without the need to allocate a new activation variable for each solving process, thereby eliminating the need to reset solvers. We developed a novel lightweight CDCL SAT solver, GipSAT, which integrates these optimizations. A comprehensive evaluation highlights the performance improvements achieved by GipSAT. Specifically, the GipSAT-based IC3 demonstrates an average speedup of 3.61 times in solving time compared to the IC3 implementation based on MiniSat.
PAC Learning is just Bipartite Matching (Sort of)
The main goal of this article is to convince you, the reader, that supervised learning in the Probably Approximately Correct (PAC) model is closely related to -- of all things -- bipartite matching! En-route from PAC learning to bipartite matching, I will overview a particular transductive model of learning, and associated one-inclusion graphs, which can be viewed as a generalization of some of the hat puzzles that are popular in recreational mathematics. Whereas this transductive model is far from new, it has recently seen a resurgence of interest as a tool for tackling deep questions in learning theory. A secondary purpose of this article could be as a (biased) tutorial on the connections between the PAC and transductive models of learning.
Model Successor Functions
Chang, Yingshan, Bisk, Yonatan
The notion of generalization has moved away from the classical one defined in statistical learning theory towards an emphasis on out-of-domain generalization (OODG). Recently, there is a growing focus on inductive generalization, where a progression of difficulty implicitly governs the direction of domain shifts. In inductive generalization, it is often assumed that the training data lie in the easier side, while the testing data lie in the harder side. The challenge is that training data are always finite, but a learner is expected to infer an inductive principle that could be applied in an unbounded manner. This emerging regime has appeared in the literature under different names, such as length/logical/algorithmic extrapolation, but a formal definition is lacking. This work provides such a formalization that centers on the concept of model successors. Then we outline directions to adapt well-established techniques towards the learning of model successors. This work calls for restructuring of the research discussion around inductive generalization from fragmented task-centric communities to a more unified effort, focused on universal properties of learning and computation.
Near-Optimal Algorithms for Omniprediction
Okoroafor, Princewill, Kleinberg, Robert, Kim, Michael P.
Omnipredictors are simple prediction functions that encode loss-minimizing predictions with respect to a hypothesis class $\mathcal{H}$, simultaneously for every loss function within a class of losses $\mathcal{L}$. In this work, we give near-optimal learning algorithms for omniprediction, in both the online and offline settings. To begin, we give an oracle-efficient online learning algorithm that acheives $(\mathcal{L},\mathcal{H})$-omniprediction with $\tilde{O}(\sqrt{T \log |\mathcal{H}|})$ regret for any class of Lipschitz loss functions $\mathcal{L} \subseteq \mathcal{L}_\mathrm{Lip}$. Quite surprisingly, this regret bound matches the optimal regret for \emph{minimization of a single loss function} (up to a $\sqrt{\log(T)}$ factor). Given this online algorithm, we develop an online-to-offline conversion that achieves near-optimal complexity across a number of measures. In particular, for all bounded loss functions within the class of Bounded Variation losses $\mathcal{L}_\mathrm{BV}$ (which include all convex, all Lipschitz, and all proper losses) and any (possibly-infinite) $\mathcal{H}$, we obtain an offline learning algorithm that, leveraging an (offline) ERM oracle and $m$ samples from $\mathcal{D}$, returns an efficient $(\mathcal{L}_{\mathrm{BV}},\mathcal{H},\varepsilon(m))$-omnipredictor for $\varepsilon(m)$ scaling near-linearly in the Rademacher complexity of $\mathrm{Th} \circ \mathcal{H}$.
Review for NeurIPS paper: Reducing Adversarially Robust Learning to Non-Robust PAC Learning
Summary and Contributions: Robust learning has got a lot of attention over the last decade. This paper is about *test-time attacks (called evasion attacks or attacks finding "adversarial examples"). Such attacks are studied widely from experimental point of view, but more recently theoretical results are being proven for understanding how, and when, learning under such adversarial perturbations are possible. Montasser et al (MHS COLT 19) showed that if a problem is PAC learnable, i.e., has finite VC dimension d, then it is also robustly PAC learnable, though the sample complexity (in their proof) could be as large as exponential(d). This paper's main contribution is an alternative proof of the result of MHS, by showing how to reduce the task of robust learning to the task of normal PAC learning.
Review for NeurIPS paper: Reducing Adversarially Robust Learning to Non-Robust PAC Learning
All the reviewers agreed that the paper provides a novel and important theoretical result. Specifically, one of the main contributions of the paper is to show that if a problem is PAC learnable, i.e., has finite VC dimension d, then it i also robustly PAC learnable. The result had already been proven last year, however, this paper provides a novel framework to prove it by showing how to reduce the task of robust learning to the task of normal PAC learning (it should be noted that the reduction may not be efficient (i.e., polynomial time) but still is important from the theoretical and statistical perspective. The reviewers also had some suggestions for the revised version of the paper which are reflected in the'r updated reviews.