Computational Learning Theory
Too much information: CDCL solvers need to forget and perform restarts
Krüger, Tom, Lorenz, Jan-Hendrik, Wörz, Florian
Conflict-driven clause learning (CDCL) is a remarkably successful paradigm for solving the satisfiability problem of propositional logic. Instead of a simple depth-first backtracking approach, this kind of solver learns the reason behind occurring conflicts in the form of additional clauses. However, despite the enormous success of CDCL solvers, there is still only a shallow understanding of what influences the performance of these solvers in what way. This paper will demonstrate, quite surprisingly, that clause learning (without being able to get rid of some clauses) can not only improve the runtime but can oftentimes deteriorate it dramatically. By conducting extensive empirical analysis, we find that the runtime distributions of CDCL solvers are multimodal. This multimodality can be seen as a reason for the deterioration phenomenon described above. Simultaneously, it also gives an indication of why clause learning in combination with clause deletion and restarts is virtually the de facto standard of SAT solving in spite of this phenomenon. As a final contribution, we will show that Weibull mixture distributions can accurately describe the multimodal distributions. Thus, adding new clauses to a base instance has an inherent effect of making runtimes long-tailed. This insight provides a theoretical explanation as to why the techniques of restarts and clause deletion are useful in CDCL solvers.
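To make the Weibull mixture idea concrete, here is a minimal sketch of a two-component mixture of Weibull distributions, with its cumulative distribution function and a sampler. The weights, shapes, and scales are hypothetical values chosen for illustration, not parameters fitted in the paper. A component with shape below 1 produces the long tail that restarts are meant to cut off.

```python
import math
import random


def weibull_cdf(t, shape, scale):
    """CDF of a Weibull(shape, scale) distribution: 1 - exp(-(t/scale)^shape)."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / scale) ** shape))


def mixture_cdf(t, components):
    """CDF of a mixture; components is a list of (weight, shape, scale)
    tuples whose weights are assumed to sum to 1."""
    return sum(w * weibull_cdf(t, k, lam) for w, k, lam in components)


def sample_mixture(components, rng):
    """Draw one runtime: pick a component by weight, then sample from it."""
    u = rng.random()
    acc = 0.0
    for w, k, lam in components:
        acc += w
        if u <= acc:
            # random.weibullvariate takes (scale, shape) in that order.
            return rng.weibullvariate(lam, k)
    # Guard against floating-point rounding in the accumulated weights.
    _, k, lam = components[-1]
    return rng.weibullvariate(lam, k)


# Hypothetical parameters: a fast mode and a slow, long-tailed mode.
components = [(0.7, 1.5, 10.0), (0.3, 0.6, 500.0)]

rng = random.Random(0)
runtimes = [sample_mixture(components, rng) for _ in range(1000)]
print("P(runtime <= 100) =", round(mixture_cdf(100.0, components), 3))
print("longest sampled runtime:", round(max(runtimes), 1))
```

Under these assumed parameters, a noticeable share of the probability mass lies far beyond the fast mode, which is the situation in which abandoning a long run and restarting can pay off.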
Quantifying Relevance in Learning and Inference
Marsili, Matteo, Roudi, Yasser
Learning is a distinctive feature of intelligent behaviour. High-throughput experimental data and Big Data promise to open new windows on complex systems such as cells, the brain or our societies. Yet, the puzzling success of Artificial Intelligence and Machine Learning shows that we still have a poor conceptual understanding of learning. These applications push statistical inference into uncharted territories where data is high-dimensional and scarce, and prior information on "true" models is scant if not totally absent. Here we review recent progress on understanding learning, based on the notion of "relevance". The relevance, as we define it here, quantifies the amount of information that a dataset or the internal representation of a learning machine contains on the generative model of the data. This allows us to define maximally informative samples, on one hand, and optimal learning machines on the other. These are ideal limits of samples and of machines, that contain the maximal amount of information about the unknown generative process, at a given resolution (or level of compression). Both ideal limits exhibit critical features in the statistical sense: Maximally informative samples are characterised by a power-law frequency distribution (statistical criticality) and optimal learning machines by an anomalously large susceptibility. The trade-off between resolution (i.e. compression) and relevance distinguishes the regime of noisy representations from that of lossy compression. These are separated by a special point characterised by Zipf's law statistics. This identifies samples obeying Zipf's law as the most compressed loss-less representations that are optimal in the sense of maximal relevance. Criticality in optimal learning machines manifests in an exponential degeneracy of energy levels, that leads to unusual thermodynamic properties.
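The abstract does not state formulas for resolution and relevance. As a minimal sketch, the code below assumes the empirical-entropy definitions commonly used in this line of work: for a sample of N observations in which state s occurs k_s times and m_k states occur exactly k times, the resolution is the entropy of the state frequencies k_s/N and the relevance is the entropy of the frequency classes k·m_k/N. The function name is hypothetical.

```python
import math
from collections import Counter


def resolution_and_relevance(sample):
    """Return (H[s], H[k]) in nats for a list of hashable observations.

    H[s]: entropy of the empirical state distribution k_s / N (resolution).
    H[k]: entropy of the distribution k * m_k / N over observed
          frequencies k (relevance), under the assumed definitions.
    """
    n = len(sample)
    counts = Counter(sample)            # k_s for each observed state s
    multiplicity = Counter(counts.values())  # m_k: number of states seen k times
    h_s = -sum((k / n) * math.log(k / n) for k in counts.values())
    h_k = -sum(
        (k * m / n) * math.log(k * m / n) for k, m in multiplicity.items()
    )
    return h_s, h_k


# A sample where every state is distinct has maximal resolution but
# zero relevance: all states fall into the single frequency class k = 1.
print(resolution_and_relevance(["a", "b", "c", "d"]))
# A sample with varied frequencies has non-zero relevance.
print(resolution_and_relevance(["a", "a", "a", "b", "b", "c"]))
```

Because the frequency k is a function of the state s, the relevance computed this way never exceeds the resolution.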
Active Learning Polynomial Threshold Functions
Ben-Eliezer, Omri, Hopkins, Max, Yang, Chutong, Yu, Hantao
We initiate the study of active learning polynomial threshold functions (PTFs). While traditional lower bounds imply that even univariate quadratics cannot be non-trivially actively learned, we show that allowing the learner basic access to the derivatives of the underlying classifier circumvents this issue and leads to a computationally efficient algorithm for active learning degree-$d$ univariate PTFs in $\tilde{O}(d^3\log(1/\varepsilon\delta))$ queries. We also provide near-optimal algorithms and analyses for active learning PTFs in several average case settings. Finally, we prove that access to derivatives is insufficient for active learning multivariate PTFs, even those of just two variables.
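For context, the degree-1 case (a single threshold on the line) is the classic setting where active learning does help: binary search over label queries locates the threshold to accuracy ε in about log2(1/ε) queries. The sketch below illustrates only that baseline. It is not the derivative-based algorithm of the paper, and the oracle and function names are hypothetical.

```python
def learn_threshold(label, eps, lo=0.0, hi=1.0):
    """Estimate a hidden threshold t in [lo, hi] by binary search.

    label(x) is an assumed query oracle returning +1 if x >= t, else -1.
    Returns (estimate, number_of_queries).
    """
    queries = 0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        queries += 1
        if label(mid) > 0:
            hi = mid   # mid >= t, so t lies in [lo, mid]
        else:
            lo = mid   # mid < t, so t lies in (mid, hi]
    return (lo + hi) / 2, queries


hidden_t = 0.3141  # hypothetical target, unknown to the learner


def oracle(x):
    """Label oracle for the hidden threshold."""
    return 1 if x >= hidden_t else -1


estimate, used = learn_threshold(oracle, eps=1e-3)
print(f"estimate={estimate:.4f} after {used} queries")
```

Each query halves the interval known to contain the threshold, which is where the logarithmic query count comes from. A quadratic already has two sign changes, so this simple interval-halving argument no longer applies directly.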
Hyperplane bounds for neural feature mappings
When minimising the empirical risk, the generalisation of the learnt function still depends on the performance on the training data, the Vapnik-Chervonenkis (VC) dimension of the function and the number of training examples. Neural networks have a large number of parameters, which correlates with their VC-dimension that is typically large but not infinite, and typically a large number of training instances are needed to effectively train them. In this work, we explore how to optimize feature mappings using a neural network with the intention to reduce the effective VC-dimension of the hyperplane found in the space generated by the mapping. An interpretation of the results of this study is that it is possible to define a loss that controls the VC-dimension of the separating hyperplane. We evaluate this approach and observe that the performance when using this method improves when the size of the training set is small.
Exact learning for infinite families of concepts
In this paper, based on results of exact learning, test theory, and rough set theory, we study arbitrary infinite families of concepts each of which consists of an infinite set of elements and an infinite set of subsets of this set called concepts. We consider the notion of a problem over a family of concepts that is described by a finite number of elements: for a given concept, we should recognize which of the elements under consideration belong to this concept. As algorithms for problem solving, we consider decision trees of five types: (i) using membership queries, (ii) using equivalence queries, (iii) using both membership and equivalence queries, (iv) using proper equivalence queries, and (v) using both membership and proper equivalence queries. As time complexity, we study the depth of decision trees. In the worst case, with the growth of the number of elements in the problem description, the minimum depth of decision trees of the first type either grows as a logarithm or linearly, and the minimum depth of decision trees of each of the other types either is bounded from above by a constant or grows as a logarithm, or linearly. The obtained results allow us to distinguish seven complexity classes of infinite families of concepts.
Pinaki Laskar on LinkedIn: #AI #MachineLearning #DeepLearning
AI Researcher, Cognitive Technologist Inventor - AI Thinking, Think Chain Innovator - AIOT, XAI, Autonomous Cars, IIOT Founder Fisheyebox Spatial Computing Savant, Transformative Leader, Industry X.0 Practitioner
Meta-AI is about modeling and simulating reality, causality, and mentality by digital technologies. Its key source of data is science as the sum of universal knowledge: all the world's information, coordinated and systematized. Science is typically divided into three major branches: the natural sciences (e.g., biology, chemistry, and physics), which study nature in the broadest sense; the social sciences (e.g., economics, psychology, and sociology), which study individuals and societies; and the formal sciences (e.g., logic, mathematics, and theoretical computer science), which deal with symbols governed by rules. Under empiricism, which states that knowledge comes only or primarily from sensory experience, both the philosophical sciences and the formal sciences, including mathematics, fall outside science, as they do not rely on empirical evidence. Plainly, data, information, or knowledge have real value only if coordinated, systematized, and organized. Drawing on pattern recognition and computational learning theory, Meta-ML is dedicated to the study of problem-solving by computer programs in general, enabling computers to reason about the world and learn from data, and to interact effectively with any reality: physical, mental, social, digital, or virtual. AI/ML/DL modelling should include the following necessary features. Meta-physical assumptions: prior knowledge, the basis of our knowing, understanding, or thinking about the whole world or a domain problem (primary causes, principles and elements).
ML Supported Predictions for SAT Solvers Performance
Leventi-Peetz, A. -M., Peetz, Jörg-Volker, Rohde, Martina
In order to classify the nondeterministic termination behavior of the open source SAT solver CryptoMiniSat in multi-threading mode while processing hard-to-solve Boolean satisfiability problem instances, internal solver runtime parameters have been collected and analyzed. A subset of these parameters has been selected and employed as a feature vector to successfully create a machine learning model for the binary classification of the solver's termination behavior with any single new solving run of a not yet solved instance. The model can be used for the early estimation of a solving attempt as belonging or not belonging to the class of candidates with good chances for a fast termination. In this context, a combination of active profiles of runtime characteristics appears to mirror the influence of the solver's momentary heuristics on the immediate quality of the solver's resolution process. Because the runtime parameters of just the first two solving iterations are enough to forecast termination of the attempt with good success scores, the results of the present work deliver a promising basis which can be further developed in order to enrich CryptoMiniSat, or generally any modern SAT solver, with AI abilities.
Teaching an Active Learner with Contrastive Examples
Wang, Chaoqi, Singla, Adish, Chen, Yuxin
We study the problem of active learning with the added twist that the learner is assisted by a helpful teacher. We consider the following natural interaction protocol: At each round, the learner proposes a query asking for the label of an instance $x^q$, and the teacher provides the requested label $\{x^q, y^q\}$ along with explanatory information to guide the learning process. In this paper, we view this information in the form of an additional contrastive example ($\{x^c, y^c\}$) where $x^c$ is picked from a set constrained by $x^q$ (e.g., dissimilar instances with the same label). Our focus is to design a teaching algorithm that can provide an informative sequence of contrastive examples to the learner to speed up the learning process. We show that this leads to a challenging sequence optimization problem where the algorithm's choices at a given round depend on the history of interactions. We investigate an efficient teaching algorithm that adaptively picks these contrastive examples. We derive strong performance guarantees for our algorithm based on two problem-dependent parameters and further show that for specific types of active learners (e.g., a generalized binary search learner), the proposed teaching algorithm exhibits strong approximation guarantees. Finally, we illustrate our bounds and demonstrate the effectiveness of our teaching framework via two numerical case studies.
Effective dimension of machine learning models
Abbas, Amira, Sutter, David, Figalli, Alessio, Woerner, Stefan
Making statements about the performance of trained models on tasks involving new data is one of the primary goals of machine learning, i.e., to understand the generalization power of a model. Various capacity measures try to capture this ability, but usually fall short in explaining important characteristics of models that we observe in practice. In this study, we propose the local effective dimension as a capacity measure which seems to correlate well with generalization error on standard data sets. Importantly, we prove that the local effective dimension bounds the generalization error and discuss the aptness of this capacity measure for machine learning models.
First Steps of an Approach to the ARC Challenge based on Descriptive Grid Models and the Minimum Description Length Principle
The Abstraction and Reasoning Corpus (ARC) was recently introduced by François Chollet as a tool to measure broad intelligence in both humans and machines. It is very challenging, and the best approach in a Kaggle competition could only solve 20% of the tasks, relying on brute-force search for chains of hand-crafted transformations. In this paper, we present the first steps exploring an approach based on descriptive grid models and the Minimum Description Length (MDL) principle. The grid models describe the contents of a grid, and support both parsing grids and generating grids. The MDL principle is used to guide the search for good models, i.e. models that compress the grids the most. We report on our progress over a year, improving on the general approach and the models. Out of the 400 training tasks, our performance increased from 5 to 29 solved tasks, only using 30s computation time per task. Our approach not only predicts the output grids, but also outputs an intelligible model and explanations for how the model was incrementally built.