Computational Learning Theory: Overviews

Nonparametric Quantile-Based Causal Discovery

Machine Learning

Telling cause from effect using observational data is a challenging problem, especially in the bivariate case. Contemporary methods often assume independence between the cause and the mechanism that generates the effect given the cause. From this postulate, they derive asymmetries to uncover causal relationships. In this work, we propose such an approach, based on the link between Kolmogorov complexity and quantile scoring. We use a nonparametric conditional quantile estimator based on copulas to implement our procedure, thus avoiding restrictive assumptions about the joint distribution of cause and effect. In an extensive study on real and synthetic data, we show that quantile copula causal discovery (QCCD) compares favorably to state-of-the-art methods, while at the same time being computationally efficient and scalable.
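Quantile scoring is commonly done with the pinball (check) loss. The sketch below shows that loss in Python. The function names are hypothetical, and it is an assumption that this is the score QCCD relies on; the sketch does not implement the copula-based quantile estimator or the causal decision rule.

```python
def pinball_loss(y, q, tau):
    """Pinball (check) loss of predicting quantile value q for observation y at level tau."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    diff = y - q
    # Under-prediction (y >= q) is weighted by tau, over-prediction by (1 - tau).
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def mean_quantile_score(ys, qs, tau):
    """Average pinball loss over paired observations and predicted quantiles."""
    if len(ys) != len(qs) or not ys:
        raise ValueError("ys and qs must be non-empty and of equal length")
    return sum(pinball_loss(y, q, tau) for y, q in zip(ys, qs)) / len(ys)
```

A lower average score means the predicted quantiles fit the observations better at the chosen level tau.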

SAT Competition 2016: Recent Developments

AAAI Conferences

We give an overview of SAT Competition 2016, the 2016 edition of the famous competition for Boolean satisfiability (SAT) solvers with over 20 years of history. A key aim is to point out "what's hot" in SAT competitions in 2016, i.e., new developments in the competition series, including new competition tracks and new solver techniques implemented in some of the award-winning solvers.

Clustering with Same-Cluster Queries

Neural Information Processing Systems

We propose a Semi-Supervised Active Clustering (SSAC) framework, where the learner is allowed to interact with a domain expert, asking whether two given instances belong to the same cluster or not. We study the query and computational complexity of clustering in this framework. We consider a setting where the expert conforms to a center-based clustering with a notion of margin. We show that there is a trade-off between computational complexity and query complexity: we prove that for the case of $k$-means clustering (i.e., when the expert conforms to a solution of $k$-means), having access to relatively few such queries allows efficient solutions to otherwise NP-hard problems. In particular, we provide a probabilistic polynomial-time (BPP) algorithm for clustering in this setting that asks $O\big(k^2\log k + k\log n\big)$ same-cluster queries and runs with time complexity $O\big(kn\log n\big)$ (where $k$ is the number of clusters and $n$ is the number of instances). The success of the algorithm is guaranteed for data satisfying the margin condition under which, without queries, we show that the problem is NP-hard. We also prove a lower bound on the number of queries needed to have a computationally efficient clustering algorithm in this setting.
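To make the same-cluster query interface concrete, here is a minimal Python sketch of a naive baseline. It is not the algorithm from the abstract: it keeps one representative per discovered cluster and may ask up to $k$ queries per instance, so it uses $O(kn)$ queries rather than $O(k^2\log k + k\log n)$. The names `cluster_with_queries` and `same_cluster` are hypothetical.

```python
def cluster_with_queries(points, same_cluster):
    """Partition points using only a same-cluster oracle (naive baseline).

    same_cluster(a, b) is assumed to return True iff the expert places
    a and b in the same cluster. Returns (clusters, number_of_queries).
    """
    representatives = []  # one known member per discovered cluster
    clusters = []         # clusters[i] holds the members for representatives[i]
    queries = 0
    for p in points:
        for idx, rep in enumerate(representatives):
            queries += 1
            if same_cluster(p, rep):
                clusters[idx].append(p)
                break
        else:
            # No existing representative matched, so p starts a new cluster.
            representatives.append(p)
            clusters.append([p])
    return clusters, queries


# Usage with a hidden labelling standing in for the domain expert.
hidden = {0: "x", 1: "x", 2: "y", 3: "x", 4: "y"}
found, used = cluster_with_queries([0, 1, 2, 3, 4], lambda a, b: hidden[a] == hidden[b])
```

The baseline shows why query count matters: every instance is compared against representatives one by one, which is the cost the margin-based algorithm in the abstract is designed to reduce.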

Cross: Efficient Low-rank Tensor Completion

Machine Learning

The completion of tensors, or high-order arrays, has attracted significant attention in recent research. Current literature on tensor completion primarily focuses on recovery from a set of uniformly randomly measured entries, and the required number of measurements to achieve recovery is not guaranteed to be optimal. In addition, the implementation of some previous methods is NP-hard. In this article, we propose a framework for low-rank tensor completion via a novel tensor measurement scheme we name Cross. The proposed procedure is efficient and easy to implement. In particular, we show that a third-order tensor of Tucker rank-$(r_1, r_2, r_3)$ in $p_1$-by-$p_2$-by-$p_3$ dimensional space can be recovered from as few as $r_1r_2r_3 + r_1(p_1-r_1) + r_2(p_2-r_2) + r_3(p_3-r_3)$ noiseless measurements, which matches the sample complexity lower bound. In the case of noisy measurements, we also develop a theoretical upper bound and the matching minimax lower bound for recovery error over certain classes of low-rank tensors for the proposed procedure. The results can be further extended to fourth or higher-order tensors. Simulation studies show that the method performs well under a variety of settings. Finally, the procedure is illustrated through a real dataset in neuroimaging.
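As a quick illustration of the measurement count quoted above, the following Python sketch evaluates $r_1r_2r_3 + r_1(p_1-r_1) + r_2(p_2-r_2) + r_3(p_3-r_3)$ for a third-order tensor. The function name is hypothetical, and the code only evaluates the formula; it does not implement the Cross measurement scheme or the recovery procedure.

```python
def cross_measurement_count(ranks, dims):
    """Noiseless measurement count from the abstract, third-order case only."""
    r1, r2, r3 = ranks
    p1, p2, p3 = dims
    for r, p in zip(ranks, dims):
        if not 0 < r <= p:
            raise ValueError("each rank must satisfy 0 < r <= p")
    core = r1 * r2 * r3                                   # r1 * r2 * r3 term
    arms = r1 * (p1 - r1) + r2 * (p2 - r2) + r3 * (p3 - r3)  # one term per mode
    return core + arms


# A 10-by-10-by-10 tensor of Tucker rank (2, 2, 2) has 1000 entries.
needed = cross_measurement_count((2, 2, 2), (10, 10, 10))  # 8 + 3 * 16 = 56
```

In this example 56 measurements suffice under the abstract's formula, compared with 1000 entries in the full tensor.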

SAT-based Analysis of Large Real-world Feature Models is Easy

Artificial Intelligence

Modern conflict-driven clause-learning (CDCL) Boolean SAT solvers provide efficient automatic analysis of real-world feature models (FMs) of systems ranging from cars to operating systems. It is well known that solver-based analysis of real-world FMs scales very well even though SAT instances obtained from such FMs are large, and the corresponding analysis problems are known to be NP-complete. To better understand why SAT solvers are so effective, we systematically studied many syntactic and semantic characteristics of a representative set of large real-world FMs. We discovered that a key reason why large real-world FMs are easy to analyze is that the vast majority of the variables in these models are unrestricted, i.e., the models are satisfiable for both true and false assignments to such variables under the current partial assignment. Given this discovery and our understanding of CDCL SAT solvers, we show that solvers can easily find satisfying assignments for such models without too many backtracks relative to the model size, explaining why solvers scale so well. Further analysis showed that the presence of unrestricted variables in these real-world models can be attributed to their high degree of variability. Additionally, we experimented with a series of well-known non-backtracking simplifications that are particularly effective in solving FMs. The remaining variables/clauses after simplifications, called the core, are so few that they are easily solved even with backtracking, further strengthening our conclusions.

SAT Modulo Monotonic Theories

AAAI Conferences

Boolean satisfiability (SAT) solvers have been successfully applied to a wide variety of difficult combinatorial problems. Many further problems can be solved by SAT Modulo Theory (SMT) solvers, which extend SAT solvers to handle additional types of constraints. However, building efficient SMT solvers is often very difficult. In this paper, we define the concept of a Boolean monotonic theory and show how to easily build efficient SMT solvers, including effective theory propagation and clause learning, for such theories. We present examples showing useful constraints that are monotonic, including many graph properties (e.g., shortest paths), and geometric properties (e.g., convex hulls). These constraints arise in problems that are otherwise difficult for SAT solvers to handle, such as procedural content generation. We have implemented several monotonic theory solvers using the techniques we present in this paper and applied these to content generation problems, demonstrating major speed-ups over SAT, SMT, and Answer Set Programming solvers, easily solving instances that were previously out of reach.

Towards Ultra Rapid Restarts

Artificial Intelligence

We observe a trend regarding restart strategies used in SAT solvers. A few years ago, most state-of-the-art solvers restarted on average after a few thousand backtracks. Currently, restarting after a dozen backtracks results in much better performance. The main reason for this trend is that heuristics and data structures have become more restart-friendly. We expect this trend to continue, so future SAT solvers will restart even more rapidly. Additionally, we present experimental results to support our observations.

Finding the True Frequent Itemsets

Machine Learning

Frequent Itemsets (FIs) mining is a fundamental primitive in data mining. It requires identifying all itemsets appearing in at least a fraction $\theta$ of a transactional dataset $\mathcal{D}$. Often though, the ultimate goal of mining $\mathcal{D}$ is not an analysis of the dataset per se, but the understanding of the underlying process that generated it. Specifically, in many applications $\mathcal{D}$ is a collection of samples obtained from an unknown probability distribution $\pi$ on transactions, and by extracting the FIs in $\mathcal{D}$ one attempts to infer itemsets that are frequently (i.e., with probability at least $\theta$) generated by $\pi$, which we call the True Frequent Itemsets (TFIs). Due to the inherently stochastic nature of the generative process, the set of FIs is only a rough approximation of the set of TFIs, as it often contains a huge number of false positives, i.e., spurious itemsets that are not among the TFIs. In this work we design and analyze an algorithm to identify a threshold $\hat{\theta}$ such that the collection of itemsets with frequency at least $\hat{\theta}$ in $\mathcal{D}$ contains only TFIs with probability at least $1-\delta$, for some user-specified $\delta$. Our method uses results from statistical learning theory involving the (empirical) VC-dimension of the problem at hand. This allows us to identify almost all the TFIs without including any false positives. We also experimentally compare our method with the direct mining of $\mathcal{D}$ at frequency $\theta$ and with techniques based on widely used standard bounds (i.e., the Chernoff bounds) of the binomial distribution, and show that our algorithm outperforms these methods and achieves even better results than what is guaranteed by the theoretical analysis.
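For readers unfamiliar with the baseline task, the following Python sketch mines the FIs of a small dataset directly at a given frequency threshold. It is a brute-force illustration with hypothetical names, exponential in transaction length, and it does not compute the threshold $\hat{\theta}$ or use the VC-dimension argument described in the abstract.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, theta):
    """Return {itemset: frequency} for itemsets with frequency >= theta.

    Itemsets are represented as sorted tuples of items. Every non-empty
    subset of every transaction is enumerated, so this only suits tiny inputs.
    """
    if not transactions:
        return {}
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                counts[subset] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= theta}


dataset = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
fis = frequent_itemsets(dataset, theta=0.5)
```

Here the itemsets ("a",), ("b",), ("c",), ("a", "b") and ("a", "c") are reported. Those sitting exactly at the threshold are the kind of borderline cases that a stricter threshold $\hat{\theta} > \theta$ would exclude.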

The International SAT Solver Competitions

AI Magazine

The International SAT Solver Competition is today an established series of competitive events aiming at objectively evaluating the progress in state-of-the-art procedures for solving Boolean satisfiability (SAT) instances. Over the years, the competitions have significantly contributed to the fast progress in SAT solver technology that has made SAT a practical success story of computer science. This short article provides an overview of the SAT solver competitions.

Fast d-DNNF Compilation with sharpSAT

AAAI Conferences

Knowledge compilation is a valuable tool for dealing with the computational intractability of propositional reasoning. In knowledge compilation, a representation in a source language is typically compiled into a target language in order to perform some reasoning task in polynomial time. One particularly popular target language is Deterministic Decomposable Negation Normal Form (d-DNNF). d-DNNF supports efficient reasoning for tasks such as consistency checking and model counting, and as such it has proven a useful representation language for Bayesian inference, conformant planning, and diagnosis. In this paper, we exploit recent advances in #SAT solving in order to produce a new state-of-the-art CNF → d-DNNF compiler. We evaluate the properties and performance of our compiler relative to C2D, the de facto standard for compiling to d-DNNF. Empirical results demonstrate that our compiler is generally one order of magnitude faster than C2D on typical benchmark problems while yielding a d-DNNF representation of comparable size.
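The claim that d-DNNF supports efficient model counting can be illustrated with a short Python sketch. It assumes the formula is also smooth (all children of an OR node mention the same variables), which this code does not check, and it counts models over the variables that appear in the formula. The tuple encoding of nodes is a hypothetical choice for illustration, not the format used by C2D or the compiler described above.

```python
from math import prod


def count_models(node):
    """Count models of a smooth d-DNNF given as nested tuples.

    Node encodings (assumed for this sketch):
      ("lit", variable, polarity)  a literal
      ("and", [children])          children share no variables (decomposable)
      ("or", [children])           children are mutually exclusive (deterministic)
    """
    kind = node[0]
    if kind == "lit":
        return 1
    if kind == "and":
        # Decomposability: counts of independent parts multiply.
        return prod(count_models(child) for child in node[1])
    if kind == "or":
        # Determinism and smoothness: counts of disjoint cases add.
        return sum(count_models(child) for child in node[1])
    raise ValueError(f"unknown node kind: {kind!r}")


# ((x1 AND x2) OR (NOT x1 AND NOT x2)) AND (x3 OR NOT x3)
equal = ("or", [("and", [("lit", 1, True), ("lit", 2, True)]),
                ("and", [("lit", 1, False), ("lit", 2, False)])])
free3 = ("or", [("lit", 3, True), ("lit", 3, False)])
models = count_models(("and", [equal, free3]))
```

Each node is visited once per occurrence, so the pass is linear in the size of a tree-shaped formula; a DAG with shared subformulas would need memoization to keep that bound.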