Computational Learning Theory
Causality Networks
While correlation measures are used to discern statistical relationships between observed variables in almost all branches of data-driven scientific inquiry, what we are really interested in is the existence of causal dependence. Designing an efficient causality test that can be carried out without restrictive presuppositions on the underlying dynamical structure of the data at hand is non-trivial. Nevertheless, the ability to computationally infer statistical prima facie evidence of causal dependence may yield a far more discriminative tool for data analysis than the calculation of simple correlations. In the present work, we develop a new non-parametric test of Granger causality for quantized or symbolic data streams generated by ergodic stationary sources. In contrast to state-of-the-art binary tests, our approach makes precise and computes the degree of causal dependence between data streams, without restrictive assumptions, linearity or otherwise. Additionally, without any a priori imposition of specific dynamical structure, we infer explicit generative models of causal cross-dependence, which may then be used for prediction. These explicit models are represented as generalized probabilistic automata, referred to as crossed automata, and are shown to be sufficient to capture a fairly general class of causal dependence. The proposed algorithms are computationally efficient in the PAC sense; i.e., we find good models of cross-dependence with high probability, with polynomial run-times and sample complexities. The theoretical results are applied to weekly search-frequency data from the Google Trends API for a chosen set of socially "charged" keywords. The causality network inferred from this dataset reveals, quite expectedly, the causal importance of certain keywords. It is also illustrated that correlation analysis fails to yield such insight.
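The crossed-automata construction itself is beyond a short illustration, but the flavour of a non-parametric, model-free test of directional dependence between symbol streams can be sketched with a plug-in transfer-entropy estimate. The function below is an illustrative stand-in of my own (not the paper's test, and without its PAC guarantees): it compares how well the next symbol of one stream is predicted with and without the recent past of the other.

from collections import Counter
from math import log2

def transfer_entropy(x, y, k=1):
    # Plug-in estimate of T_{X->Y}: the reduction in uncertainty about the next
    # symbol of y when the last k symbols of x are added to the last k symbols
    # of y.  Values near 0 suggest no (order-k) directional influence of x on y.
    n = min(len(x), len(y))
    joint, past_xy, next_y, past_y = Counter(), Counter(), Counter(), Counter()
    for t in range(k, n):
        yp, xp, yn = tuple(y[t-k:t]), tuple(x[t-k:t]), y[t]
        joint[(yp, xp, yn)] += 1
        past_xy[(yp, xp)] += 1
        next_y[(yp, yn)] += 1
        past_y[yp] += 1
    total = sum(joint.values())
    te = 0.0
    for (yp, xp, yn), c in joint.items():
        p_next_given_xy = c / past_xy[(yp, xp)]
        p_next_given_y = next_y[(yp, yn)] / past_y[yp]
        te += (c / total) * log2(p_next_given_xy / p_next_given_y)
    return te

# Toy check: y copies x with a one-step lag, so x "causes" y but not vice versa.
import random
random.seed(0)
x = [random.randint(0, 1) for _ in range(5000)]
y = [0] + x[:-1]
print(transfer_entropy(x, y))   # close to 1 bit
print(transfer_entropy(y, x))   # close to 0

The paper goes further than such a fixed-order plug-in estimate: it builds explicit crossed probabilistic automata from the streams, so the inferred dependence comes with a generative model usable for prediction and with guarantees on run-time and sample complexity.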
SMML estimators for exponential families with continuous sufficient statistics
The minimum message length principle is an information theoretic criterion that links data compression with statistical inference. This paper studies the strict minimum message length (SMML) estimator for $d$-dimensional exponential families with continuous sufficient statistics, for all $d \ge 1$. The partition of an SMML estimator is shown to consist of convex polytopes (i.e. convex polygons when $d=2$) which can be described explicitly in terms of the assertions and coding probabilities. While this result is known, we give a new proof based on the calculus of variations, and this approach gives some interesting new inequalities for SMML estimators. We also use this result to construct an SMML estimator for a $2$-dimensional normal random variable with known variance and a normal prior on its mean.
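For context, the objective being minimised can be sketched in standard Wallace-Boulton notation (mine, not necessarily the paper's): given a prior $\pi$ and model $f(x \mid \theta)$, an SMML estimator chooses a countable set of assertions $\{\theta_j\}$ together with a partition $\{t_j\}$ of the data space so as to minimise the expected length of a two-part message,

$$ I_1 \;=\; -\sum_j q_j \log q_j \;-\; \sum_j \int_{t_j} r(x) \log f(x \mid \theta_j)\, dx, \qquad q_j = \int_{t_j} r(x)\, dx, \quad r(x) = \int f(x \mid \theta)\, \pi(\theta)\, d\theta, $$

where $q_j$ is the coding probability of assertion $\theta_j$ and $r$ is the marginal density of the data. The result above says that, for the exponential families considered, the optimal cells $t_j$ of this partition are convex polytopes.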
A convergence proof of the split Bregman method for regularized least-squares problems
Nien, Hung, Fessler, Jeffrey A.
The split Bregman (SB) method [T. Goldstein and S. Osher, SIAM J. Imaging Sci., 2 (2009), pp. 323-343] is a fast splitting-based algorithm that solves image reconstruction problems with general l1 regularizations, e.g., total variation (TV) and compressed sensing (CS), by introducing a single variable split to decouple the data-fitting term and the regularization term, yielding simple subproblems that are separable (or partially separable) and easy to minimize. Several convergence proofs have been proposed, but they either impose a "full column rank" assumption on the split or assume exact updates in all subproblems. These assumptions are impractical in many applications, such as X-ray computed tomography (CT) image reconstruction, where the inner least-squares problem usually cannot be solved efficiently due to the highly shift-variant Hessian. In this paper, we show that when the data-fitting term is quadratic, the SB method is a convergent alternating direction method of multipliers (ADMM), and we give a straightforward convergence proof with inexact updates using [J. Eckstein and D. P. Bertsekas, Mathematical Programming, 55 (1992), pp. 293-318, Theorem 8]. Furthermore, since the SB method is a special case of ADMM, the more general ADMM algorithm is likely to be faster than the SB method if the augmented Lagrangian (AL) penalty parameters are selected appropriately. As a concrete example, we conduct a convergence rate analysis of an ADMM algorithm that uses two splits for image restoration problems with a quadratic data-fitting term and regularization term. Our analysis shows that the two-split ADMM algorithm can be faster than the SB method when the AL penalty parameter of the SB method is suboptimal. Numerical experiments were conducted to verify our analysis.
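To make the structure concrete, here is a minimal numpy/scipy sketch (mine, not the authors' code) of the SB iteration for the generic model $\min_x \tfrac{1}{2}\|Ax-b\|_2^2 + \lambda\|Dx\|_1$, with the quadratic x-subproblem solved inexactly by a few conjugate-gradient steps, which is the inexact-update setting the proof above is concerned with. Matrix names and parameter choices are illustrative.

import numpy as np
from scipy.sparse.linalg import cg

def shrink(v, t):
    # Soft-thresholding: the exact minimiser of t*|d| + 0.5*(d - v)^2, elementwise.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def split_bregman(A, D, b, lam, mu, iters=100, inner_cg_iters=5):
    # Minimise 0.5*||A x - b||^2 + lam*||D x||_1 using the single split d = D x.
    x = np.zeros(A.shape[1])
    d = np.zeros(D.shape[0])
    bb = np.zeros(D.shape[0])            # Bregman (scaled dual) variable
    H = A.T @ A + mu * (D.T @ D)         # Hessian of the quadratic x-subproblem
    Atb = A.T @ b
    for _ in range(iters):
        # x-step (inexact): (A^T A + mu D^T D) x = A^T b + mu D^T (d - bb),
        # approximated by a handful of CG iterations warm-started at the old x.
        rhs = Atb + mu * (D.T @ (d - bb))
        x, _ = cg(H, rhs, x0=x, maxiter=inner_cg_iters)
        # d-step: exact, separable soft-thresholding.
        Dx = D @ x
        d = shrink(Dx + bb, lam / mu)
        # Bregman update of the split residual.
        bb = bb + Dx - d
    return x

With this single split the iteration coincides with ADMM applied to the constrained reformulation $\min_{x,d} \tfrac{1}{2}\|Ax-b\|_2^2 + \lambda\|d\|_1$ subject to $d = Dx$ with AL penalty $\mu$, which is the equivalence the abstract exploits; the speed comparison then comes down to how sensitive the scheme is to the choice of $\mu$.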
Towards Ultra Rapid Restarts
We observe a trend regarding the restart strategies used in SAT solvers. A few years ago, most state-of-the-art solvers restarted on average after a few thousand backtracks. Currently, restarting after about a dozen backtracks results in much better performance. The main reason for this trend is that heuristics and data structures have become more restart-friendly. We expect this trend to continue, so future SAT solvers will restart even more rapidly. Additionally, we present experimental results to support our observations.
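As a point of reference for what rapid restarts look like in code, the sketch below generates conflict limits from the Luby sequence, a schedule widely used in CDCL solvers. This is generic background rather than a scheme proposed in this paper, and the unit of 32 conflicts is only an example value.

def luby(i):
    # i-th term (0-based) of the Luby sequence 1,1,2,1,1,2,4,1,1,2,1,1,2,4,8,...
    size, seq = 1, 0
    while size < i + 1:
        seq += 1
        size = 2 * size + 1
    while size - 1 != i:
        size = (size - 1) // 2
        seq -= 1
        i %= size
    return 1 << seq

def restart_limits(unit=32):
    # Yield successive conflict limits; a small unit means very rapid restarts.
    i = 0
    while True:
        yield unit * luby(i)
        i += 1

limits = restart_limits()
print([next(limits) for _ in range(8)])   # [32, 32, 64, 32, 32, 64, 128, 32]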
Concurrent Cube-and-Conquer
van der Tak, Peter, Heule, Marijn J. H., Biere, Armin
Recent work introduced the cube-and-conquer technique to solve hard SAT instances. It partitions the search space into cubes using a lookahead solver. Each cube is tackled by a conflict-driven clause learning (CDCL) solver. Crucial for strong performance is the cutoff heuristic that decides when to switch from lookahead to CDCL. Yet, this offline heuristic is far from ideal. In this paper, we present a novel hybrid solver that applies the cube and conquer steps simultaneously. A lookahead and a CDCL solver work together on each cube, while communication is restricted to synchronization. Our concurrent cube-and-conquer solver can solve many instances faster than pure lookahead, pure CDCL and offline cube-and-conquer, and can abort early in favor of a pure CDCL search if an instance is not suitable for cube-and-conquer techniques.
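A rough orchestration sketch of the concurrent idea follows. It only captures the coarse structure (cubes conquered in parallel, racing against a pure CDCL run that can win and settle the instance), not the paper's tighter per-cube cooperation between a lookahead and a CDCL solver. The functions generate_cubes and cdcl_solve are hypothetical stand-ins for a lookahead cube generator and a CDCL back end, not real library calls.

from concurrent.futures import ThreadPoolExecutor, as_completed

def concurrent_cube_and_conquer(clauses, generate_cubes, cdcl_solve, workers=4):
    # generate_cubes(clauses) -> list of assumption cubes covering the search space.
    # cdcl_solve(clauses, assumptions) -> (is_sat, model_or_None).
    # A production version would interrupt still-running solver threads once the
    # answer is known; this sketch simply lets them run to completion.
    cubes = generate_cubes(clauses)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        jobs = {pool.submit(cdcl_solve, clauses, []): None}        # pure CDCL race
        for cube in cubes:
            jobs[pool.submit(cdcl_solve, clauses, cube)] = cube
        refuted = 0
        for fut in as_completed(jobs):
            sat, model = fut.result()
            if sat:
                return True, model         # any SAT answer settles the instance
            if jobs[fut] is None:
                return False, None         # pure CDCL refuted the whole formula
            refuted += 1
            if refuted == len(cubes):
                return False, None         # every cube refuted, hence UNSAT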
From Small-World Networks to Comparison-Based Search
Karbasi, Amin, Ioannidis, Stratis, Massoulie, Laurent
The problem of content search through comparisons has recently received considerable attention. In short, a user searching for a target object navigates through a database in the following manner: the user is asked to select the object most similar to her target from a small list of objects. A new object list is then presented to the user based on her earlier selection. This process is repeated until the target is included in the list presented, at which point the search terminates. This problem is known to be strongly related to the small-world network design problem. However, contrary to prior work, which focuses on cases where objects in the database are equally popular, we consider here the case where the demand for objects may be heterogeneous. We show that, under heterogeneous demand, the small-world network design problem is NP-hard. Given the above negative result, we propose a novel mechanism for small-world design and provide an upper bound on its performance under heterogeneous demand. The above mechanism has a natural equivalent in the context of content search through comparisons, and we establish both an upper bound and a lower bound for the performance of this mechanism. These bounds are intuitively appealing, as they depend on the entropy of the demand as well as its doubling constant, a quantity capturing the topology of the set of target objects. They also illustrate interesting connections between comparison-based search and classic results from information theory. Finally, we propose an adaptive learning algorithm for content search that meets the performance guarantees achieved by the above mechanisms.
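The interactive protocol itself is easy to state in code. The sketch below fixes a navigation graph and simulates the user by a distance oracle, leaving open how the graph is designed, which is the small-world design question the paper actually studies; all names are illustrative.

def comparison_search(graph, distance, target, start, max_steps=100):
    # graph: maps each object to the short list shown when that object is current.
    # distance: proxy for the user's similarity judgement (smaller = more similar).
    current = start
    for _ in range(max_steps):
        shown = graph[current]
        if target in shown:
            return target                  # target appears in the list: done
        # the "user" picks the displayed object most similar to the hidden target
        current = min(shown, key=lambda o: distance(o, target))
    return None                            # give up after max_steps lists

# Toy example: 100 points on a line, each showing its two ring neighbours
# plus one arbitrary long-range link.
graph = {i: [max(i - 1, 0), min(i + 1, 99), (37 * i) % 100] for i in range(100)}
print(comparison_search(graph, lambda a, b: abs(a - b), target=73, start=0))  # 73

Under heterogeneous demand, the question addressed above is how to choose the displayed lists (the out-edges of this graph) so that searches of this kind terminate quickly in expectation, with the entropy and doubling constant of the demand controlling the achievable search cost.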
Bounding Embeddings of VC Classes into Maximum Classes
Rubinstein, J. Hyam, Rubinstein, Benjamin I. P., Bartlett, Peter L.
One of the earliest conjectures in computational learning theory, the Sample Compression conjecture, asserts that concept classes (equivalently, set systems) admit compression schemes of size linear in their VC dimension. To date, this statement is known to be true for maximum classes, those that possess maximum cardinality for their VC dimension. The most promising approach to positively resolving the conjecture is to embed general VC classes into maximum classes without a superlinear increase in their VC dimension, as such embeddings would extend the known compression schemes to all VC classes. We show that maximum classes can be characterised by a local-connectivity property of the graph obtained by viewing the class as a cubical complex. This geometric characterisation of maximum VC classes is applied to prove a negative embedding result: there are VC-d classes that cannot be embedded in any maximum class of VC dimension lower than 2d. On the other hand, we show that every VC-d class C embeds in a VC-(d+D) maximum class, where D is the deficiency of C, i.e., the difference between the cardinalities of a maximum VC-d class and of C. For VC-2 classes in binary n-cubes with 4 <= n <= 6, we give best possible results on embedding into maximum classes. For some special classes of Boolean functions, relationships with maximum classes are investigated. Finally, we give a general recursive procedure for embedding VC-d classes into VC-(d+k) maximum classes for the smallest such k.
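To fix the quantity in play, recall the standard Sauer-Shelah fact (not specific to this paper) that a maximum class of VC dimension d on n points has cardinality exactly $\sum_{i=0}^{d} \binom{n}{i}$, so the deficiency of a VC-d class C on n points is

$$ D(C) \;=\; \sum_{i=0}^{d} \binom{n}{i} \;-\; |C|, $$

and the positive result above embeds C into a maximum class of VC dimension at most d + D(C).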
Finding the True Frequent Itemsets
Riondato, Matteo, Vandin, Fabio
Frequent Itemsets (FIs) mining is a fundamental primitive in data mining. It requires identifying all itemsets that appear in at least a fraction $\theta$ of the transactions of a dataset $\mathcal{D}$. Often though, the ultimate goal of mining $\mathcal{D}$ is not an analysis of the dataset \emph{per se}, but the understanding of the underlying process that generated it. Specifically, in many applications $\mathcal{D}$ is a collection of samples obtained from an unknown probability distribution $\pi$ on transactions, and by extracting the FIs in $\mathcal{D}$ one attempts to infer the itemsets that are frequently (i.e., with probability at least $\theta$) generated by $\pi$, which we call the True Frequent Itemsets (TFIs). Due to the inherently stochastic nature of the generative process, the set of FIs is only a rough approximation of the set of TFIs, as it often contains a huge number of \emph{false positives}, i.e., spurious itemsets that are not among the TFIs. In this work we design and analyze an algorithm that identifies a threshold $\hat{\theta}$ such that the collection of itemsets with frequency at least $\hat{\theta}$ in $\mathcal{D}$ contains only TFIs with probability at least $1-\delta$, for some user-specified $\delta$. Our method uses results from statistical learning theory involving the (empirical) VC-dimension of the problem at hand. This allows us to identify almost all the TFIs without including any false positives. We also experimentally compare our method with the direct mining of $\mathcal{D}$ at frequency $\theta$ and with techniques based on widely used bounds on the binomial distribution (i.e., Chernoff bounds), and show that our algorithm outperforms these methods and achieves even better results than the theoretical analysis guarantees.
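Schematically, the method amounts to mining at an inflated threshold $\hat{\theta} = \theta + \epsilon$, where $\epsilon$ bounds, uniformly over itemsets, the deviation between dataset frequency and true frequency. The sketch below plugs in a generic VC-style deviation bound with an illustrative constant; the paper's contribution is precisely a tighter, data-dependent way of obtaining $\epsilon$ via the empirical VC-dimension, so the numbers here are not its guarantees.

import math

def vc_uniform_deviation(n, vc_dim, delta, c=0.5):
    # Generic epsilon such that, with prob >= 1-delta, every itemset's true
    # frequency and its frequency in the n sampled transactions differ by at
    # most epsilon (the constant c and exact form are illustrative only).
    return math.sqrt(c * (vc_dim + math.log(1.0 / delta)) / n)

def true_frequent_threshold(theta, n, vc_dim, delta):
    # Mine at the inflated threshold so that, with prob >= 1-delta, every
    # itemset reported with frequency >= theta_hat has true frequency >= theta,
    # i.e. no false positives among the reported itemsets.
    return theta + vc_uniform_deviation(n, vc_dim, delta)

# e.g. with 10^6 transactions, a (hypothetical) VC-dimension bound of 20,
# delta = 0.05 and theta = 0.01:
print(true_frequent_threshold(0.01, 10**6, 20, 0.05))   # ~0.0134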
Predictive PAC Learning and Process Decompositions
Shalizi, Cosma, Kontorovich, Aryeh
We informally call a stochastic process learnable if it admits a generalization error approaching zero in probability for any concept class with finite VC-dimension (IID processes are the simplest example). A mixture of learnable processes need not be learnable itself, and certainly its generalization error need not decay at the same rate. In this paper, we argue that it is natural in predictive PAC to condition not on the past observations but on the mixture component of the sample path. This definition not only matches what a realistic learner might demand, but also allows us to sidestep several otherwise grave problems in learning from dependent data. In particular, we give a novel PAC generalization bound for mixtures of learnable processes with a generalization error that is not worse than that of each mixture component. We also provide a characterization of mixtures of absolutely regular ($\beta$-mixing) processes, of independent interest.
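The gain from conditioning on the component can be conveyed in one display (my notation, an informal paraphrase rather than the paper's theorem): if the observed process has law $m = \int P_\theta\, \mu(d\theta)$ and each ergodic component $P_\theta$ satisfies a uniform deviation bound $P_\theta\big(\sup_{h \in \mathcal{H}} |\hat{L}_n(h) - L_\theta(h)| > \epsilon\big) \le \delta_\theta(n, \epsilon)$, then averaging over the component gives

$$ P_m\Big(\sup_{h \in \mathcal{H}} |\hat{L}_n(h) - L_\theta(h)| > \epsilon\Big) \;=\; \int P_\theta\Big(\sup_{h \in \mathcal{H}} |\hat{L}_n(h) - L_\theta(h)| > \epsilon\Big)\, \mu(d\theta) \;\le\; \sup_\theta \delta_\theta(n, \epsilon), $$

so the mixture inherits a bound no worse than its worst component, provided the target of generalization is the component-specific risk $L_\theta$ rather than the risk under the full mixture.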
More data speeds up training time in learning halfspaces over sparse vectors
Daniely, Amit, Linial, Nati, Shalev-Shwartz, Shai
The increased availability of data in recent years has led several authors to ask whether it is possible to use data as a {\em computational} resource. That is, if more data is available, beyond the sample complexity limit, is it possible to use the extra examples to speed up the computation required to perform the learning task? We give the first positive answer to this question for a {\em natural supervised learning problem}: agnostic PAC learning of halfspaces over $3$-sparse vectors in $\{-1,1,0\}^n$. This class can be learned, albeit inefficiently, using $O\left(n/\epsilon^2\right)$ examples. Our main contribution is a novel, non-cryptographic methodology for establishing computational-statistical gaps, which allows us to show that, under the widely believed assumption that refuting random $\mathrm{3CNF}$ formulas is hard, efficiently learning this class with $O\left(n/\epsilon^2\right)$ examples is impossible. We further show that under stronger hardness assumptions, even $O\left(n^{1.499}/\epsilon^2\right)$ examples do not suffice. On the other hand, we present a new algorithm that learns this class efficiently using $\tilde{\Omega}\left(n^2/\epsilon^2\right)$ examples. This formally establishes a tradeoff between sample complexity and computational complexity for a natural supervised learning problem.