Computational Learning Theory
Stable Manipulation in Voting
Voting has served as a fundamental tool for aggregating preferences of a set of people over a set of alternatives from centuries. A typical voting system consists of a set of alternatives, a set of voters each having a linear order over the set of alternatives as her preference, and a voting rule which selects a set of alternatives as winners depending on the voters' preferences. However, classical results show that every reasonable voting system with at least 3 alternatives can suffer from manipulation [Gib73, Sat75] -- an agent may be able to make her more favored alternative win by misreporting her preference. Bartholdi et al. pioneered the idea of using computational intractability as a barrier to safe guard elections against manipulation [BTT89, BO91]. Indeed, if we have m alternatives and even if the manipulator exactly knows the preferences of all other voters, na ıvely going over all ( m! 1) possible preferences and report the one which results in best outcome for the manipulator is not feasible for any computationally bounded manipulator . Although the idea of Bartholdi et al. was to use computational intractability as a barrier against manipulation, the computational problem of manipulation turns out to be polynomial time solvable for most of the commonly used voting rules such as the scoring rules, maximin, Copeland, etc. with prominent exception being the single transferable (STV) voting rule. Even for voting rules (STV for example) for which the computational barrier exists against manipulation, it seems that the barrier, in reality, may be substantially weak due to existence of heuristics which work well in practice [FP10, FKKN11, MR15, and references there in]. Motivation: The computational problem of manipulation has mostly been studied in what is called the complete information setting -- the manipulator exactly knows the preferences of all other voters.
Minimum Description Length Revisited
This is an up-to-date introduction to and overview of the Minimum Description Length (MDL) Principle, a theory of inductive inference that can be applied to general problems in statistics, machine learning and pattern recognition. While MDL was originally based on data compression ideas, this introduction can be read without any knowledge thereof. It takes into account all major developments since 2007, the last time an extensive overview was written. These include new methods for model selection and averaging and hypothesis testing, as well as the first completely general definition of {\em MDL estimators}. Incorporating these developments, MDL can be seen as a powerful extension of both penalized likelihood and Bayesian approaches, in which penalization functions and prior distributions are replaced by more general luckiness functions, average-case methodology is replaced by a more robust worst-case approach, and in which methods classically viewed as highly distinct, such as AIC vs BIC and cross-validation vs Bayes can, to a large extent, be viewed from a unified perspective.
Efficient Truncated Statistics with Unknown Truncation
Kontonis, Vasilis, Tzamos, Christos, Zampetakis, Manolis
We study the problem of estimating the parameters of a Gaussian distribution when samples are only shown if they fall in some (unknown) subset $S \subseteq \R^d$. This core problem in truncated statistics has long history going back to Galton, Lee, Pearson and Fisher. Recent work by Daskalakis et al. (FOCS'18), provides the first efficient algorithm that works for arbitrary sets in high dimension when the set is known, but leaves as an open problem the more challenging and relevant case of unknown truncation set. Our main result is a computationally and sample efficient algorithm for estimating the parameters of the Gaussian under arbitrary unknown truncation sets whose performance decays with a natural measure of complexity of the set, namely its Gaussian surface area. Notably, this algorithm works for large families of sets including intersections of halfspaces, polynomial threshold functions and general convex sets. We show that our algorithm closely captures the tradeoff between the complexity of the set and the number of samples needed to learn the parameters by exhibiting a set with small Gaussian surface area for which it is information theoretically impossible to learn the true Gaussian with few samples.
Privately Answering Classification Queries in the Agnostic PAC Model
We revisit the problem of differentially private release of classification queries. In this problem, the goal is to design an algorithm that can accurately answer a sequence of classification queries based on a private training set while ensuring differential privacy. We formally study this problem in the agnostic PAC model and derive a new upper bound on the private sample complexity. Our results improve over those obtained in a recent work [BTT18] for the agnostic PAC setting. In particular, we give an improved construction that yields a tighter upper bound on the sample complexity. Moreover, unlike [BTT18], our accuracy guarantee does not involve any blow-up in the approximation error associated with the given hypothesis class. Given any hypothesis class with VC-dimension $d$, we show that our construction can privately answer up to $m$ classification queries with average excess error $\alpha$ using a private sample of size $\approx \frac{d}{\alpha^2}\max\left(1, \sqrt{m}\alpha^{3/2}\right)$. Using recent results on private learning with auxiliary public data, we extend our construction to show that one can privately answer any number of classification queries with average excess error $\alpha$ using a private sample of size $\approx \frac{d}{\alpha^2}\max\left(1, \sqrt{d} \alpha\right)$. Our results imply that when $\alpha$ is sufficiently small (high-accuracy regime), the private sample size is essentially the same as the non-private sample complexity of agnostic PAC learning.
Community Structure in Industrial SAT Instances
Ansótegui, Carlos, Bonet, Maria Luisa, Giráldez-Cru, Jesús, Levy, Jordi, Simon, Laurent
Modern SAT solvers have experienced a remarkable progress on solving industrial instances. Most of the techniques have been developed after an intensive experimental process. It is believed that these techniques exploit the underlying structure of industrial instances. However, there are few works trying to exactly characterize the main features of this structure. The research community on complex networks has developed techniques of analysis and algorithms to study real-world graphs that can be used by the SAT community. Recently, there have been some attempts to analyze the structure of industrial SAT instances in terms of complex networks, with the aim of explaining the success of SAT solving techniques, and possibly improving them. In this paper, inspired by the results on complex networks, we study the community structure, or modularity, of industrial SAT instances. In a graph with clear community structure, or high modularity, we can find a partition of its nodes into communities such that most edges connect variables of the same community. In our analysis, we represent SAT instances as graphs, and we show that most application benchmarks are characterized by a high modularity. On the contrary, random SAT instances are closer to the classical Erd\"os-R\'enyi random graph model, where no structure can be observed. We also analyze how this structure evolves by the effects of the execution of a CDCL SAT solver. In particular, we use the community structure to detect that new clauses learned by the solver during the search contribute to destroy the original structure of the formula. This is, learned clauses tend to contain variables of distinct communities.
A Profit Guided Coordination Heuristic for Travelling Thief Problems
Namazi, Majid (Griffith University) | Newton, M.A. Hakim (Griffith University) | Sattar, Abdul (Griffith University) | Sanderson, Conrad (Data61 / CSIRO)
The travelling thief problem (TTP) is a combination of two interdependent NP-hard components: travelling salesman problem (TSP) and knapsack problem (KP). Existing approaches for TTP typically solve the TSP and KP components in an interleaved fashion, where the solution to one component is held fixed while the other component is changed. This indicates poor coordination between solving the two components and may lead to poor quality TTP solutions. For solving the TSP component, the 2-OPT segment reversing heuristic is often used for modifying the tour. We propose an extended and modified form of the reversing heuristic in order to concurrently consider both the TSP and KP components. Items deemed as less profitable and picked in cities earlier in the reversed segment are replaced by items that tend to be equally or more profitable and not picked in the later cities. Comparative evaluations on a broad range of benchmark TTP instances indicate that the proposed approach outperforms existing state-of-the-art TTP solvers.
Identifying Linear Models in Multi-Resolution Population Data using Minimum Description Length Principle to Predict Household Income
Amornbunchornvej, Chainarong, Surasvadi, Navaporn, Plangprasopchok, Anon, Thajchayapong, Suttipong
One shirt size cannot fit everybody, while we cannot make a unique shirt that fits perfectly for everyone because of resource limitation. This analogy is true for the policy making. Policy makers cannot establish a single policy to solve all problems for all regions because each region has its own unique issue. In the other extreme, policy makers also cannot create a policy for each small village due to the resource limitation. Would it be better if we can find a set of largest regions such that the population of each region within this set has common issues and we can establish a single policy for them? In this work, we propose a framework using regression analysis and minimum description length (MDL) to find a set of largest areas that have common indicators, which can be used to predict household incomes efficiently. Given a set of household features, and a multi-resolution partition that represents administrative divisions, our framework reports a set C* of largest subdivisions that have a common model for population-income prediction. We formalize a problem of finding C* and propose the algorithm as a solution. We use both simulation datasets as well as a real-world dataset of Thailand's population household information to demonstrate our framework performance and application. The results show that our framework performance is better than the baseline methods. We show the results of our method can be used to find indicators of income prediction for many areas in Thailand. By increasing these indicator values, we expect people in these areas to gain more incomes. Hence, the policy makers can plan to establish the policies by using these indicators in our results as a guideline to solve low-income issues. Our framework can be used to support policy makers to establish policies regarding any other dependent variable beyond incomes in order to combat poverty and other issues.
Quantifying Confounding Bias in Neuroimaging Datasets with Causal Inference
Wachinger, Christian, Becker, Benjamin Gutierrez, Rieckmann, Anna, Pölsterl, Sebastian
Neuroimaging datasets keep growing in size to address increasingly complex medical questions. However, even the largest datasets today alone are too small for training complex machine learning models. A potential solution is to increase sample size by pooling scans from several datasets. In this work, we combine 12,207 MRI scans from 15 studies and show that simple pooling is often ill-advised due to introducing various types of biases in the training data. First, we systematically define these biases. Second, we detect bias by experimentally showing that scans can be correctly assigned to their respective dataset with 73.3% accuracy. Finally, we propose to tell causal from confounding factors by quantifying the extent of confounding and causality in a single dataset using causal inference. We achieve this by finding the simplest graphical model in terms of Kolmogorov complexity. As Kolmogorov complexity is not directly computable, we employ the minimum description length to approximate it. We empirically show that our approach is able to estimate plausible causal relationships from real neuroimaging data.
The Power of Comparisons for Actively Learning Linear Classifiers
Hopkins, Max, Kane, Daniel M., Lovett, Shachar
In recent years, the availability of big data and the high cost of labeling has lead to a surge of interest in active learning, an adaptive, semi-supervised learning paradigm. In traditional active learning, given an instance space X, a distribution D on X, and a class of concepts c: X {0, 1}, the learner receives unlabeled samples x from D with the ability to query an oracle for the labeling c(x). Classically our goal would be to minimize the number of samples the learner draws before approximately learning the concept class with high probability (PAClearning). Instead, active learning assumes unlabeled samples are inexpensive, and rather aims to minimize expensive queries to the oracle. While active learning requires exponentially fewer labeled samples than PAClearning for simple classes such as intervals and thresholds, it fails to provide asymptotic improvement for classes essential to machine learning such as linear separators [1]. However, recent results point to the fact that with slight relaxations or additions to the paradigm, such concept classes can be learned with exponentially fewer queries.