Goto

Collaborating Authors

 Computational Learning Theory


Nested-Greedy Search for Adversarial Real-Time Games

AAAI Conferences

Churchill and Buro (2013) launched a line of research through Portfolio Greedy Search (PGS), an algorithm for adversarial real-time planning that uses scripts to simplify the problem's action space. In this paper we present a problem in PGS's search scheme that has hitherto been overlooked. Namely, even under the strong assumption that PGS is able to evaluate all actions available to the player, PGS might fail to return the best action. We then describe an idealized algorithm that is guaranteed to return the best action and present an approximation of such algorithm, which we call Nested-Greedy Search (NGS). Empirical results on MicroRTS show that NGS is able to outperform PGS as well as state-of-the-art methods in matches played in small to medium-sized maps.


Degree-$d$ Chow Parameters Robustly Determine Degree-$d$ PTFs (and Algorithmic Applications)

arXiv.org Machine Learning

The degree-$d$ Chow parameters of a Boolean function $f: \{-1,1\}^n \to \mathbb{R}$ are its degree at most $d$ Fourier coefficients. It is well-known that degree-$d$ Chow parameters uniquely characterize degree-$d$ polynomial threshold functions (PTFs) within the space of all bounded functions. In this paper, we prove a robust version of this theorem: For $f$ any Boolean degree-$d$ PTF and $g$ any bounded function, if the degree-$d$ Chow parameters of $f$ are close to the degree-$d$ Chow parameters of $g$ in $\ell_2$-norm, then $f$ is close to $g$ in $\ell_1$-distance. Notably, our bound relating the two distances is completely independent of the dimension $n$. That is, we show that Boolean degree-$d$ PTFs are {\em robustly identifiable} from their degree-$d$ Chow parameters. Results of this form had been shown for the $d=1$ case~\cite{OS11:chow, DeDFS14}, but no non-trivial bound was previously known for $d >1$. Our robust identifiability result gives the following algorithmic applications: First, we show that Boolean degree-$d$ PTFs can be efficiently approximately reconstructed from approximations to their degree-$d$ Chow parameters. This immediately implies that degree-$d$ PTFs are efficiently learnable in the uniform distribution $d$-RFA model~\cite{BenDavidDichterman:98}. As a byproduct of our approach, we also obtain the first low integer-weight approximations of degree-$d$ PTFs, for $d>1$. As our second application, our robust identifiability result gives the first efficient algorithm, with dimension-independent error guarantees, for malicious learning of Boolean degree-$d$ PTFs under the uniform distribution.


Model Extraction and Active Learning

arXiv.org Machine Learning

Machine learning is being increasingly used by individuals, research institutions, and corporations. This has resulted in the surge of Machine Learning-as-a-Service (MLaaS) - cloud services that provide (a) tools and resources to learn the model, and (b) a user-friendly query interface to access the model. However, such MLaaS systems raise privacy concerns, one being model extraction. Adversaries maliciously exploit the query interface to steal the model. More precisely, in a model extraction attack, a good approximation of a sensitive or proprietary model held by the server is extracted (i.e. learned) by a dishonest user. Such a user only sees the answers to select queries sent using the query interface. This attack was recently introduced by Tramer et al. at the 2016 USENIX Security Symposium, where practical attacks for different models were shown. We believe that better understanding the efficacy of model extraction attacks is paramount in designing better privacy-preserving MLaaS systems. To that end, we take the first step by (a) formalizing model extraction and proposing the first definition of extraction defense, and (b) drawing parallels between model extraction and the better investigated active learning framework. In particular, we show that recent advancements in the active learning domain can be used to implement both model extraction, and defenses against such attacks.


A tutorial on MDL hypothesis testing for graph analysis

arXiv.org Machine Learning

When analysing graph structure, it can be difficult to determine whether patterns found are due to chance, or due to structural aspects of the process that generated the data. Hypothesis tests are often used to support such analyses. These allow us to make statistical inferences about which null models are responsible for the data, and they can be used as a heuristic in searching for meaningful patterns. The minimum description length (MDL) principle [6, 4] allows us to build such hypothesis tests, based on efficient descriptions of the data. Broadly: we translate the regularity we are interested in into a code for the data, and if this code describes the data more efficiently than a code corresponding to the null model, by a sufficient margin, we may reject the null model. This is a frequentist approach to MDL, based on hypothesis testing. Bayesian approaches to MDL for model selection rather than model rejection are more common, but for the purposes of pattern analysis, a hypothesis testing approach provides a more natural fit with existing literature. 1 We provide a brief illustration of this principle based on the running example of analysing the size of the largest clique in a graph. We illustrate how a code can be constructed to efficiently represent graphs with large cliques, and how the description length of the data under this code can be compared to the description length under a code corresponding to a null model to show that the null model is highly unlikely to have generated the data.


On Statistical Learning of Simplices: Unmixing Problem Revisited

arXiv.org Machine Learning

Learning of high-dimensional simplices from uniformly-sampled observations, generally known as the "unmixing problem", is a long-studied task in computer science. More recently, a significant interest is focused on this problem from other areas, such as computational biology and remote sensing. In this paper, we have studied the Probably Approximately Correct (PAC)-learnability of simplices with a focus on sample complexity. Our analysis shows that a sufficient sample size for PAC-learning of $K$-simplices is only $O\left(K^2\log K\right)$, yielding a huge improvement over the existing results, i.e. $O\left(K^{22}\right)$. Moreover, a novel continuously-relaxed optimization scheme is proposed which is guaranteed to achieve a PAC-approximation of the simplex, followed by a corresponding scalable algorithm whose performance is extensively tested on synthetic and real-world datasets. Experimental results show that not only being comparable to other existing strategies on noiseless samples, our method is superior to the state-of-the-art in noisy cases. The overall proposed framework is backed with solid theoretical guarantees and provides a rigorous framework for future research in this area.


Semi-supervised clustering for de-duplication

arXiv.org Machine Learning

Data de-duplication is the task of detecting multiple records that correspond to the same real-world entity in a database. In this work, we view de-duplication as a clustering problem where the goal is to put records corresponding to the same physical entity in the same cluster and putting records corresponding to different physical entities into different clusters. We introduce a framework which we call promise correlation clustering. Given a complete graph $G$ with the edges labelled $0$ and $1$, the goal is to find a clustering that minimizes the number of $0$ edges within a cluster plus the number of $1$ edges across different clusters (or correlation loss). The optimal clustering can also be viewed as a complete graph $G^*$ with edges corresponding to points in the same cluster being labelled $0$ and other edges being labelled $1$. Under the promise that the edge difference between $G$ and $G^*$ is "small", we prove that finding the optimal clustering (or $G^*$) is still NP-Hard. [Ashtiani et. al, 2016] introduced the framework of semi-supervised clustering, where the learning algorithm has access to an oracle, which answers whether two points belong to the same or different clusters. We further prove that even with access to a same-cluster oracle, the promise version is NP-Hard as long as the number queries to the oracle is not too large ($o(n)$ where $n$ is the number of vertices). Given these negative results, we consider a restricted version of correlation clustering. As before, the goal is to find a clustering that minimizes the correlation loss. However, we restrict ourselves to a given class $\mathcal F$ of clusterings. We offer a semi-supervised algorithmic approach to solve the restricted variant with success guarantees.


Spatio-temporal Edge Service Placement: A Bandit Learning Approach

arXiv.org Artificial Intelligence

Shared edge computing platforms deployed at the radio access network are expected to significantly improve quality of service delivered by Application Service Providers (ASPs) in a flexible and economic way. However, placing edge service in every possible edge site by an ASP is practically infeasible due to the ASP's prohibitive budget requirement. In this paper, we investigate the edge service placement problem of an ASP under a limited budget, where the ASP dynamically rents computing/storage resources in edge sites to host its applications in close proximity to end users. Since the benefit of placing edge service in a specific site is usually unknown to the ASP a priori, optimal placement decisions must be made while learning this benefit. We pose this problem as a novel combinatorial contextual bandit learning problem. It is "combinatorial" because only a limited number of edge sites can be rented to provide the edge service given the ASP's budget. It is "contextual" because we utilize user context information to enable finer-grained learning and decision making. To solve this problem and optimize the edge computing performance, we propose SEEN, a Spatial-temporal Edge sErvice placemeNt algorithm. Furthermore, SEEN is extended to scenarios with overlapping service coverage by incorporating a disjunctively constrained knapsack problem. In both cases, we prove that our algorithm achieves a sublinear regret bound when it is compared to an oracle algorithm that knows the exact benefit information. Simulations are carried out on a real-world dataset, whose results show that SEEN significantly outperforms benchmark solutions. Mobile cloud computing (MCC) supports mobile applications in resource-constrained mobile devices by offloading computation-demanding tasks to the resource-rich remote cloud. L. Chen and J. Xu are with Department of Electrical and Computer Engineering, University of Miami, USA. S. Ren is with Department of Electrical and Computer Engineering, University of California, Riverside, USA.


r/MachineLearning - [Research] Help relating to a theorem in machine learning

#artificialintelligence

This is related to a theorem that I have proved and its relation (or not) to an existing result. Essentially, I have shown that PAC-learning is undecidable in the Turing sense. The arxiv link to the paper is https://arxiv.org/abs/1808.06324 I am told that this is provable as a corollary of existing results. I was hinted that the fundamental theorem of statistical machine learning that relates the VC dimension and PAC-learning could be used to prove the undecidability of PAC-learning.


When is there a Representer Theorem? Reflexive Banach spaces

arXiv.org Machine Learning

We consider a general regularised interpolation problem for learning a parameter vector from data. The well known representer theorem says that under certain conditions on the regulariser there exists a solution in the linear span of the data points. This is the core of kernel methods in machine learning as it makes the problem computationally tractable. Most literature deals only with sufficient conditions for representer theorems in Hilbert spaces. We prove necessary and sufficient conditions for the existence of representer theorems in reflexive Banach spaces and illustrate why in a sense reflexivity is the minimal requirement on the function space. We further show that if the learning relies on the linear representer theorem the solution is independent of the regulariser and in fact determined by the function space alone. This in particular shows the value of generalising Hilbert space learning theory to Banach spaces.


Learning without Interaction Requires Separation

arXiv.org Machine Learning

One of the key resources in large-scale learning systems is the number of rounds of communication between the server and the clients holding the data points. We study this resource for systems with two types of constraints on the communication from each of the clients: local differential privacy and limited number of bits communicated. For both models the number of rounds of communications is captured by the number of rounds of interaction when solving the learning problem in the statistical query (SQ) model. For many learning problems known efficient algorithms require many rounds of interaction. Yet little is known on whether this is actually necessary. In the context of classification in the PAC learning model, Kasiviswanathan et al. (2008) constructed an artificial class of functions that is PAC learnable with respect to a fixed distribution but cannot be learned by an efficient non-interactive (or one-round) SQ algorithm. Here we show that a similar separation holds for learning linear separators and decision lists without assumptions on the distribution. To prove this separation we show that non-interactive SQ algorithms can only learn function classes of low margin complexity, that is classes of functions that can be represented as large-margin linear separators.