Goto

Collaborating Authors

 Genre


DeBaCl: A Python Package for Interactive DEnsity-BAsed CLustering

arXiv.org Machine Learning

The level set tree approach of Hartigan (1975) provides a probabilistically based and highly interpretable encoding of the clustering behavior of a dataset. By representing the hierarchy of data modes as a dendrogram of the level sets of a density estimator, this approach offers many advantages for exploratory analysis and clustering, especially for complex and high-dimensional data. Several R packages exist for level set tree estimation, but their practical usefulness is limited by computational inefficiency, absence of interactive graphical capabilities and, from a theoretical perspective, reliance on asymptotic approximations. To make it easier for practitioners to capture the advantages of level set trees, we have written the Python package DeBaCl for DEnsity-BAsed CLustering. In this article we illustrate how DeBaCl's level set tree estimates can be used for difficult clustering tasks and interactive graphical data analysis. The package is intended to promote the practical use of level set trees through improvements in computational efficiency and a high degree of user customization. In addition, the flexible algorithms implemented in DeBaCl enjoy finite sample accuracy, as demonstrated in recent literature on density clustering. Finally, we show the level set tree framework can be easily extended to deal with functional data. Keywords: density-based clustering, level set tree, Python, interactive graphics, functional data analysis.


Sharp Threshold for Multivariate Multi-Response Linear Regression via Block Regularized Lasso

arXiv.org Machine Learning

In this paper, we investigate a multivariate multi-response (MVMR) linear regression problem, which contains multiple linear regression models with differently distributed design matrices, and different regression and output vectors. The goal is to recover the support union of all regression vectors using $l_1/l_2$-regularized Lasso. We characterize sufficient and necessary conditions on sample complexity \emph{as a sharp threshold} to guarantee successful recovery of the support union. Namely, if the sample size is above the threshold, then $l_1/l_2$-regularized Lasso correctly recovers the support union; and if the sample size is below the threshold, $l_1/l_2$-regularized Lasso fails to recover the support union. In particular, the threshold precisely captures the impact of the sparsity of regression vectors and the statistical properties of the design matrices on sample complexity. Therefore, the threshold function also captures the advantages of joint support union recovery using multi-task Lasso over individual support recovery using single-task Lasso.


Likelihood-ratio calibration using prior-weighted proper scoring rules

arXiv.org Machine Learning

Prior-weighted logistic regression has become a standard tool for calibration in speaker recognition. Logistic regression is the optimization of the expected value of the logarithmic scoring rule. We generalize this via a parametric family of proper scoring rules. Our theoretical analysis shows how different members of this family induce different relative weightings over a spectrum of applications of which the decision thresholds range from low to high. Special attention is given to the interaction between prior weighting and proper scoring rule parameters. Experiments on NIST SRE'12 suggest that for applications with low false-alarm rate requirements, scoring rules tailored to emphasize higher score thresholds may give better accuracy than logistic regression.


Scalable $k$-NN graph construction

arXiv.org Machine Learning

The $k$-NN graph has played a central role in increasingly popular data-driven techniques for various learning and vision tasks; yet, finding an efficient and effective way to construct $k$-NN graphs remains a challenge, especially for large-scale high-dimensional data. In this paper, we propose a new approach to construct approximate $k$-NN graphs with emphasis in: efficiency and accuracy. We hierarchically and randomly divide the data points into subsets and build an exact neighborhood graph over each subset, achieving a base approximate neighborhood graph; we then repeat this process for several times to generate multiple neighborhood graphs, which are combined to yield a more accurate approximate neighborhood graph. Furthermore, we propose a neighborhood propagation scheme to further enhance the accuracy. We show both theoretical and empirical accuracy and efficiency of our approach to $k$-NN graph construction and demonstrate significant speed-up in dealing with large scale visual data.


POMDPs Make Better Hackers: Accounting for Uncertainty in Penetration Testing

arXiv.org Artificial Intelligence

Penetration Testing is a methodology for assessing network security, by generating and executing possible hacking attacks. Doing so automatically allows for regular and systematic testing. A key question is how to generate the attacks. This is naturally formulated as planning under uncertainty, i.e., under incomplete knowledge about the network configuration. Previous work uses classical planning, and requires costly pre-processes reducing this uncertainty by extensive application of scanning methods. By contrast, we herein model the attack planning problem in terms of partially observable Markov decision processes (POMDP). This allows to reason about the knowledge available, and to intelligently employ scanning actions as part of the attack. As one would expect, this accurate solution does not scale. We devise a method that relies on POMDPs to find good attacks on individual machines, which are then composed into an attack on the network as a whole. This decomposition exploits network structure to the extent possible, making targeted approximations (only) where needed. Evaluating this method on a suitably adapted industrial test suite, we demonstrate its effectiveness in both runtime and solution quality.


Tight Lower Bounds for Homology Inference

arXiv.org Machine Learning

The homology groups of a manifold are important topological invariants that provide an algebraic summary of the manifold. These groups contain rich topological information, for instance, about the connected components, holes, tunnels and sometimes the dimension of the manifold. In earlier work, we have considered the statistical problem of estimating the homology of a manifold from noiseless samples and from noisy samples under several different noise models. We derived upper and lower bounds on the minimax risk for this problem. In this note we revisit the noiseless case. In previous work we used Le Cam's lemma to establish a lower bound that differed from the upper bound of Niyogi, Smale and Weinberger by a polynomial factor in the condition number. In this note we use a different construction based on the direct analysis of the likelihood ratio test to show that the upper bound of Niyogi, Smale and Weinberger is in fact tight, thus establishing rate optimal asymptotic minimax bounds for the problem. The techniques we use here extend in a straightforward way to the noisy settings considered in our earlier work.


ReAct! An Interactive Tool for Hybrid Planning in Robotics

arXiv.org Artificial Intelligence

We present ReAct!, an interactive tool for high-level reasoning for cognitive robotic applications. ReAct! enables robotic researchers to describe robots' actions and change in dynamic domains, without having to know about the syntactic and semantic details of the underlying formalism in advance, and solve planning problems using state-of-the-art automated reasoners, without having to learn about their input/output language or usage. In particular, ReAct! can be used to represent sophisticated dynamic domains that feature concurrency, indirect effects of actions, and state/transition constraints. It allows for embedding externally defined calculations (e.g., checking for collision-free continuous trajectories) into representations of hybrid domains that require a tight integration of (discrete) high-level reasoning with (continuous) geometric reasoning. ReAct! also enables users to solve planning problems that involve complex goals. Such variety of utilities are useful for robotic researchers to work on interesting and challenging domains, ranging from service robotics to cognitive factories. ReAct! provides sample formalizations of some action domains (e.g., multi-agent path planning, Tower of Hanoi), as well as dynamic simulations of plans computed by a state-of-the-art automated reasoner (e.g., a SAT solver or an ASP solver).


Integration of 3D Object Recognition and Planning for Robotic Manipulation: A Preliminary Report

arXiv.org Artificial Intelligence

We investigate different approaches to integrating object recognition and planning in a tabletop manipulation domain with the set of objects used in the 2012 RoboCup@Work competition. Results of our preliminary experiments show that, with some approaches, close integration of perception and planning improves the quality of plans, as well as the computation times of feasible plans.


Levels of Integration between Low-Level Reasoning and Task Planning

arXiv.org Artificial Intelligence

We provide a systematic analysis of levels of integration between discrete high-level reasoning and continuous low-level reasoning to address hybrid planning problems in robotics. We identify four distinct strategies for such an integration: (i) low-level checks are done for all possible cases in advance and then this information is used during plan generation, (ii) low-level checks are done exactly when they are needed during the search for a plan, (iii) first all plans are computed and then infeasible ones are filtered, and (iv) by means of replanning, after finding a plan, low-level checks identify whether it is infeasible or not; if it is infeasible, a new plan is computed considering the results of previous lowlevel checks. We perform experiments on hybrid planning problems in robotic manipulation and legged locomotion domains considering these four methods of integration, as well as some of their combinations. We analyze the usefulness of levels of integration in these domains, both from the point of view of computational efficiency (in time and space) and from the point of view of plan quality relative to its feasibility. We discuss advantages and disadvantages of each strategy in the light of experimental results and provide some guidelines on choosing proper strategies for a given domain. Keywords: Task planning, geometric reasoning, answer set programming.


Borel Isomorphic Dimensionality Reduction of Data and Supervised Learning

arXiv.org Machine Learning

In this project we further investigate the idea of reducing the dimensionality of datasets using a Borel isomorphism with the purpose of subsequently applying supervised learning algorithms, as originally suggested by my supervisor V. Pestov (in 2011 Dagstuhl preprint). Any consistent learning algorithm, for example kNN, retains universal consistency after a Borel isomorphism is applied. A series of concrete examples of Borel isomorphisms that reduce the number of dimensions in a dataset is provided, based on multiplying the data by orthogonal matrices before the dimensionality reducing Borel isomorphism is applied. We test the accuracy of the resulting classifier in a lower dimensional space with various data sets. Working with a phoneme voice recognition dataset, of dimension 256 with 5 classes (phonemes), we show that a Borel isomorphic reduction to dimension 16 leads to a minimal drop in accuracy. In conclusion, we discuss further prospects of the method.