Goto

Collaborating Authors

 Search


Universal consistency and minimax rates for online Mondrian Forests

arXiv.org Machine Learning

We establish the consistency of an algorithm of Mondrian Forests, a randomized classification algorithm that can be implemented online. First, we amend the original Mondrian Forest algorithm, that considers a fixed lifetime parameter. Indeed, the fact that this parameter is fixed hinders the statistical consistency of the original procedure. Our modified Mondrian Forest algorithm grows trees with increasing lifetime parameters $\lambda_n$, and uses an alternative updating rule, allowing to work also in an online fashion. Second, we provide a theoretical analysis establishing simple conditions for consistency. Our theoretical analysis also exhibits a surprising fact: our algorithm achieves the minimax rate (optimal rate) for the estimation of a Lipschitz regression function, which is a strong extension of previous results to an arbitrary dimension.


What is Wrong with Topic Modeling? (and How to Fix it Using Search-based Software Engineering)

arXiv.org Artificial Intelligence

Context: Topic modeling finds human-readable structures in unstructured textual data. A widely used topic modeler is Latent Dirichlet allocation. When run on different datasets, LDA suffers from "order effects" i.e. different topics are generated if the order of training data is shuffled. Such order effects introduce a systematic error for any study. This error can relate to misleading results;specifically, inaccurate topic descriptions and a reduction in the efficacy of text mining classification results. Objective: To provide a method in which distributions generated by LDA are more stable and can be used for further analysis. Method: We use LDADE, a search-based software engineering tool that tunes LDA's parameters using DE (Differential Evolution). LDADE is evaluated on data from a programmer information exchange site (Stackoverflow), title and abstract text of thousands ofSoftware Engineering (SE) papers, and software defect reports from NASA. Results were collected across different implementations of LDA (Python+Scikit-Learn, Scala+Spark); across different platforms (Linux, Macintosh) and for different kinds of LDAs (VEM,or using Gibbs sampling). Results were scored via topic stability and text mining classification accuracy. Results: In all treatments: (i) standard LDA exhibits very large topic instability; (ii) LDADE's tunings dramatically reduce cluster instability; (iii) LDADE also leads to improved performances for supervised as well as unsupervised learning. Conclusion: Due to topic instability, using standard LDA with its "off-the-shelf" settings should now be depreciated. Also, in future, we should require SE papers that use LDA to test and (if needed) mitigate LDA topic instability. Finally, LDADE is a candidate technology for effectively and efficiently reducing that instability.


Monte-Carlo Tree Search by Best Arm Identification

arXiv.org Machine Learning

We consider two-player zero-sum turn-based interactions, in which the sequence of possible successive moves is represented by a maximin game tree T. This tree models the possible actions sequences by a collection of MAX nodes, that correspond to states in the game in which player A should take action, MIN nodes, for states in the game in which player B should take action, and leaves which specify the payoff for player A. The goal is to determine the best action at the root for player A. For deterministic payoffs this search problem is primarily algorithmic, with several powerful pruning strategies available [20]. We look at problems with stochastic payoffs, which in addition present a major statistical challenge. Sequential identification questions in game trees with stochastic payoffs arise naturally as robust versions of bandit problems. They are also a core component of Monte Carlo tree search (MCTS) approaches for solving intractably large deterministic tree search problems, where an entire sub-tree is represented by a stochastic leaf in which randomized play-out and/or evaluations are performed [4]. A play-out consists in finishing the game with some simple, typically random, policy and observing the outcome for player A. For example, MCTS is used within the AlphaGo system [21], and the evaluation of a leaf position combines supervised learning and (smart) play-outs. While MCTS algorithms for Go have now reached expert human level, such algorithms remain very costly, in that many (expensive) leaf evaluations or play-outs are necessary to output the next action to be taken by the player. In this paper, we focus on the sample complexity of Monte-Carlo Tree Search methods, about which very little is known. For this purpose, we work under a simplified model for MCTS already studied by [22], and that generalizes the depth-two framework of [10].


Five Local Search Myths About Managing Your Local Data

@machinelearnbot

There are lots of different ways to manage your local listings. No matter which way you choose to do it, you're paying someone for their service -- same as you would, say, with your cable company. When you stop paying your cable company, you no longer receive their service. When you stop paying them, they stop doing the work.


Sampling and Reconstruction of Graph Signals via Weak Submodularity and Semidefinite Relaxation

arXiv.org Machine Learning

ABSTRACT We study the problem of sampling a bandlimited graph signal in the presence of noise, where the objective is to select a node subset of prescribed cardinality that minimizes the signal reconstruction mean squared error (MSE). To that end, we formulate the task at hand as the minimization of MSE subject to binary constraints, and approximate the resulting NPhard problem via semidefinite programming (SDP) relaxation. Moreover, we provide an alternative formulation based on maximizing a monotone weak submodular function and propose a randomized-greedy algorithm to find a sub-optimal subset. We then derive a worst-case performance guarantee on the MSE returned by the randomized greedy algorithm for general non-stationary graph signals. The efficacy of the proposed methods is illustrated through numerical simulations on synthetic and realworld graphs.


Generating Time-Based Label Refinements to Discover More Precise Process Models

arXiv.org Artificial Intelligence

Process mining is a research field focused on the analysis of event data with the aim of extracting insights related to dynamic behavior. Applying process mining techniques on data from smart home environments has the potential to provide valuable insights into (un)healthy habits and to contribute to ambient assisted living solutions. Finding the right event labels to enable the application of process mining techniques is however far from trivial, as simply using the triggering sensor as the label for sensor events results in uninformative models that allow for too much behavior (i.e., the models are overgeneralizing). Refinements of sensor level event labels suggested by domain experts have been shown to enable discovery of more precise and insightful process models. However, there exists no automated approach to generate refinements of event labels in the context of process mining. In this paper we propose a framework for the automated generation of label refinements based on the time attribute of events, allowing us to distinguish behaviourally different instances of the same event type based on their time attribute. We show on a case study with real-life smart home event data that using automatically generated refined labels in process discovery, we can find more specific, and therefore more insightful, process models. We observe that one label refinement could have an effect on the usefulness of other label refinements when used together. Therefore, we explore four strategies to generate useful combinations of multiple label refinements and evaluate those on three real-life smart home event logs.


Rubik s Cube records broken in 4.59 seconds

Daily Mail - Science & tech

A 23-year-old'professional speedcuber' has set a new world record by completing a Rubik's Cube in just 4.59 seconds. Korean SeungBeom Cho solved the 3D puzzle in his first round at the World Cube Association's ChicaGhosts 2017 event in Chicago, smashing his previous personal best of 6.54 seconds. Footage of Mr Cho's attempt shows him given just a few seconds to examine the cube before starting, completing it just moments later. A series of UK records have been broken by quick-fingered Rubik's Cube solvers at the UK championships held in Stevenage, Hertfordshire on Sunday. Competitors as young as seven tackled the notoriously tricky cubes one-handed, blindfolded and even with their feet in a bid to become the top gamers of the weekend.


On Hash-Based Work Distribution Methods for Parallel Best-First Search

Journal of Artificial Intelligence Research

Parallel best-first search algorithms such as Hash Distributed A* (HDA*) distribute work among the processes using a global hash function. We analyze the search and communication overheads of state-of-the-art hash-based parallel best-first search algorithms, and show that although Zobrist hashing, the standard hash function used by HDA*, achieves good load balance for many domains, it incurs significant communication overhead since almost all generated nodes are transferred to a different processor than their parents. We propose Abstract Zobrist hashing, a new work distribution method for parallel search which, instead of computing a hash value based on the raw features of a state, uses a feature projection function to generate a set of abstract features which results in a higher locality, resulting in reduced communications overhead. We show that Abstract Zobrist hashing outperforms previous methods on search domains using hand-coded, domain specific feature projection functions. We then propose GRAZHDA*, a graph-partitioning based approach to automatically generating feature projection functions. GRAZHDA* seeks to approximate the partitioning of the actual search space graph by partitioning the domain transition graph, an abstraction of the state space graph. We show that GRAZHDA* outperforms previous methods on domain-independent planning.


The Exact Solution to Rank-1 L1-norm TUCKER2 Decomposition

arXiv.org Machine Learning

We study rank-1 {L1-norm-based TUCKER2} (L1-TUCKER2) decomposition of 3-way tensors, treated as a collection of $N$ $D \times M$ matrices that are to be jointly decomposed. Our contributions are as follows. i) We prove that the problem is equivalent to combinatorial optimization over $N$ antipodal-binary variables. ii) We derive the first two algorithms in the literature for its exact solution. The first algorithm has cost exponential in $N$; the second one has cost polynomial in $N$ (under a mild assumption). Our algorithms are accompanied by formal complexity analysis. iii) We conduct numerical studies to compare the performance of exact L1-TUCKER2 (proposed) with standard HOSVD, HOOI, GLRAM, PCA, L1-PCA, and TPCA-L1. Our studies show that L1-TUCKER2 outperforms (in tensor approximation) all the above counterparts when the processed data are outlier corrupted.


Google auto-detects your whereabouts to get local search results

Engadget

The tech titan has moved away from relying on country-specific domains to serve up localized results on mobile web, the Google app for iOS, as well as Search and Maps for desktop. Now, your location dictates the kind of results you'll get -- you could go to google.com.au, for instance, but if you're in New Zealand, you'll still get search results tailored for your current whereabouts. You'll know the location Google recognizes by looking at the lower left-hand corner of the page, as you can see above. Google will automatically detect if you go to another country and serve you results for your new location. So, you'll get results tailored for Japan if you go there, but Google will seamlessly transition back to United States when you fly back home.