Search
Needle in a Haystack: A Nifty Large-Scale Text Search Algorithm Tutorial
When coming across the term "text search", one usually thinks of a large body of text, which is indexed in a way that makes it possible to quickly look up one or more search terms when they are entered by a user. This is a classic problem for computer scientists, to which many solutions exist. But what if what's available for indexing beforehand is a group of search phrases, and only at runtime is a large body of text presented for searching? That question is what this trie data structure tutorial seeks to address. A real-world application for this scenario is matching a number of medical theses against a list of medical conditions and finding out which theses discuss which conditions.
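The reversed scenario described above (index the phrases first, scan the text later) can be sketched with a word-level trie. This is only an illustrative sketch, not the tutorial's own implementation: the names `tokenize`, `build_trie`, `find_phrases`, the `END` sentinel, and the sample conditions are all assumptions made for this example.

```python
import re

END = None  # assumed sentinel key marking the end of a phrase; tokens are always strings


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_trie(phrases):
    """Index the search phrases ahead of time as a word-level trie of nested dicts."""
    root = {}
    for phrase in phrases:
        node = root
        for word in tokenize(phrase):
            node = node.setdefault(word, {})
        node[END] = phrase  # remember which phrase ends at this node
    return root


def find_phrases(text, trie):
    """Scan the text once per start position and collect every indexed phrase found."""
    words = tokenize(text)
    found = set()
    for start in range(len(words)):
        node = trie
        for i in range(start, len(words)):
            node = node.get(words[i])
            if node is None:  # no indexed phrase continues with this word
                break
            if END in node:   # a complete phrase ends here
                found.add(node[END])
    return found


# Hypothetical data: a list of conditions and one sentence from a thesis.
conditions = ["heart disease", "diabetes", "heart failure"]
trie = build_trie(conditions)
matches = find_phrases("This thesis studies heart disease and type 2 diabetes.", trie)
```

Because the walk from each start position stops as soon as no phrase continues, the work per position is bounded by the length of the longest indexed phrase rather than by the number of phrases.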
Scalable Greedy Feature Selection via Weak Submodularity
Khanna, Rajiv, Elenberg, Ethan, Dimakis, Alexandros G., Negahban, Sahand, Ghosh, Joydeep
Greedy algorithms are widely used for problems in machine learning such as feature selection and set function optimization. Unfortunately, for large datasets, the running time of even greedy algorithms can be quite high. This is because for each greedy step we need to refit a model or calculate a function using the previously selected choices and the new candidate. Two algorithms that are faster approximations to the greedy forward selection were introduced recently ([Mirzasoleiman et al. 2013, 2015]). They achieve better performance by exploiting distributed computation and stochastic evaluation respectively. Both algorithms have provable performance guarantees for submodular functions. In this paper we show that, contrary to previously held opinion, submodularity is not required to obtain approximation guarantees for these two algorithms. Specifically, we show that a generalized concept of weak submodularity suffices to give multiplicative approximation guarantees. Our result extends the applicability of these algorithms to a larger class of functions. Furthermore, we show that a bounded submodularity ratio can be used to provide data-dependent bounds that can sometimes be tighter even for submodular functions. We empirically validate our work by showing superior performance of fast greedy approximations versus several established baselines on artificial and real datasets.
Cost-Optimal Learning of Causal Graphs
Kocaoglu, Murat, Dimakis, Alexandros G., Vishwanath, Sriram
We consider the problem of learning a causal graph over a set of variables with interventions. We study the cost-optimal causal graph learning problem: For a given skeleton (undirected version of the causal graph), design the set of interventions with minimum total cost, that can uniquely identify any causal graph with the given skeleton. We show that this problem is solvable in polynomial time. Later, we consider the case when the number of interventions is limited. For this case, we provide polynomial time algorithms when the skeleton is a tree or a clique tree. For a general chordal skeleton, we develop an efficient greedy algorithm, which can be improved when the causal graph skeleton is an interval graph.
Record-breaking robot solves Rubik's cube in 0.637 SECONDS
The Rubik's cube was devised by Hungarian architect Erno Rubik more than 30 years ago, but he likely never envisioned his puzzle being cracked this quickly. The machine, known as 'Sub1 Reloaded' and developed by German tech company Infineon, was aided by one of the world's most powerful microcomputers and solved a Rubik's cube in 0.637 seconds at the Electronica Trade Fair in Munich, Germany earlier this year. 'Guinness World Records has spent some time carefully reviewing the evidence, including ensuring that the cube and the pre-scrambling met all WCA standards, before confirming the new record today,' the organisation said. The robot took a fraction of a second to analyse the cube and make 21 moves to solve the puzzle. Its time of 0.637 seconds beat the previous world record of 0.887 seconds, set by an earlier prototype of the same machine.
Red Blob Games: Introduction to A*
The first thing to do when studying an algorithm is to understand the data. Input: Graph search algorithms, including A*, take a "graph" as input. A graph is a set of locations ("nodes") and the connections ("edges") between them. Here's the graph I gave to A*: A* doesn't see anything else. It only sees the graph.
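As a rough illustration of that input, a graph can be stored as a mapping from each node to the nodes it connects to. The node labels and edges below are made up for this sketch, and the `neighbors` helper is a hypothetical name, not necessarily the one the article uses.

```python
# Hypothetical directed graph: each key is a node, each value lists the nodes
# reachable from it along one edge. Labels and connections are invented.
graph = {
    "A": ["B"],
    "B": ["C"],
    "C": ["B", "D", "F"],
    "D": ["C", "E"],
    "E": ["F"],
    "F": [],
}


def neighbors(graph, node):
    """Return the nodes directly connected to `node`; this is all a search sees."""
    return graph[node]


reachable_from_c = neighbors(graph, "C")
```

A search algorithm working on this structure only ever asks for the neighbors of a node; anything not encoded as a node or an edge is invisible to it.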
Optimal Experiment Design for Causal Discovery from Fixed Number of Experiments
Ghassami, AmirEmad, Salehkaleybar, Saber, Kiyavash, Negar
We study the problem of causal structure learning over a set of random variables when the experimenter is allowed to perform at most $M$ experiments in a non-adaptive manner. We consider the optimal learning strategy in terms of minimizing the portions of the structure that remain unknown given the limited number of experiments in both Bayesian and minimax settings. We characterize the theoretical optimal solution and propose an algorithm, which designs the experiments efficiently in terms of time complexity. We show that for bounded degree graphs, in the minimax case and in the Bayesian case with uniform priors, our proposed algorithm is a $\rho$-approximation algorithm, where $\rho$ is independent of the order of the underlying graph. Simulations on both synthetic and real data show that the performance of our algorithm is very close to the optimal solution.
Distance-Penalized Active Learning Using Quantile Search
Lipor, John, Wong, Brandon, Scavia, Donald, Kerkez, Branko, Balzano, Laura
Adaptive sampling theory has shown that, with proper assumptions on the signal class, algorithms exist to reconstruct a signal in $\mathbb{R}^{d}$ with an optimal number of samples. We generalize this problem to the case of spatial signals, where the sampling cost is a function of both the number of samples taken and the distance traveled during estimation. This is motivated by our work studying regions of low oxygen concentration in the Great Lakes. We show that for one-dimensional threshold classifiers, a tradeoff between the number of samples taken and distance traveled can be achieved using a generalization of binary search, which we refer to as quantile search. We characterize both the estimation error after a fixed number of samples and the distance traveled in the noiseless case, as well as the estimation error in the case of noisy measurements. We illustrate our results in both simulations and experiments and show that our method outperforms existing algorithms in the majority of practical scenarios.
Best-First Width Search: Exploration and Exploitation in Classical Planning
Lipovetzky, Nir (University of Melbourne) | Geffner, Hector (ICREA and Universitat Pompeu Fabra)
It has been shown recently that the performance of greedy best-first search (GBFS) for computing plans that are not necessarily optimal can be improved by adding forms of exploration when reaching heuristic plateaus: from random walks to local GBFS searches. In this work, we address this problem but using structural exploration methods resulting from the ideas of width-based search. Width-based methods seek novel states, are not goal oriented, and their power has been shown recently in the Atari and GVG-AI video-games. We show first that width-based exploration in GBFS is more effective than GBFS with local GBFS search (GBFS-LS), and then proceed to formulate a simple and general computational framework where standard goal-oriented search (exploitation) and width-based search (structural exploration) are combined to yield a search scheme, best-first width search, that is better than both and which results in classical planning algorithms that outperform the state-of-the-art planners.
Local Search for Minimum Weight Dominating Set with Two-Level Configuration Checking and Frequency Based Scoring Function
Wang, Yiyuan, Cai, Shaowei, Yin, Minghao
The Minimum Weight Dominating Set (MWDS) problem is an important generalization of the Minimum Dominating Set (MDS) problem with extensive applications. This paper proposes a new local search algorithm for the MWDS problem, which is based on two new ideas. The first idea is a heuristic called two-level configuration checking (CC2), which is a new variant of a recent powerful configuration checking strategy (CC) for effectively avoiding the recent search paths. The second idea is a novel scoring function based on how frequently vertices are left uncovered. Our algorithm is called CC2FS, according to the names of the two ideas. The experimental results show that CC2FS performs much better than some state-of-the-art algorithms in terms of solution quality on a broad range of MWDS benchmarks.
Classification with Minimax Distance Measures
Chehreghani, Morteza Haghir (Xerox Research Centre Europe)
Minimax distance measures provide an effective way to capture the unknown underlying patterns and classes of the data in a non-parametric way. We develop a general-purpose framework to employ Minimax distances with any classification method that performs on numerical data. For this purpose, we establish a two-step strategy. First, we compute the pairwise Minimax distances between the objects, using the equivalence of Minimax distances over a graph and over a minimum spanning tree constructed on that. Then, we perform an embedding of the pairwise Minimax distances into a new vector space, such that their squared Euclidean distances in the new space are equal to their Minimax distances in the original space. We also consider the cases where multiple pairwise Minimax matrices are given, instead of a single one. Thereby, we propose an embedding via first summing up the centered matrices and then performing an eigenvalue decomposition. We experimentally validate our framework on different synthetic and real-world datasets.
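The first step of that strategy, computing pairwise Minimax distances through a minimum spanning tree, can be sketched as follows. This is not the paper's code: it assumes Euclidean base distances on a complete graph and uses a Kruskal-style merge, where the edge that first joins two components gives the Minimax distance for every pair across them. The function name `minimax_distances` and the sample points are hypothetical, and the embedding step is not covered.

```python
import itertools
import math


def minimax_distances(points):
    """Pairwise Minimax distances: the smallest achievable 'largest hop' between points.

    Edges are processed in increasing order, as in Kruskal's minimum spanning
    tree algorithm. When an edge of weight w first joins two components, w is
    the largest edge on the tree path between every cross pair, which equals
    their Minimax distance.
    """
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )
    root = list(range(n))                    # component label of each point
    members = {i: [i] for i in range(n)}     # component label -> its points
    d = [[0.0] * n for _ in range(n)]
    for w, i, j in edges:
        ri, rj = root[i], root[j]
        if ri == rj:
            continue                         # edge is not part of the spanning tree
        for a in members[ri]:
            for b in members[rj]:
                d[a][b] = d[b][a] = w
        if len(members[ri]) < len(members[rj]):
            ri, rj = rj, ri                  # merge the smaller component into the larger
        for b in members[rj]:
            root[b] = ri
        members[ri].extend(members.pop(rj))
    return d


# Hypothetical points on a line: two tight pairs separated by a gap of 2.
points = [(0, 0), (1, 0), (3, 0), (4, 0)]
d = minimax_distances(points)
```

In this example the Euclidean distance between the first and last point is 4, but their Minimax distance is 2, the largest single hop on the best path between them.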