Clustering
(Almost) All of Entity Resolution
Binette, Olivier, Steorts, Rebecca C.
Whether the goal is to estimate the number of people that live in a congressional district, to estimate the number of individuals that have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all these applications have a common theme - integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrated in a systematic and accurate way, commonly known as record linkage, de-duplication, or entity resolution. In this article, we review motivational applications and seminal papers that have led to the growth of this area. Specifically, we review the foundational work that began in the 1940's and 50's that have led to modern probabilistic record linkage. We review clustering approaches to entity resolution, semi- and fully supervised methods, and canonicalization, which are being used throughout industry and academia in applications such as human rights, official statistics, medicine, citation networks, among others. Finally, we discuss current research topics of practical importance.
Community models for partially observed networks from surveys
Li, Tianxi, Levina, Elizaveta, Zhu, Ji
Communities are a common and widely studied structure in networks, typically under the assumption that the network is fully and correctly observed. In practice, network data are often collected through sampling schemes such as surveys. These sampling mechanisms introduce noise and bias which can obscure the community structure and invalidate assumptions underlying standard community detection methods. We propose a general model for a class of network sampling mechanisms based on survey designs, designed to enable more accurate community detection for network data collected in this fashion. We model edge sampling probabilities as a function of both individual preferences and community parameters, and show community detection can be done by spectral clustering under this general class of models. We also propose, as a special case of the general framework, a parametric model for directed networks we call the nomination stochastic block model, which allows for meaningful parameter interpretations and can be fitted by the method of moments. Both spectral clustering and the method of moments in this case are computationally efficient and come with theoretical guarantees of consistency. We evaluate the proposed model in simulation studies on both unweighted and weighted networks and apply it to a faculty hiring dataset, discovering a meaningful hierarchy of communities among US business schools.
A Novel Community Detection Based Genetic Algorithm for Feature Selection
Rostami, Mehrdad, Berahmand, Kamal, Forouzandeh, Saman
The selection of features is an essential data preprocessing stage in data mining. The core principle of feature selection seems to be to pick a subset of possible features by excluding features with almost no predictive information as well as highly associated redundant features. In the past several years, a variety of meta-heuristic methods were introduced to eliminate redundant and irrelevant features as much as possible from high-dimensional datasets. Among the main disadvantages of present meta-heuristic based approaches is that they are often neglecting the correlation between a set of selected features. In this article, for the purpose of feature selection, the authors propose a genetic algorithm based on community detection, which functions in three steps. The feature similarities are calculated in the first step. The features are classified by community detection algorithms into clusters throughout the second step. In the third step, features are picked by a genetic algorithm with a new community-based repair operation. Nine benchmark classification problems were analyzed in terms of the performance of the presented approach. Also, the authors have compared the efficiency of the proposed approach with the findings from four available algorithms for feature selection. The findings indicate that the new approach continuously yields improved classification accuracy.
Nystr\"om Approximation with Nonnegative Matrix Factorization
Motivated by the needs of estimating the proximity clustering with partial distance measurements from vantage points or landmarks for remote networked systems, we show that the proximity clustering problem can be effectively formulated as the Nystr\"om approximation problem, which solves the kernel K-means clustering problem in the complex space. We implement the Nystr\"om approximation based on a landmark based Nonnegative Matrix Factorization (NMF) process. Evaluation results show that the proposed method finds nearly optimal clustering quality on both synthetic and real-world data sets as we vary the range of parameter choices and network conditions.
On Hyperparameter Optimization of Machine Learning Algorithms: Theory and Practice
Machine learning algorithms have been used widely in various applications and areas. To fit a machine learning model into different problems, its hyper-parameters must be tuned. Selecting the best hyper-parameter configuration for machine learning models has a direct impact on the model's performance. It often requires deep knowledge of machine learning algorithms and appropriate hyper-parameter optimization techniques. Although several automatic optimization techniques exist, they have different strengths and drawbacks when applied to different types of problems. In this paper, optimizing the hyper-parameters of common machine learning models is studied. We introduce several state-of-the-art optimization techniques and discuss how to apply them to machine learning algorithms. Many available libraries and frameworks developed for hyper-parameter optimization problems are provided, and some open challenges of hyper-parameter optimization research are also discussed in this paper. Moreover, experiments are conducted on benchmark datasets to compare the performance of different optimization methods and provide practical examples of hyper-parameter optimization. This survey paper will help industrial users, data analysts, and researchers to better develop machine learning models by identifying the proper hyper-parameter configurations effectively.
Real Time Anomaly Detection for Cognitive Intelligence - XenonStack
Classical Analytics – Around ten years ago, the tools for analytics or the available resources were excel, SQL databases, and similar relatively simple ones when compared to the advanced ones that are available nowadays. The analytics also used to target things like reporting, customer classification, sales trend whether they are going up or down, etc.In this article we will discuss about Real Time Anomaly Detection. As time passed by the amount of data has got a revolutionary explosion with various factors like social media data, transaction records, sensor information, etc. in the past five years. With the increase of data, how data is stored has also changed. It used to be SQL databases the most and analytics used to happen for the same during the ideal time. The analytics also used to be serialized. Later, NoSQL databases started to replace the traditional SQL databases since the data size has become huge and the analysis also changed from serial analytics to parallel processing and distributed systems for quick results.
QUBO Formulations for Training Machine Learning Models
Date, Prasanna, Arthur, Davis, Pusey-Nazzaro, Lauren
Training machine learning models on classical computers is usually a time and compute intensive process. With Moore's law coming to an end and ever increasing demand for large-scale data analysis using machine learning, we must leverage non-conventional computing paradigms like quantum computing to train machine learning models efficiently. Adiabatic quantum computers like the D-Wave 2000Q can approximately solve NP-hard optimization problems, such as the quadratic unconstrained binary optimization (QUBO), faster than classical computers. Since many machine learning problems are also NP-hard, we believe adiabatic quantum computers might be instrumental in training machine learning models efficiently in the post Moore's law era. In order to solve a problem on adiabatic quantum computers, it must be formulated as a QUBO problem, which is a challenging task in itself. In this paper, we formulate the training problems of three machine learning models---linear regression, support vector machine (SVM) and equal-sized k-means clustering---as QUBO problems so that they can be trained on adiabatic quantum computers efficiently. We also analyze the time and space complexities of our formulations and compare them to the state-of-the-art classical algorithms for training these machine learning models. We show that the time and space complexities of our formulations are better (in the case of SVM and equal-sized k-means clustering) or equivalent (in case of linear regression) to their classical counterparts.
Efficient Optimization of Dominant Set Clustering with Frank-Wolfe Algorithms
Johnell, Carl, Chehreghani, Morteza Haghir
We study Frank-Wolfe algorithms -- standard, pairwise, and away-steps -- for efficient optimization of Dominant Set Clustering. We present a unified and computationally efficient framework to employ the different variants of Frank-Wolfe methods, and we investigate its effectiveness via several experimental studies. In addition, we provide explicit convergence rates for the algorithms in terms of the so-called Frank-Wolfe gap. The theoretical analysis has been specialized to the problem of Dominant Set Clustering and is thus more easily accessible compared to prior work.
Space-filling Curves for High-performance Data Mining
Space-filling curves like the Hilbert-curve, Peano-curve and Z-order map natural or real numbers from a two or higher dimensional space to a one dimensional space preserving locality. They have numerous applications like search structures, computer graphics, numerical simulation, cryptographics and can be used to make various algorithms cache-oblivious. In this paper, we describe some details of the Hilbert-curve. We define the Hilbert-curve in terms of a finite automaton of Mealy-type which determines from the two-dimensional coordinate space the Hilbert order value and vice versa in a logarithmic number of steps. And we define a context-free grammar to generate the whole curve in a time which is linear in the number of generated coordinate/order value pairs, i.e. a constant time per coordinate pair or order value. We also review two different strategies which enable the generation of curves without the usual restriction to square-like grids where the side-length is a power of two. Finally, we elaborate on a few applications, namely matrix multiplication, Cholesky decomposition, the Floyd-Warshall algorithm, k-Means clustering, and the similarity join.
Sketching Datasets for Large-Scale Learning (long version)
Gribonval, Rémi, Chatalic, Antoine, Keriven, Nicolas, Schellekens, Vincent, Jacques, Laurent, Schniter, Philip
This article considers "sketched learning," or "compressive learning," an approach to large-scale machine learning where datasets are massively compressed before learning (e.g., clustering, classification, or regression) is performed. In particular, a "sketch" is first constructed by computing carefully chosen nonlinear random features (e.g., random Fourier features) and averaging them over the whole dataset. Parameters are then learned from the sketch, without access to the original dataset. This article surveys the current state-of-the-art in sketched learning, including the main concepts and algorithms, their connections with established signal-processing methods, existing theoretical guarantees---on both information preservation and privacy preservation, and important open problems.