AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

(Almost) All of Entity Resolution

Binette, Olivier, Steorts, Rebecca C.

arXiv.org Machine LearningAug-10-2020

Whether the goal is to estimate the number of people that live in a congressional district, to estimate the number of individuals that have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all these applications have a common theme - integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrated in a systematic and accurate way, commonly known as record linkage, de-duplication, or entity resolution. In this article, we review motivational applications and seminal papers that have led to the growth of this area. Specifically, we review the foundational work that began in the 1940's and 50's that have led to modern probabilistic record linkage. We review clustering approaches to entity resolution, semi- and fully supervised methods, and canonicalization, which are being used throughout industry and academia in applications such as human rights, official statistics, medicine, citation networks, among others. Finally, we discuss current research topics of practical importance.

entity resolution, information retrieval, machine learning, (14 more...)

arXiv.org Machine Learning

2008.04443

Country:

Asia > Middle East > Syria (0.28)
North America > United States > North Carolina > Durham County > Durham (0.14)
North America > United States > Michigan > Washtenaw County > Ann Arbor (0.14)
(23 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.67)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Government > Voting & Elections (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
(4 more...)

Add feedback

Community models for partially observed networks from surveys

Li, Tianxi, Levina, Elizaveta, Zhu, Ji

arXiv.org Machine LearningAug-9-2020

Communities are a common and widely studied structure in networks, typically under the assumption that the network is fully and correctly observed. In practice, network data are often collected through sampling schemes such as surveys. These sampling mechanisms introduce noise and bias which can obscure the community structure and invalidate assumptions underlying standard community detection methods. We propose a general model for a class of network sampling mechanisms based on survey designs, designed to enable more accurate community detection for network data collected in this fashion. We model edge sampling probabilities as a function of both individual preferences and community parameters, and show community detection can be done by spectral clustering under this general class of models. We also propose, as a special case of the general framework, a parametric model for directed networks we call the nomination stochastic block model, which allows for meaningful parameter interpretations and can be fitted by the method of moments. Both spectral clustering and the method of moments in this case are computationally efficient and come with theoretical guarantees of consistency. We evaluate the proposed model in simulation studies on both unweighted and weighted networks and apply it to a faculty hiring dataset, discovering a meaningful hierarchy of communities among US business schools.

artificial intelligence, data mining, machine learning, (17 more...)

arXiv.org Machine Learning

2008.03652

Country:

North America > United States > California (0.14)
North America > United States > Michigan (0.05)
North America > United States > Virginia (0.04)
(21 more...)

Genre: Research Report (1.00)

Industry:

Health & Medicine (0.88)
Education > Educational Setting > Higher Education (0.68)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

A Novel Community Detection Based Genetic Algorithm for Feature Selection

Rostami, Mehrdad, Berahmand, Kamal, Forouzandeh, Saman

arXiv.org Machine LearningAug-8-2020

The selection of features is an essential data preprocessing stage in data mining. The core principle of feature selection seems to be to pick a subset of possible features by excluding features with almost no predictive information as well as highly associated redundant features. In the past several years, a variety of meta-heuristic methods were introduced to eliminate redundant and irrelevant features as much as possible from high-dimensional datasets. Among the main disadvantages of present meta-heuristic based approaches is that they are often neglecting the correlation between a set of selected features. In this article, for the purpose of feature selection, the authors propose a genetic algorithm based on community detection, which functions in three steps. The feature similarities are calculated in the first step. The features are classified by community detection algorithms into clusters throughout the second step. In the third step, features are picked by a genetic algorithm with a new community-based repair operation. Nine benchmark classification problems were analyzed in terms of the performance of the presented approach. Also, the authors have compared the efficiency of the proposed approach with the findings from four available algorithms for feature selection. The findings indicate that the new approach continuously yields improved classification accuracy.

evolutionary algorithm, machine learning, selection, (15 more...)

arXiv.org Machine Learning

2008.03543

Country:

Oceania > New Zealand > North Island > Waikato (0.04)
Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
Oceania > Australia > Queensland > Brisbane (0.04)
(2 more...)

Genre: Research Report (0.82)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.36)

Add feedback

Nystr\"om Approximation with Nonnegative Matrix Factorization

Fu, Yongquan

arXiv.org Machine LearningAug-7-2020

Motivated by the needs of estimating the proximity clustering with partial distance measurements from vantage points or landmarks for remote networked systems, we show that the proximity clustering problem can be effectively formulated as the Nystr\"om approximation problem, which solves the kernel K-means clustering problem in the complex space. We implement the Nystr\"om approximation based on a landmark based Nonnegative Matrix Factorization (NMF) process. Evaluation results show that the proposed method finds nearly optimal clustering quality on both synthetic and real-world data sets as we vary the range of parameter choices and network conditions.

artificial intelligence, machine learning, matrix, (16 more...)

arXiv.org Machine Learning

2008.03399

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
Asia > South Korea (0.04)

Genre: Research Report (0.70)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

On Hyperparameter Optimization of Machine Learning Algorithms: Theory and Practice

Yang, Li, Shami, Abdallah

arXiv.org Machine LearningAug-7-2020

Machine learning algorithms have been used widely in various applications and areas. To fit a machine learning model into different problems, its hyper-parameters must be tuned. Selecting the best hyper-parameter configuration for machine learning models has a direct impact on the model's performance. It often requires deep knowledge of machine learning algorithms and appropriate hyper-parameter optimization techniques. Although several automatic optimization techniques exist, they have different strengths and drawbacks when applied to different types of problems. In this paper, optimizing the hyper-parameters of common machine learning models is studied. We introduce several state-of-the-art optimization techniques and discuss how to apply them to machine learning algorithms. Many available libraries and frameworks developed for hyper-parameter optimization problems are provided, and some open challenges of hyper-parameter optimization research are also discussed in this paper. Moreover, experiments are conducted on benchmark datasets to compare the performance of different optimization methods and provide practical examples of hyper-parameter optimization. This survey paper will help industrial users, data analysts, and researchers to better develop machine learning models by identifying the proper hyper-parameter configurations effectively.

artificial intelligence, bayesian inference, machine learning, (20 more...)

arXiv.org Machine Learning

doi: 10.1016/j.neucom.2020.07.061

2007.15745

Country:

North America > Canada > Ontario > Middlesex County > London (0.14)
Asia > Middle East > Jordan (0.04)
Asia > China > Hubei Province > Wuhan (0.04)
(5 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.46)

Industry:

Education (0.92)
Information Technology > Security & Privacy (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(4 more...)

Add feedback

Real Time Anomaly Detection for Cognitive Intelligence - XenonStack

#artificialintelligenceAug-5-2020, 09:26:41 GMT

Classical Analytics – Around ten years ago, the tools for analytics or the available resources were excel, SQL databases, and similar relatively simple ones when compared to the advanced ones that are available nowadays. The analytics also used to target things like reporting, customer classification, sales trend whether they are going up or down, etc.In this article we will discuss about Real Time Anomaly Detection. As time passed by the amount of data has got a revolutionary explosion with various factors like social media data, transaction records, sensor information, etc. in the past five years. With the increase of data, how data is stored has also changed. It used to be SQL databases the most and analytics used to happen for the same during the ideal time. The analytics also used to be serialized. Later, NoSQL databases started to replace the traditional SQL databases since the data size has become huge and the analysis also changed from serial analytics to parallel processing and distributed systems for quick results.

anomaly, data mining, machine learning, (13 more...)

#artificialintelligence

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

Add feedback

QUBO Formulations for Training Machine Learning Models

Date, Prasanna, Arthur, Davis, Pusey-Nazzaro, Lauren

arXiv.org Machine LearningAug-5-2020

Training machine learning models on classical computers is usually a time and compute intensive process. With Moore's law coming to an end and ever increasing demand for large-scale data analysis using machine learning, we must leverage non-conventional computing paradigms like quantum computing to train machine learning models efficiently. Adiabatic quantum computers like the D-Wave 2000Q can approximately solve NP-hard optimization problems, such as the quadratic unconstrained binary optimization (QUBO), faster than classical computers. Since many machine learning problems are also NP-hard, we believe adiabatic quantum computers might be instrumental in training machine learning models efficiently in the post Moore's law era. In order to solve a problem on adiabatic quantum computers, it must be formulated as a QUBO problem, which is a challenging task in itself. In this paper, we formulate the training problems of three machine learning models---linear regression, support vector machine (SVM) and equal-sized k-means clustering---as QUBO problems so that they can be trained on adiabatic quantum computers efficiently. We also analyze the time and space complexities of our formulations and compare them to the state-of-the-art classical algorithms for training these machine learning models. We show that the time and space complexities of our formulations are better (in the case of SVM and equal-sized k-means clustering) or equivalent (in case of linear regression) to their classical counterparts.

artificial intelligence, machine learning, quantum computer, (17 more...)

arXiv.org Machine Learning

2008.02369

Country:

North America > United States > Tennessee (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Missouri (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry: Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

Add feedback

Efficient Optimization of Dominant Set Clustering with Frank-Wolfe Algorithms

Johnell, Carl, Chehreghani, Morteza Haghir

arXiv.org Machine LearningAug-5-2020

We study Frank-Wolfe algorithms -- standard, pairwise, and away-steps -- for efficient optimization of Dominant Set Clustering. We present a unified and computationally efficient framework to employ the different variants of Frank-Wolfe methods, and we investigate its effectiveness via several experimental studies. In addition, we provide explicit convergence rates for the algorithms in terms of the so-called Frank-Wolfe gap. The theoretical analysis has been specialized to the problem of Dominant Set Clustering and is thus more easily accessible compared to prior work.

artificial intelligence, machine learning, natural language, (14 more...)

arXiv.org Machine Learning

2007.11652

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)
Information Technology > Artificial Intelligence > Natural Language (0.67)

Add feedback

Space-filling Curves for High-performance Data Mining

Böhm, Christian

arXiv.org Machine LearningAug-4-2020

Space-filling curves like the Hilbert-curve, Peano-curve and Z-order map natural or real numbers from a two or higher dimensional space to a one dimensional space preserving locality. They have numerous applications like search structures, computer graphics, numerical simulation, cryptographics and can be used to make various algorithms cache-oblivious. In this paper, we describe some details of the Hilbert-curve. We define the Hilbert-curve in terms of a finite automaton of Mealy-type which determines from the two-dimensional coordinate space the Hilbert order value and vice versa in a logarithmic number of steps. And we define a context-free grammar to generate the whole curve in a time which is linear in the number of generated coordinate/order value pairs, i.e. a constant time per coordinate pair or order value. We also review two different strategies which enable the generation of curves without the usual restriction to square-like grids where the side-length is a power of two. Finally, we elaborate on a few applications, namely matrix multiplication, Cholesky decomposition, the Floyd-Warshall algorithm, k-Means clustering, and the similarity join.

algorithm, mealy automaton, space-filling curve, (13 more...)

arXiv.org Machine Learning

2008.01684

Country:

Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > United States > New Jersey > Atlantic County > Atlantic City (0.04)
(6 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback

Sketching Datasets for Large-Scale Learning (long version)

Gribonval, Rémi, Chatalic, Antoine, Keriven, Nicolas, Schellekens, Vincent, Jacques, Laurent, Schniter, Philip

arXiv.org Machine LearningAug-4-2020

This article considers "sketched learning," or "compressive learning," an approach to large-scale machine learning where datasets are massively compressed before learning (e.g., clustering, classification, or regression) is performed. In particular, a "sketch" is first constructed by computing carefully chosen nonlinear random features (e.g., random Fourier features) and averaging them over the whole dataset. Parameters are then learned from the sketch, without access to the original dataset. This article surveys the current state-of-the-art in sketched learning, including the main concepts and algorithms, their connections with established signal-processing methods, existing theoretical guarantees---on both information preservation and privacy preservation, and important open problems.

data mining, latexit sha1, machine learning, (19 more...)

arXiv.org Machine Learning

2008.01839

Country:

Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
North America > United States > Ohio > Franklin County > Columbus (0.04)
North America > United States > New York > New York County > New York City (0.04)
(4 more...)

Genre:

Overview (1.00)
Research Report (0.83)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Add feedback