AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

How Does DBSCAN Clustering Work?

#artificialintelligenceSep-27-2021, 03:51:11 GMT

Mastering unsupervised learning opens up a broad range of avenues for a data scientist. There is so much scope in the vast expanse of unsupervised learning and yet a lot of beginners in machine learning tend to shy away from it. In fact, I'm sure most newcomers will stick to basic clustering algorithms like K-Means clustering and hierarchical clustering. While there's nothing wrong with that approach, it does limit what you can do when faced with clustering projects. And why limit yourself when you can expand your learning, knowledge, and skillset by learning the powerful DBSCAN clustering algorithm?

algorithm, dbscan, minpoint, (14 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Dimension Reduction for Data with Heterogeneous Missingness

Ling, Yurong, Liu, Zijing, Xue, Jing-Hao

arXiv.org Machine LearningSep-27-2021

Dimension reduction plays a pivotal role in analysing high-dimensional data. However, observations with missing values present serious difficulties in directly applying standard dimension reduction techniques. As a large number of dimension reduction approaches are based on the Gram matrix, we first investigate the effects of missingness on dimension reduction by studying the statistical properties of the Gram matrix with or without missingness, and then we present a bias-corrected Gram matrix with nice statistical properties under heterogeneous missingness. Extensive empirical results, on both simulated and publicly available real datasets, show that the proposed unbiased Gram matrix can significantly improve a broad spectrum of representative dimension reduction approaches.

dataset, matrix, visualisation, (16 more...)

arXiv.org Machine Learning

2109.11765

Country:

Europe > United Kingdom > England > Greater London > London (0.04)
North America > United States > Virginia > Arlington County > Arlington (0.04)

Genre: Research Report (0.50)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area (0.92)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning in High Dimensional Spaces (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Path Based Hierarchical Clustering on Knowledge Graphs

Pietrasik, Marcin, Reformat, Marek

arXiv.org Artificial IntelligenceSep-27-2021

Knowledge graphs have emerged as a widely adopted medium for storing relational data, making methods for automatically reasoning with them highly desirable. In this paper, we present a novel approach for inducing a hierarchy of subject clusters, building upon our earlier work done in taxonomy induction. Our method first constructs a tag hierarchy before assigning subjects to clusters on this hierarchy. We quantitatively demonstrate our method's ability to induce a coherent cluster hierarchy on three real-world datasets.

cluster hierarchy, hierarchy, knowledge graph, (14 more...)

arXiv.org Artificial Intelligence

2109.13178

Country:

South America > Uruguay > Montevideo > Montevideo (0.04)
South America > Colombia (0.04)
North America > Cuba > La Habana Province > Havana (0.04)
(8 more...)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (0.79)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.51)

Add feedback

A Clustering and Demotion Based Algorithm for Inductive Learning of Default Theories

Wang, Huaduo, Shakerin, Farhad, Gupta, Gopal

arXiv.org Artificial IntelligenceSep-26-2021

We present a clustering- and demotion-based algorithm called Kmeans-FOLD to induce nonmonotonic logic programs from positive and negative examples. Our algorithm improves upon-and is inspired by-the FOLD algorithm. The FOLD algorithm itself is an improvement over the FOIL algorithm. Our algorithm generates a more concise logic program compared to the FOLD algorithm. Our algorithm uses the K-means based clustering method to cluster the input positive samples before applying the FOLD algorithm. Positive examples that are covered by the partially learned program in intermediate steps are not discarded as in the FOLD algorithm, rather they are demoted, i.e., their weights are reduced in subsequent iterations of the algorithm. Our experiments on the UCI dataset show that a combination of K-Means clustering and our demotion strategy produces significant improvement for datasets with more than one cluster of positive examples. The resulting induced program is also more concise and therefore easier to understand compared to the FOLD and ALEPH systems, two state of the art inductive logic programming (ILP) systems.

algorithm, dataset, fold algorithm, (16 more...)

arXiv.org Artificial Intelligence

2109.12624

Country:

North America > United States > Wisconsin (0.04)
North America > United States > Texas (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
(4 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Modelling the transition to a low-carbon energy supply

Kell, Alexander

arXiv.org Artificial IntelligenceSep-25-2021

A transition to a low-carbon electricity supply is crucial to limit the impacts of climate change. Reducing carbon emissions could help prevent the world from reaching a tipping point, where runaway emissions are likely. Runaway emissions could lead to extremes in weather conditions around the world -- especially in problematic regions unable to cope with these conditions. However, the movement to a low-carbon energy supply can not happen instantaneously due to the existing fossil-fuel infrastructure and the requirement to maintain a reliable energy supply. Therefore, a low-carbon transition is required, however, the decisions various stakeholders should make over the coming decades to reduce these carbon emissions are not obvious. This is due to many long-term uncertainties, such as electricity, fuel and generation costs, human behaviour and the size of electricity demand. A well choreographed low-carbon transition is, therefore, required between all of the heterogenous actors in the system, as opposed to changing the behaviour of a single, centralised actor. The objective of this thesis is to create a novel, open-source agent-based model to better understand the manner in which the whole electricity market reacts to different factors using state-of-the-art machine learning and artificial intelligence methods. In contrast to other works, this thesis looks at both the long-term and short-term impact that different behaviours have on the electricity market by using these state-of-the-art methods.

evolutionary algorithm, machine learning, wind energy, (30 more...)

arXiv.org Artificial Intelligence

2111.00987

Country:

Asia (0.67)
Oceania > Australia (0.27)
Europe > Italy (0.14)
(4 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Overview (1.00)
Instructional Material (1.00)

Industry:

Materials > Metals & Mining > Coal (1.00)
Energy > Renewable > Wind (1.00)
Energy > Renewable > Solar (1.00)
(3 more...)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
(10 more...)

Add feedback

Integrating Unsupervised Clustering and Label-specific Oversampling to Tackle Imbalanced Multi-label Data

Sadhukhan, Payel, Pakrashi, Arjun, Palit, Sarbani, Mac Namee, Brian

arXiv.org Artificial IntelligenceSep-25-2021

There is often a mixture of very frequent labels and very infrequent labels in multi-label datatsets. This variation in label frequency, a type class imbalance, creates a significant challenge for building efficient multi-label classification algorithms. In this paper, we tackle this problem by proposing a minority class oversampling scheme, UCLSO, which integrates Unsupervised Clustering and Label-Specific data Oversampling. Clustering is performed to find out the key distinct and locally connected regions of a multi-label dataset (irrespective of the label information). Next, for each label, we explore the distributions of minority points in the cluster sets. Only the minority points within a cluster are used to generate the synthetic minority points that are used for oversampling. Even though the cluster set is the same across all labels, the distributions of the synthetic minority points will vary across the labels. The training dataset is augmented with the set of label-specific synthetic minority points, and classifiers are trained to predict the relevance of each label independently. Experiments using 12 multi-label datasets and several multi-label algorithms show that the proposed method performed very well compared to the other competing algorithms.

artificial intelligence, data mining, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2109.12421

Country:

Asia > India > West Bengal > Kolkata (0.04)
North America > United States > New York > New York County > New York City (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.46)

Add feedback

Cluster-based Mention Typing for Named Entity Disambiguation

Çelebi, Arda, Özgür, Arzucan

arXiv.org Artificial IntelligenceSep-23-2021

An entity mention in text such as "Washington" may correspond to many different named entities such as the city "Washington D.C." or the newspaper "Washington Post." The goal of named entity disambiguation is to identify the mentioned named entity correctly among all possible candidates. If the type (e.g. location or person) of a mentioned entity can be correctly predicted from the context, it may increase the chance of selecting the right candidate by assigning low probability to the unlikely ones. This paper proposes cluster-based mention typing for named entity disambiguation. The aim of mention typing is to predict the type of a given mention based on its context. Generally, manually curated type taxonomies such as Wikipedia categories are used. We introduce cluster-based mention typing, where named entities are clustered based on their contextual similarities and the cluster ids are assigned as types. The hyperlinked mentions and their context in Wikipedia are used in order to obtain these cluster-based types. Then, mention typing models are trained on these mentions, which have been labeled with their cluster-based types through distant supervision. At the named entity disambiguation phase, first the cluster-based types of a given mention are predicted and then, these types are used as features in a ranking model to select the best entity among the candidates. We represent entities at multiple contextual levels and obtain different clusterings (and thus typing models) based on each level. As each clustering breaks the entity space differently, mention typing based on each clustering discriminates the mention differently. When predictions from all typing models are used together, our system achieves better or comparable results based on randomization tests with respect to the state-of-the-art levels on four defacto test sets.

machine learning, natural language, surface form, (21 more...)

arXiv.org Artificial Intelligence

doi: 10.1017/S1351324920000443

2109.11389

Country:

North America > United States > District of Columbia > Washington (0.24)
Europe > France (0.04)
North America > United States > Washington (0.04)
(5 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Media > News (0.54)
Government > Regional Government > North America Government > United States Government (0.46)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Clustering performance analysis using new correlation based cluster validity indices

Wiroonsri, Nathakhun

arXiv.org Machine LearningSep-23-2021

There are various cluster validity measures used for evaluating clustering results. One of the main objective of using these measures is to seek the optimal unknown number of clusters. Some measures work well for clusters with different densities, sizes and shapes. Yet, one of the weakness that those validity measures share is that they sometimes provide only one clear optimal number of clusters. That number is actually unknown and there might be more than one potential sub-optimal options that a user may wish to choose based on different applications. We develop two new cluster validity indices based on a correlation between an actual distance between a pair of data points and a centroid distance of clusters that the two points locate in. Our proposed indices constantly yield several peaks at different numbers of clusters which overcome the weakness previously stated. Furthermore, the introduced correlation can also be used for evaluating the quality of a selected clustering result. Several experiments in different scenarios including the well-known iris data set and a real-world marketing application have been conducted in order to compare the proposed validity indices with several well-known ones.

correlation, optimal number, validity measure, (12 more...)

arXiv.org Machine Learning

2109.11172

Country: North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (0.40)

Industry: Health & Medicine (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Scalable Fact-checking with Human-in-the-Loop

Yang, Jing, Vega-Oliveros, Didier, Seibt, Tais, Rocha, Anderson

arXiv.org Artificial IntelligenceSep-22-2021

Researchers have been investigating automated solutions for fact-checking in a variety of fronts. However, current approaches often overlook the fact that the amount of information released every day is escalating, and a large amount of them overlap. Intending to accelerate fact-checking, we bridge this gap by grouping similar messages and summarizing them into aggregated claims. Specifically, we first clean a set of social media posts (e.g., tweets) and build a graph of all posts based on their semantics; Then, we perform two clustering methods to group the messages for further claim summarization. We evaluate the summaries both quantitatively with ROUGE scores and qualitatively with human evaluation. We also generate a graph of summaries to verify that there is no significant overlap among them. The results reduced 28,818 original messages to 700 summary claims, showing the potential to speed up the fact-checking process by organizing and selecting representative claims from massive disorganized and redundant messages.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/WIFS53200.2021.9648388

2109.10992

Country:

Europe > Netherlands > South Holland > Leiden (0.08)
South America > Brazil > São Paulo > Campinas (0.04)
South America > Brazil > Rio Grande do Sul > Porto Alegre (0.04)
(4 more...)

Genre: Research Report (0.50)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
Health & Medicine > Epidemiology (0.76)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Add feedback

High-dimensional Bayesian Optimization for CNN Auto Pruning with Clustering and Rollback

Mu, Jiandong, Fan, Hanwei, Zhang, Wei

arXiv.org Artificial IntelligenceSep-22-2021

Pruning has been widely used to slim convolutional neural network (CNN) models to achieve a good trade-off between accuracy and model size so that the pruned models become feasible for power-constrained devices such as mobile phones. This process can be automated to avoid the expensive hand-crafted efforts and to explore a large pruning space automatically so that the high-performance pruning policy can be achieved efficiently. Nowadays, reinforcement learning (RL) and Bayesian optimization (BO)-based auto pruners are widely used due to their solid theoretical foundation, universality, and high compressing quality. However, the RL agent suffers from long training times and high variance of results, while the BO agent is time-consuming for high-dimensional design spaces. In this work, we propose an enhanced BO agent to obtain significant acceleration for auto pruning in high-dimensional design spaces. To achieve this, a novel clustering algorithm is proposed to reduce the dimension of the design space to speedup the searching process. Then, a roll-back algorithm is proposed to recover the high-dimensional design space so that higher pruning accuracy can be obtained. We validate our proposed method on ResNet, MobileNet, and VGG models, and our experiments show that the proposed method significantly improves the accuracy of BO when pruning very deep CNN models. Moreover, our method achieves lower variance and shorter time than the RL-based counterpart.

accuracy, agent, design space, (17 more...)

arXiv.org Artificial Intelligence

2109.10591

Country: Asia > China > Hong Kong (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)

Add feedback