AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

Shape complexity in cluster analysis

Aguilar, Eduardo J., Barbosa, Valmir C.

arXiv.org Artificial Intelligence, Sep-5-2022

In cluster analysis, a common first step is to scale the data aiming to better partition them into clusters. Even though many different techniques have throughout many years been introduced to this end, it is probably fair to say that the workhorse in this preprocessing phase has been to divide the data by the standard deviation along each dimension. Like division by the standard deviation, the great majority of scaling techniques can be said to have roots in some sort of statistical take on the data. Here we explore the use of multidimensional shapes of data, aiming to obtain scaling factors for use prior to clustering by some method, like k-means, that makes explicit use of distances between samples. We borrow from the field of cosmology and related areas the recently introduced notion of shape complexity, which in the variant we use is a relatively simple, data-dependent nonlinear function that we show can be used to help with the determination of appropriate scaling factors. Focusing on what might be called "midrange" distances, we formulate a constrained nonlinear programming problem and use it to produce candidate scaling-factor sets that can be sifted on the basis of further considerations of the data, say via expert knowledge. We give results on some iconic data sets, highlighting the strengths and potential weaknesses of the new approach. These results are generally positive across all the data sets used.
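
The baseline preprocessing step mentioned in the abstract, dividing the data by the standard deviation along each dimension, can be sketched as follows. This is only an illustration of that conventional scaling, not the shape-complexity method the paper proposes; the function name `scale_by_std` and the sample points are assumptions made for the example.

```python
import math
import statistics

def scale_by_std(points):
    """Divide each coordinate by the population standard deviation of its dimension.

    Hypothetical helper for illustration; not taken from the paper.
    """
    dims = list(zip(*points))
    stds = [statistics.pstdev(d) for d in dims]
    # Leave constant dimensions (standard deviation 0) unscaled to avoid dividing by zero.
    factors = [1.0 / s if s > 0 else 1.0 for s in stds]
    return [tuple(x * f for x, f in zip(p, factors)) for p in points]

# Made-up samples whose second dimension has a much larger numeric range.
points = [(1.0, 1000.0), (2.0, 1100.0), (1.5, 3000.0), (9.0, 1050.0)]
scaled = scale_by_std(points)

# Before scaling, Euclidean distances are dominated by the second dimension;
# after scaling, both dimensions have unit standard deviation.
raw_distance = math.dist(points[0], points[3])
scaled_distance = math.dist(scaled[0], scaled[3])
```

Because methods such as k-means rely directly on distances between samples, the choice of scaling factors changes which points count as close, which is why the abstract treats scaling as a step worth optimizing.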

artificial intelligence, dimension, machine learning, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1371/journal.pone.0286312

2205.08046

Country:

North America > United States > Wisconsin (0.05)
South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
North America > United States > New York > New York County > New York City (0.04)
(4 more...)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Autonomous Cross Domain Adaptation under Extreme Label Scarcity

Weng, Weiwei, Pratama, Mahardhika, Za'in, Choiru, De Carvalho, Marcus, Appan, Rakaraddi, Ashfahani, Andri, Yee, Edward Yapp Kien

arXiv.org Artificial Intelligence, Sep-4-2022

Cross-domain multistream classification is a challenging problem calling for fast domain adaptation to handle different but related streams in never-ending and rapidly changing environments. Although existing multistream classifiers assume no labelled samples in the target stream, they still incur expensive labelling costs since they require fully labelled samples of the source stream. This paper aims to attack the problem of extreme label shortage in cross-domain multistream classification problems, where only very few labelled samples of the source stream are provided before the process runs. Our solution, namely Learning Streaming Process from Partial Ground Truth (LEOPARD), is built upon a flexible deep clustering network whose hidden nodes, layers and clusters are added and removed dynamically with respect to varying data distributions. The deep clustering strategy is underpinned by a simultaneous feature learning and clustering technique leading to clustering-friendly latent spaces. The domain adaptation strategy relies on the adversarial domain adaptation technique, where a feature extractor is trained to fool a domain classifier classifying source and target streams. Our numerical study demonstrates the efficacy of LEOPARD, which delivers improved performance compared to prominent algorithms in 15 of 24 cases. The source code of LEOPARD is shared at https://github.com/wengweng001/LEOPARD.git to enable further study.

adaptation, leopard, target stream, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TNNLS.2022.3183356

2209.01548

Country:

Asia > Singapore (0.05)
Oceania > Australia > South Australia > Adelaide (0.04)
North America > United States > New York > New York County > New York City (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Nonbacktracking spectral clustering of nonuniform hypergraphs

Chodrow, Philip, Eikmeier, Nicole, Haddock, Jamie

arXiv.org Artificial Intelligence, Sep-3-2022

Spectral methods offer a tractable, global framework for clustering in graphs via eigenvector computations on graph matrices. Hypergraph data, in which entities interact on edges of arbitrary size, poses challenges for matrix representations and therefore for spectral clustering. We study spectral clustering for nonuniform hypergraphs based on the hypergraph nonbacktracking operator. After reviewing the definition of this operator and its basic properties, we prove a theorem of Ihara-Bass type which allows eigenpair computations to take place on a smaller matrix, often enabling faster computation. We then propose an alternating algorithm for inference in a hypergraph stochastic blockmodel via linearized belief-propagation which involves a spectral clustering step again using nonbacktracking operators. We provide proofs related to this algorithm that both formalize and extend several previous results. We pose several conjectures about the limits of spectral methods and detectability in hypergraph stochastic blockmodels in general, supporting these with in-expectation analysis of the eigenpairs of our studied operators. We perform experiments on real and synthetic data that demonstrate the benefits of hypergraph methods over graph-based ones when interactions of different sizes carry different information about cluster structure.
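
For readers unfamiliar with nonbacktracking operators, the sketch below builds the standard nonbacktracking matrix for an ordinary graph, where every edge joins exactly two nodes. It is not the hypergraph operator studied in the paper, which generalizes this idea to edges of arbitrary size; the function name and the triangle example are assumptions chosen for illustration.

```python
def nonbacktracking_matrix(edges):
    """Build the nonbacktracking matrix of a simple undirected graph.

    Illustrative graph-only version; the paper's hypergraph operator is more general.
    """
    # Each undirected edge {u, v} yields two directed edges, u->v and v->u.
    directed = []
    for u, v in edges:
        directed.append((u, v))
        directed.append((v, u))
    index = {e: i for i, e in enumerate(directed)}
    n = len(directed)
    B = [[0] * n for _ in range(n)]
    for (u, v) in directed:
        for (x, w) in directed:
            # u->v may be followed by v->w unless that step returns straight to u.
            if x == v and w != u:
                B[index[(u, v)]][index[(x, w)]] = 1
    return directed, B

# A triangle: each directed edge has exactly one nonbacktracking continuation.
triangle = [(0, 1), (1, 2), (0, 2)]
directed, B = nonbacktracking_matrix(triangle)
```

The matrix has one row per directed edge, so it is larger than the adjacency matrix. That size difference is the practical motivation for Ihara-Bass type results, which relate its spectrum to that of a smaller matrix.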

eigenvalue, eigenvector, hypergraph, (15 more...)

arXiv.org Artificial Intelligence

2204.13586

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Europe > Austria > Vienna (0.14)
North America > United States > New York (0.04)
(3 more...)

Genre: Research Report > New Finding (0.45)

Industry:

Government (0.67)
Education (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)

Hypergraph convolutional neural network-based clustering technique

Tran, Loc H., Trinh, Nguyen, Tran, Linh H.

arXiv.org Artificial Intelligence, Sep-3-2022

This paper presents a novel hypergraph convolutional neural network-based clustering technique. This technique is employed to solve the clustering problem for the Citeseer dataset and the Cora dataset. Each dataset contains the feature matrix and the incidence matrix of the hypergraph (i.e., constructed from the feature matrix). This novel clustering method utilizes both matrices. Initially, the hypergraph auto-encoders are employed to transform both the incidence matrix and the feature matrix from high-dimensional space to low-dimensional space. In the end, we apply the k-means clustering technique to the transformed matrix. The hypergraph convolutional neural network (CNN)-based clustering technique performed better in experiments than the other classical clustering techniques.
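
The final step of the pipeline, applying k-means to the low-dimensional representation, might look like the minimal sketch below. The `embedding` values are made-up stand-ins for the auto-encoder output, and the deterministic initialisation (first k points) is a simplification chosen for reproducibility rather than anything stated in the abstract.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; a generic sketch, not the authors' implementation."""
    # Assumption for the sketch: initialise centroids with the first k points.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

# Hypothetical 2-D embedding, standing in for the transformed matrix.
embedding = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
labels, centroids = kmeans(embedding, k=2)
```

In practice a library implementation with randomized or k-means++ initialisation would normally be preferred; the hand-written loop is only meant to show the assignment and update steps.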

cnn-based, convolutional neural network-based, dataset, (13 more...)

arXiv.org Artificial Intelligence

2209.01391

Country:

Asia > Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.06)
North America > United States > New York (0.04)
North America > United States > Minnesota (0.04)
(2 more...)

Genre: Research Report (0.82)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Higher-order Clustering and Pooling for Graph Neural Networks

Duval, Alexandre, Malliaros, Fragkiskos

arXiv.org Artificial Intelligence, Sep-2-2022

Graph Neural Networks achieve state-of-the-art performance on a plethora of graph classification tasks, especially due to pooling operators, which aggregate learned node embeddings hierarchically into a final graph representation. However, they are not only questioned by recent work showing on-par performance with random pooling, but also completely ignore higher-order connectivity patterns. To tackle this issue, we propose HoscPool, a clustering-based graph pooling operator that captures higher-order information hierarchically, leading to richer graph representations. In fact, we learn a probabilistic cluster assignment matrix end-to-end by minimising relaxed formulations of motif spectral clustering in our objective function, and we then extend it to a pooling operator. We evaluate HoscPool on graph classification tasks and its clustering component on graphs with ground-truth community structure, achieving best performance. Lastly, we provide a deep empirical analysis of pooling operators' inner functioning.
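
To make the idea of turning a cluster assignment matrix into a pooling operator concrete, the sketch below shows the coarsening step commonly used by clustering-based pooling layers: given an assignment matrix S, the pooled adjacency is S^T A S and the pooled features are S^T X. This is an assumption about the general family of methods, not a description of HoscPool's exact formulation, and a hard 0/1 assignment is used here in place of a learned probabilistic one.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def pool(adjacency, features, assignment):
    """Coarsen a graph with an assignment matrix S: A' = S^T A S, X' = S^T X."""
    s_t = transpose(assignment)
    pooled_adjacency = matmul(matmul(s_t, adjacency), assignment)
    pooled_features = matmul(s_t, features)
    return pooled_adjacency, pooled_features

# Path graph 0-1-2-3 with one scalar feature per node (made-up values).
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
X = [[1.0], [2.0], [3.0], [4.0]]
# Hard assignment for illustration: nodes {0, 1} -> cluster 0, nodes {2, 3} -> cluster 1.
S = [[1, 0], [1, 0], [0, 1], [0, 1]]

pooled_A, pooled_X = pool(A, X, S)
```

With this assignment, the diagonal of the pooled adjacency counts connections inside each cluster and the off-diagonal entries count connections between clusters.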

dataset, graph, node, (13 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3511808.3557353

2209.03473

Country:

North America > United States > Georgia > Fulton County > Atlanta (0.05)
Europe > France (0.04)
Africa > Senegal > Kolda Region > Kolda (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)

Classifying with Uncertain Data Envelopment Analysis

Garner, Casey, Holder, Allen

arXiv.org Artificial Intelligence, Sep-2-2022

Classifications organize entities into categories that identify similarities within a category and discern dissimilarities among categories, and they powerfully classify information in support of analysis. We propose a new classification scheme premised on the reality of imperfect data. Our computational model uses uncertain data envelopment analysis to define a classification's proximity to equitable efficiency, which is an aggregate measure of intra-similarity within a classification's categories. Our classification process has two overriding computational challenges: a loss of convexity and a combinatorially explosive search space. We overcome the first by establishing lower and upper bounds on the proximity value, and then by searching this range with a first-order algorithm. We overcome the second by adapting the p-median problem to initiate our exploration, and by then employing an iterative neighborhood search to finalize a classification. We conclude by classifying the thirty stocks in the Dow Jones Industrial Average into performant tiers and by classifying prostate treatments into clinically effectual categories.

category, classification, efficiency, (14 more...)

arXiv.org Artificial Intelligence

2209.01052

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
North America > United States > New York (0.04)
Oceania > New Zealand (0.04)
(2 more...)

Genre:

Research Report (0.64)
Overview (0.46)

Industry: Health & Medicine > Therapeutic Area > Oncology (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)

Clustering Algorithm Fundamentals and an Implementation in Python

#artificialintelligence, Sep-1-2022, 14:22:24 GMT

Clustering is a method that can help machine learning engineers understand unlabeled data by creating meaningful groups or clusters. This often reveals patterns in data, which can be a useful first step in machine learning. Since the data you are working with is unlabeled, clustering is an unsupervised machine learning task. Data is categorized into groups based on their similarity to each other through a metric known as the similarity measure, which is used to find out how similar the objects in the dataset are. To calculate this similarity measure, the feature data of the object in the dataset is used.
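
As a small illustration of a similarity measure computed from feature data, the sketch below compares feature vectors with Euclidean distance and cosine similarity. These two measures are common choices picked for the example; the summary above does not say which measure the article uses, and the vectors are made up.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors (smaller means more similar)."""
    return math.dist(a, b)

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero feature vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical feature vectors for three objects.
a = (1.0, 0.0)
b = (2.0, 0.0)
c = (0.0, 3.0)

distance_ab = euclidean_distance(a, b)
similarity_ab = cosine_similarity(a, b)
similarity_ac = cosine_similarity(a, c)
```

The two measures can disagree: `a` and `b` point in the same direction, so their cosine similarity is maximal even though they are not at the same location. Which measure is appropriate depends on what "similar" should mean for the data.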

algorithm, clustering algorithm, dataset, (14 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

A topic-aware graph neural network model for knowledge base updating

Tong, Jiajun, Wang, Zhixiao, Rui, Xiaobin

arXiv.org Artificial Intelligence, Sep-1-2022

The open-domain knowledge base is very important. It is usually extracted from encyclopedia websites and is widely used in knowledge retrieval systems, question answering systems, and recommendation systems. In practice, the key challenge is to maintain an up-to-date knowledge base. Rather than unwieldily fetching all of the data from the encyclopedia dumps, current knowledge base updating methods usually build a prediction model to determine whether entities need to be updated, so as to keep the knowledge base as fresh as possible while avoiding invalid fetching. However, these methods can only be defined in some specific fields and their results turn out to be obviously biased, due to problems with the data source and data structure. Users' query intentions for open-domain knowledge are often diverse, so we construct a topic-aware graph network for knowledge updating based on the user query log. Our method can be summarized as follows:

1. Extract entities from the user's log and select them as seeds.
2. Scrape the attributes of seed entities from the encyclopedia website, and construct the entity attribute graph for each entity in a self-supervised manner.
3. Use the entity attribute graph to train the GNN entity update model to determine whether the entity needs to be synchronized.
4. Use the encyclopedia knowledge to match and update the filtered entity with the entity in the knowledge base according to the minimum edit times algorithm.
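
The final matching step can be sketched under the assumption that the "minimum edit times algorithm" refers to something like Levenshtein edit distance; the abstract does not define it, so this reading is a guess. The helper names and the sample entity names below are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def closest_entity(name, knowledge_base):
    """Return the knowledge-base entry with the smallest edit distance to name."""
    return min(knowledge_base, key=lambda entry: edit_distance(name, entry))

# Made-up entity names, for illustration only.
knowledge_base = ["Graph neural network", "Neural network", "Knowledge graph"]
match = closest_entity("Grap neural network", knowledge_base)
```

A real system would likely also apply a distance threshold, so that an encyclopedia entity with no good counterpart in the knowledge base is treated as new rather than force-matched.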

artificial intelligence, knowledge management, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2208.14601

Country:

Asia > China (0.05)
North America > United States > New York (0.04)
Asia > Indonesia (0.04)

Genre: Research Report (0.50)

Industry:

Government > Regional Government > North America Government > United States Government (0.95)
Media (0.68)

Technology:

Information Technology > Knowledge Management > Knowledge Engineering (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.94)

Building the Intent Landscape of Real-World Conversational Corpora with Extractive Question-Answering Transformers

Corbeil, Jean-Philippe, Li, Mia Taige, Ghavidel, Hadi Abdi

arXiv.org Artificial Intelligence, Aug-30-2022

For companies with customer service, mapping intents inside their conversational data is crucial in building applications based on natural language understanding (NLU). Nevertheless, there is no established automated technique to gather the intents from noisy online chats or voice transcripts. Simple clustering approaches are not suited to intent-sparse dialogues. To solve this intent-landscape task, we propose an unsupervised pipeline that extracts the intents and the taxonomy of intents from real-world dialogues. Our pipeline mines intent-span candidates with an extractive question-answering ELECTRA model and leverages sentence embeddings to apply a low-level density clustering followed by a top-level hierarchical clustering. Our results demonstrate the generalization ability of an ELECTRA large model fine-tuned on the SQuAD2 dataset to understand dialogues. With the right prompting question, this model achieves a rate of linguistic validation on intent spans beyond 85%. We furthermore reconstructed the intent schemes of five domains from the MultiDoGo dataset with an average recall of 94.3%.

dataset, dialogue, proceedings, (14 more...)

arXiv.org Artificial Intelligence

2208.12886

Country:

North America > Canada (0.04)
North America > United States > Texas (0.04)
North America > United States > New York (0.04)
(6 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.88)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.62)

k-MS: A novel clustering algorithm based on morphological reconstruction

Rodrigues, É. O., Torok, L., Liatsis, P., Viterbo, J., Conci, A.

arXiv.org Artificial Intelligence, Aug-30-2022

This work proposes a clusterization algorithm called k-Morphological Sets (k-MS), based on morphological reconstruction and heuristics. k-MS is faster than the CPU-parallel k-Means in worst-case scenarios and produces enhanced visualizations of the dataset as well as very distinct clusterizations. It is also faster than similar clusterization methods that are sensitive to density and shapes, such as Mitosis and TRICLUST. In addition, k-MS is deterministic and has an intrinsic sense of maximal clusters that can be created for a given input sample and input parameters, differing from k-Means and other clusterization algorithms. In other words, given a constant k, a structuring element and a dataset, k-MS produces k or fewer clusters without using random/pseudo-random functions. Finally, the proposed algorithm also provides a straightforward means for removing noise from images or datasets in general.

algorithm, morphological reconstruction, reconstruction, (13 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.patcog.2016.12.027

2208.1439

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
South America > Brazil > Rio de Janeiro > Niterói (0.14)
South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
(4 more...)

Genre: Research Report (0.64)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
