AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

MetaStackVis: Visually-Assisted Performance Evaluation of Metamodels

Ploshchik, Ilya, Chatzimparmpas, Angelos, Kerren, Andreas

arXiv.org Artificial IntelligenceJan-28-2023

Stacking (or stacked generalization) is an ensemble learning method with one main distinctiveness from the rest: even though several base models are trained on the original data set, their predictions are further used as input data for one or more metamodels arranged in at least one extra layer. Composing a stack of models can produce high-performance outcomes, but it usually involves a trial-and-error process. Therefore, our previously developed visual analytics system, StackGenVis, was mainly designed to assist users in choosing a set of top-performing and diverse models by measuring their predictive performance. However, it only employs a single logistic regression metamodel. In this paper, we investigate the impact of alternative metamodels on the performance of stacking ensembles using a novel visualization tool, called MetaStackVis. Our interactive tool helps users to visually explore different singular and pairs of metamodels according to their predictive probabilities and multiple validation metrics, as well as their ability to predict specific problematic data instances. MetaStackVis was evaluated with a usage scenario based on a medical data set and via expert interviews.

artificial intelligence, data mining, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2212.03539

Country:

North America > United States > Wisconsin (0.04)
Europe > Sweden > Östergötland County > Linköping (0.04)
Asia (0.04)

Genre:

Research Report (0.90)
Personal > Interview (0.34)

Industry: Health & Medicine > Therapeutic Area (0.47)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Add feedback

Hard Sample Aware Network for Contrastive Deep Graph Clustering

Liu, Yue, Yang, Xihong, Zhou, Sihang, Liu, Xinwang, Wang, Zhen, Liang, Ke, Tu, Wenxuan, Li, Liang, Duan, Jingcan, Chen, Cancan

arXiv.org Artificial IntelligenceJan-28-2023

Contrastive deep graph clustering, which aims to divide nodes into disjoint groups via contrastive mechanisms, is a challenging research spot. Among the recent works, hard sample mining-based algorithms have achieved great attention for their promising performance. However, we find that the existing hard sample mining methods have two problems as follows. 1) In the hardness measurement, the important structural information is overlooked for similarity calculation, degrading the representativeness of the selected hard negative samples. 2) Previous works merely focus on the hard negative sample pairs while neglecting the hard positive sample pairs. Nevertheless, samples within the same cluster but with low similarity should also be carefully learned. To solve the problems, we propose a novel contrastive deep graph clustering method dubbed Hard Sample Aware Network (HSAN) by introducing a comprehensive similarity measure criterion and a general dynamic sample weighing strategy. Concretely, in our algorithm, the similarities between samples are calculated by considering both the attribute embeddings and the structure embeddings, better revealing sample relationships and assisting hardness measurement. Moreover, under the guidance of the carefully collected high-confidence clustering information, our proposed weight modulating function will first recognize the positive and negative samples and then dynamically up-weight the hard sample pairs while down-weighting the easy ones. In this way, our method can mine not only the hard negative samples but also the hard positive sample, thus improving the discriminative capability of the samples further. Extensive experiments and analyses demonstrate the superiority and effectiveness of our proposed method.

artificial intelligence, machine learning, sample pair, (18 more...)

arXiv.org Artificial Intelligence

2212.08665

Country:

North America > United States (0.14)
South America > Brazil (0.04)
Europe (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.64)

Industry: Transportation (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Neural Gas Network Image Features and Segmentation for Brain Tumor Detection Using Magnetic Resonance Imaging Data

Mousavi, S. Muhammad Hossein

arXiv.org Artificial IntelligenceJan-28-2023

Accurate detection of brain tumors could save lots of lives and increasing the accuracy of this binary classification even as much as a few percent has high importance. Neural Gas Networks (NGN) is a fast, unsupervised algorithm that could be used in data clustering, image pattern recognition, and image segmentation. In this research, we used the metaheuristic Firefly Algorithm (FA) for image contrast enhancement as pre-processing and NGN weights for feature extraction and segmentation of Magnetic Resonance Imaging (MRI) data on two brain tumor datasets from the Kaggle platform. Also, tumor classification is conducted by Support Vector Machine (SVM) classification algorithms and compared with a deep learning technique plus other features in train and test phases. Additionally, NGN tumor segmentation is evaluated by famous performance metrics such as Accuracy, F-measure, Jaccard, and more versus ground truth data and compared with traditional segmentation techniques. The proposed method is fast and precise in both tasks of tumor classification and segmentation compared with other methods. A classification accuracy of 95.14 % and segmentation accuracy of 0.977 is achieved by the proposed method.

artificial intelligence, machine learning, segmentation, (13 more...)

arXiv.org Artificial Intelligence

2301.12176

Country:

North America > United States > New York (0.04)
Asia > Singapore (0.04)
Asia > Middle East > Iran > Tehran Province > Tehran (0.04)

Genre: Research Report (0.40)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (0.61)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

Add feedback

Hierarchical clustering: visualization, feature importance and model selection

Cabezas, Luben M. C., Izbicki, Rafael, Stern, Rafael B.

arXiv.org Artificial IntelligenceJan-27-2023

We propose methods for the analysis of hierarchical clustering that fully use the multi-resolution structure provided by a dendrogram. Specifically, we propose a loss for choosing between clustering methods, a feature importance score and a graphical tool for visualizing the segmentation of features in a dendrogram. Current approaches to these tasks lead to loss of information since they require the user to generate a single partition of the instances by cutting the dendrogram at a specified level. Our proposed methods, instead, use the full structure of the dendrogram. The key insight behind the proposed methods is to view a dendrogram as a phylogeny. This analogy permits the assignment of a feature value to each internal node of a tree through an evolutionary model. Real and simulated datasets provide evidence that our proposed framework has desirable outcomes and gives more insights than state-of-art approaches. We provide an R package that implements our methods.

artificial intelligence, dendrogram, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2112.01372

Country:

North America > United States > Georgia (0.14)
North America > United States > Utah (0.05)
North America > United States > Louisiana (0.05)
(40 more...)

Genre: Research Report (0.64)

Industry: Health & Medicine (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

A Closer Look at Novel Class Discovery from the Labeled Set

Li, Ziyun, Otholt, Jona, Dai, Ben, hu, Di, Meinel, Christoph, Yang, Haojin

arXiv.org Artificial IntelligenceJan-27-2023

Novel class discovery (NCD) aims to infer novel categories in an unlabeled dataset leveraging prior knowledge of a labeled set comprising disjoint but related classes. Existing research focuses primarily on utilizing the labeled set at the methodological level, with less emphasis on the analysis of the labeled set itself. Thus, in this paper, we rethink novel class discovery from the labeled set and focus on two core questions: (i) Given a specific unlabeled set, what kind of labeled set can best support novel class discovery? (ii) A fundamental premise of NCD is that the labeled set must be related to the unlabeled set, but how can we measure this relation? For (i), we propose and substantiate the hypothesis that NCD could benefit more from a labeled set with a large degree of semantic similarity to the unlabeled set. Specifically, we establish an extensive and large-scale benchmark with varying degrees of semantic similarity between labeled/unlabeled datasets on ImageNet by leveraging its hierarchical class structure. As a sharp contrast, the existing NCD benchmarks are developed based on labeled sets with different number of categories and images, and completely ignore the semantic relation. For (ii), we introduce a mathematical definition for quantifying the semantic similarity between labeled and unlabeled sets. In addition, we use this metric to confirm the validity of our proposed benchmark and demonstrate that it highly correlates with NCD performance. Furthermore, without quantitative analysis, previous works commonly believe that label information is always beneficial. However, counterintuitively, our experimental results show that using labels may lead to sub-optimal outcomes in low-similarity settings.

machine learning, natural language, transfer leakage, (18 more...)

arXiv.org Artificial Intelligence

2209.0912

Country:

North America > United States > California > Alameda County > Oakland (0.04)
Asia > China > Hong Kong (0.04)
Africa > Madagascar (0.04)

Genre: Research Report > New Finding (0.66)

Industry:

Transportation > Ground > Road (1.00)
Consumer Products & Services > Food, Beverage, Tobacco & Cannabis (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Cluster-Based Control of Transition-Independent MDPs

Fiscko, Carmel, Kar, Soummya, Sinopoli, Bruno

arXiv.org Artificial IntelligenceJan-26-2023

This work studies efficient solution methods for cluster-based control policies of transition-independent Markov decision processes (TI-MDPs). We focus on control of multi-agent systems, whereby a central planner (CP) influences agents to select desirable group behavior. The agents are partitioned into disjoint clusters whereby agents in the same cluster receive the same controls but agents in different clusters may receive different controls. Under mild assumptions, this process can be modeled as a TI-MDP where each factor describes the behavior of one cluster. The action space of the TI-MDP becomes exponential with respect to the number of clusters. To efficiently find a policy in this rapidly scaling space, we propose a clustered Bellman operator that optimizes over the action space for one cluster at any evaluation. We present Clustered Value Iteration (CVI), which uses this operator to iteratively perform "round robin" optimization across the clusters. CVI converges exponentially faster than standard value iteration (VI), and can find policies that closely approximate the MDP's true optimal value. A special class of TI-MDPs with separable reward functions are investigated, and it is shown that CVI will find optimal policies on this class of problems. Finally, the optimal clustering assignment problem is explored. The value functions TI-MDPs with submodular reward functions are shown to be submodular functions, so submodular set optimization may be used to find a near optimal clustering assignment. We propose an iterative greedy cluster splitting algorithm, which yields monotonic submodular improvement in value at each iteration. Finally, simulations offer empirical assessment of the proposed methods.

agent, artificial intelligence, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2207.05224

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
North America > United States > Missouri > St. Louis County > St. Louis (0.04)
North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.48)

Add feedback

AlignGraph: A Group of Generative Models for Graphs

Shayestehfard, Kimia, Brooks, Dana, Ioannidis, Stratis

arXiv.org Artificial IntelligenceJan-26-2023

This is a problem because state-of-the-art generative models, like the ones listed above, rely on latent It is challenging for generative models to learn a distribution node embeddings. Such embeddings vary drastically over graphs because of the lack of permutation even under nearly isomorphic graphs [13]. In turn, this invariance: nodes may be ordered arbitrarily across can hamper the fidelity of the graph generation process graphs, and standard graph alignment is combinatorial significantly. Note that this is a much harder setting and notoriously expensive. We propose AlignGraph, a than, e.g., images or text, where inputs have a canonical group of generative models that combine fast and efficient orientation. Finding the correspondence between graph graph alignment methods with a family of deep nodes is a notoriously hard problem [14, 15, 11, 16], and generative models that are invariant to node permutations.

data mining, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2301.11273

Country:

North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Hungary > Hajdú-Bihar County > Debrecen (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Generation (1.00)
Information Technology > Data Science > Data Mining (0.93)
(2 more...)

Add feedback

Cross Modal Global Local Representation Learning from Radiology Reports and X-Ray Chest Images

Hadjiyski, Nathan, Vosoughi, Ali, Wismueller, Axel

arXiv.org Artificial IntelligenceJan-26-2023

Deep learning models can be applied successfully in real-work problems; however, training most of these models requires massive data. Recent methods use language and vision, but unfortunately, they rely on datasets that are not usually publicly available. Here we pave the way for further research in the multimodal language-vision domain for radiology. In this paper, we train a representation learning method that uses local and global representations of the language and vision through an attention mechanism and based on the publicly available Indiana University Radiology Report (IU-RR) dataset. Furthermore, we use the learned representations to diagnose five lung pathologies: atelectasis, cardiomegaly, edema, pleural effusion, and consolidation. Finally, we use both supervised and zero-shot classifications to extensively analyze the performance of the representation learning on the IU-RR dataset. Average Area Under the Curve (AUC) is used to evaluate the accuracy of the classifiers for classifying the five lung pathologies. The average AUC for classifying the five lung pathologies on the IU-RR test set ranged from 0.85 to 0.87 using the different training datasets, namely CheXpert and CheXphoto. These results compare favorably to other studies using UI-RR. Extensive experiments confirm consistent results for classifying lung pathologies using the multimodal global local representations of language and vision information.

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2301.10951

Country:

North America > United States > Indiana (0.25)
North America > United States > New York > Monroe County > Rochester (0.05)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Europe > Germany > North Rhine-Westphalia > Upper Bavaria > Munich (0.04)

Genre: Research Report > Experimental Study (0.47)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Add feedback

Re-embedding data to strengthen recovery guarantees of clustering

Jiang, Tao, Tan, Samuel, Vavasis, Stephen

arXiv.org Artificial IntelligenceJan-25-2023

We propose a clustering method that involves chaining four known techniques into a pipeline yielding an algorithm with stronger recovery guarantees than any of the four components separately. Given $n$ points in $\mathbb R^d$, the first component of our pipeline, which we call leapfrog distances, is reminiscent of density-based clustering, yielding an $n\times n$ distance matrix. The leapfrog distances are then translated to new embeddings using multidimensional scaling and spectral methods, two other known techniques, yielding new embeddings of the $n$ points in $\mathbb R^{d'}$, where $d'$ satisfies $d'\ll d$ in general. Finally, sum-of-norms (SON) clustering is applied to the re-embedded points. Although the fourth step (SON clustering) can in principle be replaced by any other clustering method, our focus is on provable guarantees of recovery of underlying structure. Therefore, we establish that the re-embedding improves recovery SON clustering, since SON clustering is a well-studied method that already has provable guarantees.

artificial intelligence, leapfrog distance, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2301.10901

Country:

North America > United States > New York > Tompkins County > Ithaca (0.04)
North America > Canada > Ontario > Waterloo Region > Waterloo (0.04)
Asia > China > Jiangsu Province > Yancheng (0.04)

Genre: Research Report > New Finding (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Improving Spectral Clustering Using Spectrum-Preserving Node Aggregation

Wang, Yongyu

arXiv.org Artificial IntelligenceJan-23-2023

Spectral clustering is one of the most popular clustering methods. However, the high computational cost due to the involved eigen-decomposition procedure can immediately hinder its applications in large-scale tasks. In this paper we use spectrum-preserving node reduction to accelerate eigen-decomposition and generate concise representations of data sets. Specifically, we create a small number of pseudonodes based on spectral similarity. Then, standard spectral clustering algorithm is performed on the smaller node set. Finally, each data point in the original data set is assigned to the cluster as its representative pseudo-node. The proposed framework run in nearly-linear time. Meanwhile, the clustering accuracy can be significantly improved by mining concise representations. The experimental results show dramatically improved clustering performance when compared with state-of-the-art methods.

artificial intelligence, machine learning, spectral, (16 more...)

arXiv.org Artificial Intelligence

2110.12328

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
Asia > Middle East > Jordan (0.05)
North America > United States > Michigan (0.04)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback