AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Manifold-based Incomplete Multi-view Clustering via Bi-Consistency Guidance

Wang, Huibing, Yao, Mingze, Chen, Yawei, Xu, Yunqiu, Liu, Haipeng, Jia, Wei, Fu, Xianping, Wang, Yang

arXiv.org Artificial IntelligenceMay-16-2024

Incomplete multi-view clustering primarily focuses on dividing unlabeled data into corresponding categories with missing instances, and has received intensive attention due to its superiority in real applications. Considering the influence of incomplete data, the existing methods mostly attempt to recover data by adding extra terms. However, for the unsupervised methods, a simple recovery strategy will cause errors and outlying value accumulations, which will affect the performance of the methods. Broadly, the previous methods have not taken the effectiveness of recovered instances into consideration, or cannot flexibly balance the discrepancies between recovered data and original data. To address these problems, we propose a novel method termed Manifold-based Incomplete Multi-view clustering via Bi-consistency guidance (MIMB), which flexibly recovers incomplete data among various views, and attempts to achieve biconsistency guidance via reverse regularization. In particular, MIMB adds reconstruction terms to representation learning by recovering missing instances, which dynamically examines the latent consensus representation. Moreover, to preserve the consistency information among multiple views, MIMB implements a biconsistency guidance strategy with reverse regularization of the consensus representation and proposes a manifold embedding measure for exploring the hidden structure of the recovered data. Notably, MIMB aims to balance the importance of different views, and introduces an adaptive weight term for each view. Finally, an optimization algorithm with an alternating iteration optimization strategy is designed for final clustering. Extensive experimental results on 6 benchmark datasets are provided to confirm that MIMB can significantly obtain superior results as compared with several state-of-the-art baselines.

consensus representation, information, representation, (16 more...)

arXiv.org Artificial Intelligence

2405.10987

Country:

Oceania > Australia > New South Wales > Sydney (0.14)
Asia > China > Liaoning Province > Dalian (0.05)
Asia > China > Anhui Province > Hefei (0.05)
(6 more...)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
(2 more...)

Add feedback

A Polynomial-Time Approximation for Pairwise Fair $k$-Median Clustering

Bandyapadhyay, Sayan, Chlamtáč, Eden, Makarychev, Yury, Vakilian, Ali

arXiv.org Artificial IntelligenceMay-16-2024

Clustering is a fundamental task in theoretical computer science and machine learning aimed at dividing a set of data items into several groups or clusters, such that each group contains similar data items. Typically, the similarity between data items is measured using a metric distance function. Clustering is often modeled as an optimization problem where the objective is to minimize a global cost function that reflects the quality of the clusters; this function varies depending on the application. Among the many cost functions studied for clustering, the most popular are k-median, k-means, and k-center. These objectives generally aim to minimize the variance within the clusters, serving as a proxy for grouping similar data items In this work, we study clustering problems with fairness constraints, commonly known as fair clustering problems. Fair clustering emerged as one of the most active research areas in algorithms motivated by the recent trend of research on fairness in artificial intelligence. In a seminal work, Chierichetti et al. [18] introduced a fair clustering problem, where given a set R of red points, a set B of blue points, and an integer balance parameter t 1, a clustering is said to be balanced if, in every cluster, the number of red points is at least 1/t times the number of blue points and at most t times the number of blue points.

algorithm, flow transfer, fractionally, (12 more...)

arXiv.org Artificial Intelligence

2405.10378

Country: North America > United States > Illinois > Cook County > Chicago (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.74)

Add feedback

Rectified Gaussian kernel multi-view k-means clustering

Sinaga, Kristina P.

arXiv.org Artificial IntelligenceMay-16-2024

In this paper, we show two new variants of multi-view k-means (MVKM) algorithms to address multi-view data. The general idea is to outline the distance between $h$-th view data points $x_i^h$ and $h$-th view cluster centers $a_k^h$ in a different manner of centroid-based approach. Unlike other methods, our proposed methods learn the multi-view data by calculating the similarity using Euclidean norm in the space of Gaussian-kernel, namely as multi-view k-means with exponent distance (MVKM-ED). By simultaneously aligning the stabilizer parameter $p$ and kernel coefficients $\beta^h$, the compression of Gaussian-kernel based weighted distance in Euclidean norm reduce the sensitivity of MVKM-ED. To this end, this paper designated as Gaussian-kernel multi-view k-means (GKMVKM) clustering algorithm. Numerical evaluation of five real-world multi-view data demonstrates the robustness and efficiency of our proposed MVKM-ED and GKMVKM approaches.

algorithm, aussian kernel multi, mvkm-ed, (15 more...)

arXiv.org Artificial Intelligence

2405.05619

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(3 more...)

Genre: Research Report (0.82)

Industry: Health & Medicine (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Generation of Granular-Balls for Clustering Based on the Principle of Justifiable Granularity

Jia, Zihang, Zhang, Zhen, Pedrycz, Witold

arXiv.org Artificial IntelligenceMay-15-2024

Efficient and robust data clustering remains a challenging task in the field of data analysis. Recent efforts have explored the integration of granular-ball (GB) computing with clustering algorithms to address this challenge, yielding promising results. However, existing methods for generating GBs often rely on single indicators to measure GB quality and employ threshold-based or greedy strategies, potentially leading to GBs that do not accurately capture the underlying data distribution. To address these limitations, this article introduces a novel GB generation method. The originality of this method lies in leveraging the principle of justifiable granularity to measure the quality of a GB for clustering tasks. To be precise, we define the coverage and specificity of a GB and introduce a comprehensive measure for assessing GB quality. Utilizing this quality measure, the method incorporates a binary tree pruning-based strategy and an anomaly detection method to determine the best combination of sub-GBs for each GB and identify abnormal GBs, respectively. Compared to previous GB generation methods, the new method maximizes the overall quality of generated GBs while ensuring alignment with the data distribution, thereby enhancing the rationality of the generated GBs. Experimental results obtained from both synthetic and publicly available datasets underscore the effectiveness of the proposed GB generation method, showcasing improvements in clustering accuracy and normalized mutual information.

algorithm, dataset, information granule, (12 more...)

arXiv.org Artificial Intelligence

2405.06904

Country:

Asia > China > Liaoning Province > Dalian (0.04)
North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.04)
Europe > Poland > Masovia Province > Warsaw (0.04)
(3 more...)

Genre: Research Report (0.82)

Industry: Education (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.93)

Add feedback

A Survey of Generative Techniques for Spatial-Temporal Data Mining

Zhang, Qianru, Wang, Haixin, Long, Cheng, Su, Liangcai, He, Xingwei, Chang, Jianlong, Wu, Tailin, Yin, Hongzhi, Yiu, Siu-Ming, Tian, Qi, Jensen, Christian S.

arXiv.org Artificial IntelligenceMay-15-2024

This paper focuses on the integration of generative techniques into spatial-temporal data mining, considering the significant growth and diverse nature of spatial-temporal data. With the advancements in RNNs, CNNs, and other non-generative techniques, researchers have explored their application in capturing temporal and spatial dependencies within spatial-temporal data. However, the emergence of generative techniques such as LLMs, SSL, Seq2Seq and diffusion models has opened up new possibilities for enhancing spatial-temporal data mining further. The paper provides a comprehensive analysis of generative technique-based spatial-temporal methods and introduces a standardized framework specifically designed for the spatial-temporal data mining pipeline. By offering a detailed review and a novel taxonomy of spatial-temporal methodology utilizing generative techniques, the paper enables a deeper understanding of the various techniques employed in this field. Furthermore, the paper highlights promising future research directions, urging researchers to delve deeper into spatial-temporal data mining. It emphasizes the need to explore untapped opportunities and push the boundaries of knowledge to unlock new insights and improve the effectiveness and efficiency of spatial-temporal data mining. By integrating generative techniques and providing a standardized framework, the paper contributes to advancing the field and encourages researchers to explore the vast potential of generative techniques in spatial-temporal data mining.

forecasting, learning, representation, (14 more...)

arXiv.org Artificial Intelligence

2405.09592

Country:

Asia > China > Beijing > Beijing (0.04)
Asia > Singapore (0.04)
Asia > China > Sichuan Province > Chengdu (0.04)
(17 more...)

Genre:

Overview (1.00)
Research Report > Promising Solution (0.46)

Industry:

Health & Medicine (1.00)
Energy (0.67)
Transportation > Ground > Road (0.46)
(2 more...)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)

Add feedback

Dynamical systems and complex networks: A Koopman operator perspective

Klus, Stefan, Conrad, Nataša Djurdjevac

arXiv.org Machine LearningMay-14-2024

This perspective article is meant to be a self-contained introduction to and review of transfer operators such as the Koopman operator and the Perron-Frobenius operator as well as an overview of different applications. We will first introduce the required foundations and then show how transfer operators can not only be used to analyze highly nonlinear dynamical systems but also complex networks. In particular, we will focus on relationships between transfer operators for continuous-time stochastic processes defined on a continuous state space--whether they be reversible, non-reversible but time-homogeneous, or time-inhomogeneous--and their discrete counterparts associated with random walks on undirected, directed, and time-evolving graphs. Transfer operators play an important role in an increasing number of research fields. A few exemplary applications are illustrated in Figure 1.

artificial intelligence, machine learning, operator, (16 more...)

arXiv.org Machine Learning

2405.0894

Country:

Europe > Germany > Berlin (0.04)
North America > United States > New York (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report (0.50)
Instructional Material > Course Syllabus & Notes (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)

Add feedback

Sum-of-norms clustering does not separate nearby balls

Dunlap, Alexander, Mourrat, Jean-Christophe

arXiv.org Artificial IntelligenceMay-14-2024

Sum-of-norms clustering is a popular convexification of K-means clustering. We show that, if the dataset is made of a large number of independent random variables distributed according to the uniform measure on the union of two disjoint balls of unit radius, and if the balls are sufficiently close to one another, then sum-of-norms clustering will typically fail to recover the decomposition of the dataset into two clusters. As the dimension tends to infinity, this happens even when the distance between the centers of the two balls is taken to be as large as 2 2. In order to show this, we introduce and analyze a continuous version of sum-of-norms clustering, where the dataset is replaced by a general measure. In particular, we state and prove a local-global characterization of the clustering that seems to be new even in the case of discrete datapoints.

probability, sgn, theorem 1, (15 more...)

arXiv.org Artificial Intelligence

2104.13753

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
(11 more...)

Genre: Research Report (0.63)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.88)

Add feedback

DGCformer: Deep Graph Clustering Transformer for Multivariate Time Series Forecasting

Liu, Qinshuo, Fang, Yanwen, Jiang, Pengtao, Li, Guodong

arXiv.org Artificial IntelligenceMay-14-2024

Multivariate time series forecasting tasks are usually conducted in a channel-dependent (CD) way since it can incorporate more variable-relevant information. However, it may also involve a lot of irrelevant variables, and this even leads to worse performance than the channel-independent (CI) strategy. This paper combines the strengths of both strategies and proposes the Deep Graph Clustering Transformer (DGCformer) for multivariate time series forecasting. Specifically, it first groups these relevant variables by a graph convolutional network integrated with an autoencoder, and a former-latter masked self-attention mechanism is then considered with the CD strategy being applied to each group of variables while the CI one for different groups. Extensive experimental results on eight datasets demonstrate the superiority of our method against state-of-the-art models, and our code will be publicly available upon acceptance.

dataset, forecasting, module, (13 more...)

arXiv.org Artificial Intelligence

2405.0844

Country:

Asia > China > Hong Kong (0.04)
Oceania > New Zealand (0.04)
North America > Canada (0.04)
(6 more...)

Genre: Research Report > Promising Solution (0.48)

Industry:

Energy (1.00)
Health & Medicine > Epidemiology (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Integrating supervised and unsupervised learning approaches to unveil critical process inputs

Papavasileiou, Paris, Giovanis, Dimitrios G., Pozzetti, Gabriele, Kathrein, Martin, Czettl, Christoph, Kevrekidis, Ioannis G., Boudouvis, Andreas G., Bordas, Stéphane P. A., Koronaki, Eleni D.

arXiv.org Artificial IntelligenceMay-13-2024

This study introduces a machine learning framework tailored to large-scale industrial processes characterized by a plethora of numerical and categorical inputs. The framework aims to (i) discern critical parameters influencing the output and (ii) generate accurate out-of-sample qualitative and quantitative predictions of production outcomes. Specifically, we address the pivotal question of the significance of each input in shaping the process outcome, using an industrial Chemical Vapor Deposition (CVD) process as an example. The initial objective involves merging subject matter expertise and clustering techniques exclusively on the process output, here, coating thickness measurements at various positions in the reactor. This approach identifies groups of production runs that share similar qualitative characteristics, such as film mean thickness and standard deviation. In particular, the differences of the outcomes represented by the different clusters can be attributed to differences in specific inputs, indicating that these inputs are critical for the production outcome. Leveraging this insight, we subsequently implement supervised classification and regression methods using the identified critical process inputs. The proposed methodology proves to be valuable in scenarios with a multitude of inputs and insufficient data for the direct application of deep learning techniques, providing meaningful insights into the underlying processes.

production run, reactor, surface area, (13 more...)

arXiv.org Artificial Intelligence

2405.07751

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > New York > New York County > New York City (0.05)
North America > United States > Maryland > Baltimore (0.04)
(5 more...)

Genre: Research Report > New Finding (0.68)

Industry:

Energy (0.93)
Semiconductors & Electronics (0.88)
Materials > Chemicals (0.88)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

DeepHYDRA: Resource-Efficient Time-Series Anomaly Detection in Dynamically-Configured Systems

Stehle, Franz Kevin, Vandelli, Wainer, Avolio, Giuseppe, Zahn, Felix, Fröning, Holger

arXiv.org Artificial IntelligenceMay-13-2024

Anomaly detection in distributed systems such as High-Performance Computing (HPC) clusters is vital for early fault detection, performance optimisation, security monitoring, reliability in general but also operational insights. Deep Neural Networks have seen successful use in detecting long-term anomalies in multidimensional data, originating for instance from industrial or medical systems, or weather prediction. A downside of such methods is that they require a static input size, or lose data through cropping, sampling, or other dimensionality reduction methods, making deployment on systems with variability on monitored data channels, such as computing clusters difficult. To address these problems, we present DeepHYDRA (Deep Hybrid DBSCAN/Reduction-Based Anomaly Detection) which combines DBSCAN and learning-based anomaly detection. DBSCAN clustering is used to find point anomalies in time-series data, mitigating the risk of missing outliers through loss of information when reducing input data to a fixed number of channels. A deep learning-based time-series anomaly detection method is then applied to the reduced data in order to identify long-term outliers. This hybrid approach reduces the chances of missing anomalies that might be made indistinguishable from normal data by the reduction process, and likewise enables the algorithm to be scalable and tolerate partial system failures while retaining its detection capabilities. Using a subset of the well-known SMD dataset family, a modified variant of the Eclipse dataset, as well as an in-house dataset with a large variability in active data channels, made publicly available with this work, we furthermore analyse computational intensity, memory footprint, and activation counts. DeepHYDRA is shown to reliably detect different types of anomalies in both large and complex datasets.

anomaly, dataset, detection, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3650200.3656637

2405.07749

Country:

Europe > Germany (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry: Government (0.46)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback