AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

Feature-based Image Matching for Identifying Individual Kākā

O'Sullivan, Fintan, Escott, Kirita-Rose, Shaw, Rachael C., Lensen, Andrew

arXiv.org Artificial Intelligence, Jan-23-2023

This report investigates an unsupervised, feature-based image matching pipeline for the novel application of identifying individual kākā. Applied with a similarity network for clustering, this addresses a weakness of current supervised approaches to identifying individual birds, which struggle to handle the introduction of new individuals to the population. Our approach uses object localisation to locate kākā within images and then extracts local features that are invariant to rotation and scale. These features are matched between images with nearest neighbour matching techniques and mismatch removal to produce a similarity score for image match comparison. The results show that matches obtained via the image matching pipeline achieve high accuracy of true matches. We conclude that feature-based image matching could be used with a similarity network to provide a viable alternative to existing supervised approaches.
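As a rough illustration of the matching step described in this abstract, the sketch below scores two images by nearest-neighbour matching of local descriptors. It assumes a ratio test as the mismatch-removal step and uses tiny hand-made descriptor tuples; the function names, the `ratio` parameter, and the scoring formula are all hypothetical and are not taken from the report.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Return (i, j) pairs of descriptors that pass a nearest-neighbour ratio test.

    The ratio test is an assumed stand-in for mismatch removal.
    """
    matches = []
    if len(desc_b) < 2:
        return matches  # the ratio test needs a runner-up candidate
    for i, d in enumerate(desc_a):
        # Rank the descriptors of image B by distance to descriptor d.
        ranked = sorted(range(len(desc_b)), key=lambda j: euclidean(d, desc_b[j]))
        best, second = ranked[0], ranked[1]
        # Keep the match only if the best candidate is clearly closer
        # than the second-best one; ambiguous matches are dropped.
        if euclidean(d, desc_b[best]) < ratio * euclidean(d, desc_b[second]):
            matches.append((i, best))
    return matches


def similarity_score(desc_a, desc_b, ratio=0.75):
    """Fraction of image A's descriptors that found an unambiguous match in B."""
    if not desc_a:
        return 0.0
    return len(match_descriptors(desc_a, desc_b, ratio)) / len(desc_a)
```

Pairwise scores of this kind could then serve as edge weights in a similarity network, which is the role the abstract assigns to the similarity score.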

data mining, machine learning, pattern recognition, (22 more...)

arXiv.org Artificial Intelligence

2301.06678

Country:

Oceania > New Zealand > North Island > Hawke's Bay (0.04)
North America > Canada > British Columbia (0.04)
Africa > Zimbabwe (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Vision > Image Understanding (1.00)
(5 more...)

Victoria Amazonica Optimization (VAO): An Algorithm Inspired by the Giant Water Lily Plant

Mousavi, Seyed Muhammad Hossein

arXiv.org Artificial Intelligence, Jan-22-2023

The Victoria Amazonica plant, often known as the Giant Water Lily, has the largest floating circular leaf in the world, with a maximum leaf diameter of 3 meters. It spreads its leaves by the force of its spines and creates a large shadow underneath, killing any plants that require sunlight. These water tyrants use their formidable spines to compel each other to the surface and increase their strength to grab more space from the surface. As they spread throughout the pond or basin, with the earliest-growing leaves having more room to grow, each leaf gains a unique size. Its flowers are transsexual and when they bloom, Cyclocephala beetles are responsible for the pollination process, being attracted to the scent of the female flower. After entering the flower, the beetle becomes covered with pollen and transfers it to another flower for fertilization. After the beetle leaves, the flower turns into a male and changes color from white to pink. The male flower dies and sinks into the water, releasing its seed to help create a new generation. In this paper, the mathematical life cycle of this magnificent plant is introduced, and each leaf and blossom are treated as a single entity. The proposed bio-inspired algorithm is tested with 24 benchmark optimization test functions, such as Ackley, and compared to ten other famous algorithms, including the Genetic Algorithm. The proposed algorithm is also tested on 10 optimization problems (Minimum Spanning Tree, Hub Location Allocation, Quadratic Assignment, Clustering, Feature Selection, Regression, Economic Dispatching, Parallel Machine Scheduling, Color Quantization, and Image Segmentation) and compared to traditional and bio-inspired algorithms. Overall, the performance of the algorithm in all tasks is satisfactory.

artificial intelligence, evolutionary algorithm, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2303.0807

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Asia > Singapore (0.04)
South America > Venezuela (0.04)
(11 more...)

Genre: Research Report (1.00)

Industry:

Energy > Power Industry (0.68)
Transportation (0.67)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
(3 more...)

A Semantic Modular Framework for Events Topic Modeling in Social Media

Moghaddam, Arya Hadizadeh, Momtazi, Saeedeh

arXiv.org Artificial Intelligence, Jan-21-2023

The growth of social media has led to an ever-increasing amount of frequently shared content. These platforms give people a place to report various real-life events. Detecting these events with the help of natural language processing has received researchers' attention, and various algorithms have been developed for this goal. In this paper, we propose a Semantic Modular Model (SMM) consisting of 5 different modules, namely Distributional Denoising Autoencoder, Incremental Clustering, Semantic Denoising, Defragmentation, and Ranking and Processing. The proposed model aims to (1) cluster various documents and ignore the documents that might not contribute to the identification of events, and (2) identify more important and descriptive keywords. Compared to the state-of-the-art methods, the results show that the proposed model has a higher performance in identifying events with lower ranks and extracting keywords for more important events in three English Twitter datasets: FACup, SuperTuesday, and USElection. The proposed method outperformed the best reported results in the mean keyword-precision metric by 7.9%.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2301.09009

Country:

North America > United States (0.46)
Asia > Middle East > Jordan (0.04)
Asia > Middle East > Iran > Tehran Province > Tehran (0.04)

Genre:

Research Report > Promising Solution (0.48)
Research Report > New Finding (0.48)

Industry:

Government (0.68)
Media > News (0.34)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Auto-weighted Multi-view Clustering for Large-scale Data

Wan, Xinhang, Liu, Xinwang, Liu, Jiyuan, Wang, Siwei, Wen, Yi, Liang, Weixuan, Zhu, En, Liu, Zhe, Zhou, Lu

arXiv.org Artificial Intelligence, Jan-20-2023

Multi-view clustering has gained broad attention owing to its capacity to exploit complementary information across multiple data views. Although existing methods demonstrate strong clustering performance, most of them have high time complexity and cannot handle large-scale data. Matrix factorization-based models are a representative approach to this problem. However, they assume that the views share a dimension-fixed consensus coefficient matrix and view-specific base matrices, limiting their representability. Moreover, a series of large-scale algorithms that bear one or more hyperparameters are impractical in real-world applications. To address the two issues, we propose an auto-weighted multi-view clustering (AWMVC) algorithm. Specifically, AWMVC first learns coefficient matrices from corresponding base matrices of different dimensions, then fuses them to obtain an optimal consensus matrix. By mapping original features into distinctive low-dimensional spaces, we can attain more comprehensive knowledge, thus obtaining better clustering results. Moreover, we design a six-step alternating optimization algorithm that is theoretically proven to converge. AWMVC also shows excellent performance on various benchmark datasets compared with existing methods. The code of AWMVC is publicly available at https://github.com/wanxinhang/AAAI-2023-AWMVC.

artificial intelligence, clustering, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2303.01983

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)
Asia > China > Hunan Province > Changsha (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Data Science (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.66)

Clustering Human Mobility with Multiple Spaces

Hu, Haoji, Lin, Haowen, Chiang, Yao-Yi

arXiv.org Artificial Intelligence, Jan-20-2023

Human mobility clustering is an important problem for understanding human mobility behaviors (e.g., work and school commutes). Existing methods typically contain two steps: choosing or learning a mobility representation and applying a clustering algorithm to the representation. However, these methods rely on strict visiting orders in trajectories and cannot take advantage of multiple types of mobility representations. This paper proposes a novel mobility clustering method for mobility behavior detection. First, the proposed method contains a permutation-equivalent operation to handle sub-trajectories that might have different visiting orders but similar impacts on mobility behaviors. Second, the proposed method utilizes a variational autoencoder architecture to simultaneously perform clustering in both latent and original spaces. Also, in order to handle the bias of a single latent space, our clustering assignment prediction considers multiple learned latent spaces at different epochs. This way, the proposed method produces accurate results and can provide reliability estimates of each trajectory's cluster assignment. The experiment shows that the proposed method outperformed state-of-the-art methods in mobility behavior detection from trajectories with better accuracy and more interpretability.

artificial intelligence, data mining, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2301.08524

Country:

North America > United States > California (0.14)
North America > United States > Minnesota (0.04)

Genre: Research Report > Promising Solution (0.34)

Industry:

Health & Medicine (0.46)
Education (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Identification, explanation and clinical evaluation of hospital patient subtypes

Werner, Enrico, Clark, Jeffrey N., Bhamber, Ranjeet S., Ambler, Michael, Bourdeaux, Christopher P., Hepburn, Alexander, McWilliams, Christopher J., Santos-Rodriguez, Raul

arXiv.org Artificial Intelligence, Jan-19-2023

Patients admitted to hospital constitute a heterogeneous population with different levels of illness severity, morbidities, response to treatments and outcomes [9]. Therefore, predicting the right treatment is challenging even when patients are initially diagnosed with the same conditions. For diagnosis and determining treatment options, physicians rely on factors including the patient's medical history [6], their own clinical experience and their professional intuition [9]. Advances in computing technologies and the introduction of electronic health records (EHR) mean that more information is available to physicians than ever before. However, hospitals are still in the process of transitioning from paper records to EHR, which leads to challenges when analyzing the data and inferring high-level information [6]. As intensive care units (ICUs) are the most data-rich hospital department, machine learning approaches have mostly focused on these environments [27, 9, 3, 19]. Recent progress has also been made for general wards [8, 21, 15, 10]. Outcome prediction and risk scoring are of high clinical importance. Several risk scoring methods have been developed and deployed, e.g.

artificial intelligence, consciousness, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2301.08019

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > United Kingdom > England > Bristol (0.05)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
(5 more...)

Genre: Research Report > Experimental Study (0.46)

Industry:

Health & Medicine > Therapeutic Area > Pulmonary/Respiratory Diseases (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Health Care Providers & Services (1.00)
(2 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

ClusterLog: Clustering Logs for Effective Log-based Anomaly Detection

Egersdoerfer, Chris, Dai, Dong, Zhang, Di

arXiv.org Artificial Intelligence, Jan-18-2023

With the increasing prevalence of scalable file systems in the context of High Performance Computing (HPC), the importance of accurate anomaly detection on runtime logs is increasing. However, many state-of-the-art methods for log-based anomaly detection, such as DeepLog, have encountered numerous challenges when applied to logs from many parallel file systems (PFSes), often due to their irregularity and ambiguity in time-based log sequences. To circumvent these problems, this study proposes ClusterLog, a log pre-processing method that clusters the temporal sequence of log keys based on their semantic similarity. By grouping semantically and sentimentally similar logs, this approach aims to represent log sequences with the smallest number of unique log keys, intending to improve the ability of a downstream sequence-based model to effectively learn the log patterns. The preliminary results of ClusterLog indicate not only its effectiveness in reducing the granularity of log sequences without the loss of important sequence information but also its generalizability to different file systems' logs.
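The pre-processing idea in this abstract (group similar log keys, then rewrite the sequence with fewer unique symbols) can be sketched as follows. This is only an illustration under stated assumptions: the embedding vectors are made up by hand, the greedy threshold clustering is a simple stand-in, and none of the names below come from ClusterLog itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_keys(embeddings, threshold=0.9):
    """Greedily assign each log key to the first cluster whose representative
    vector is at least `threshold` similar; otherwise open a new cluster.

    `embeddings` maps a log key to a (hypothetical) semantic vector.
    """
    representatives = []  # representatives[cluster_id] is that cluster's first vector
    assignment = {}
    for key, vec in embeddings.items():
        for cluster_id, rep in enumerate(representatives):
            if cosine(vec, rep) >= threshold:
                assignment[key] = cluster_id
                break
        else:
            assignment[key] = len(representatives)
            representatives.append(vec)
    return assignment


def rewrite_sequence(sequence, assignment):
    """Replace every log key in a temporal sequence with its cluster id."""
    return [assignment[key] for key in sequence]
```

A downstream sequence model would then be trained on the rewritten sequence of cluster ids rather than on the raw log keys.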

artificial intelligence, data mining, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2301.07846

Country:

North America > United States > Missouri > Jackson County > Kansas City (0.14)
North America > United States > North Carolina > Mecklenburg County > Charlotte (0.04)
Asia > Middle East > Jordan (0.04)
(7 more...)

Genre: Research Report (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

No-substitution k-means Clustering with Adversarial Order

Bhattacharjee, Robi, Moshkovitz, Michal

arXiv.org Artificial Intelligence, Jan-18-2023

We investigate $k$-means clustering in the online no-substitution setting when the input arrives in arbitrary order. In this setting, points arrive one after another, and the algorithm is required to instantly decide whether to take the current point as a center before observing the next point. Decisions are irrevocable. The goal is to minimize both the number of centers and the $k$-means cost. Previous works in this setting assume that the input's order is random, or that the input's aspect ratio is bounded. It is known that if the order is arbitrary and there is no assumption on the input, then any algorithm must take all points as centers. Moreover, assuming a bounded aspect ratio is too restrictive -- it does not include natural input generated from mixture models. We introduce a new complexity measure that quantifies the difficulty of clustering a dataset arriving in arbitrary order. We design a new random algorithm and prove that if applied on data with complexity $d$, the algorithm takes $O(d\log(n) k\log(k))$ centers and is an $O(k^3)$-approximation. We also prove that if the data is sampled from a "natural" distribution, such as a mixture of $k$ Gaussians, then the new complexity measure is equal to $O(k^2\log(n))$. This implies that for data generated from those distributions, our new algorithm takes only $\text{poly}(k\log(n))$ centers and is a $\text{poly}(k)$-approximation. In terms of negative results, we prove that the number of centers needed to achieve an $\alpha$-approximation is at least $\Omega\left(\frac{d}{k\log(n\alpha)}\right)$.
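To make the no-substitution setting concrete, the sketch below processes a stream one point at a time and decides irrevocably whether each point becomes a center. The deterministic distance-threshold rule is an assumption chosen for simplicity; it is not the randomized algorithm analysed in the paper, and the function names and the `threshold` parameter are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def online_centers(stream, threshold):
    """Pick centers in a single pass; a point, once skipped, can never be chosen.

    Assumed rule: take the point if it lies farther than `threshold`
    (in squared distance) from every center chosen so far.
    """
    centers = []
    for point in stream:
        # The decision uses only points seen so far and is never revisited.
        if not centers or min(sq_dist(point, c) for c in centers) > threshold:
            centers.append(point)
    return centers


def kmeans_cost(points, centers):
    """k-means cost: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)
```

A larger `threshold` yields fewer centers but a higher cost, which mirrors the two quantities the abstract says the algorithm must trade off.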

algorithm, artificial intelligence, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2012.14512

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.04)
North America > United States > California (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Genre: Research Report (0.64)

Industry: Health & Medicine (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.83)

Scalable Deep Graph Clustering with Random-walk based Self-supervised Learning

Li, Xiang, Li, Dong, Jin, Ruoming, Agrawal, Gagan, Ramnath, Rajiv

arXiv.org Artificial Intelligence, Jan-17-2023

Web-based interactions can be frequently represented by an attributed graph, and node clustering in such graphs has received much attention lately. Multiple efforts have successfully applied Graph Convolutional Networks (GCN), though with some limits on accuracy as GCNs have been shown to suffer from over-smoothing issues. Though other methods (particularly those based on Laplacian Smoothing) have reported better accuracy, a fundamental limitation of all the work is a lack of scalability. This paper addresses this open problem by relating the Laplacian smoothing to the Generalized PageRank and applying a random-walk based algorithm as a scalable graph filter. This forms the basis for our scalable deep clustering algorithm, RwSL, where through a self-supervised mini-batch training mechanism, we simultaneously optimize a deep neural network for sample-cluster assignment distribution and an autoencoder for a clustering-oriented embedding. Using 6 real-world datasets and 6 clustering metrics, we show that RwSL achieved improved results over several recent baselines. Most notably, we show that RwSL, unlike all other deep clustering frameworks, can continue to scale beyond graphs with more than one million nodes, i.e., handle web-scale. We also demonstrate how RwSL could perform node clustering on a graph with 1.8 billion edges using only a single GPU.
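The idea of using a random walk as a graph filter can be sketched as a truncated, PageRank-weighted sum of propagated node features. The specific weights $\alpha(1-\alpha)^k$ (personalized PageRank), the adjacency-list representation, and all names below are assumptions made for illustration; this is not the RwSL implementation.

```python
def propagate(adj, features):
    """One random-walk step: each node takes the mean of its neighbours' features.

    `adj` maps a node to its list of neighbours; isolated nodes keep their features.
    """
    out = {}
    for node, neighbours in adj.items():
        if not neighbours:
            out[node] = list(features[node])
            continue
        dim = len(features[node])
        out[node] = [
            sum(features[n][d] for n in neighbours) / len(neighbours)
            for d in range(dim)
        ]
    return out


def ppr_filter(adj, features, alpha=0.15, steps=10):
    """Smooth features with sum_{k=0..steps} alpha * (1 - alpha)^k * P^k X,
    where P is the random-walk transition matrix implied by `adj`."""
    current = {n: list(f) for n, f in features.items()}
    weight = alpha
    smoothed = {n: [weight * x for x in f] for n, f in current.items()}
    for _ in range(steps):
        current = propagate(adj, current)  # P^k X for the next k
        weight *= 1 - alpha
        for n in smoothed:
            smoothed[n] = [s + weight * c for s, c in zip(smoothed[n], current[n])]
    return smoothed
```

Because each step touches every edge once, the cost is linear in the number of edges per step, which is the kind of property that makes such filters attractive for large graphs.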

artificial intelligence, graph, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2112.1553

Country: North America > United States > Ohio (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Pluto's Surface Mapping using Unsupervised Learning from Near-Infrared Observations of LEISA/Ralph

Emran, A., Ore, C. M. Dalle, Ahrens, C. J., Khan, M. K. H., Chevrier, V. F., Cruikshank, D. P.

arXiv.org Artificial Intelligence, Jan-15-2023

We map the surface of Pluto by applying an unsupervised machine learning technique to the near-infrared observations of the LEISA/Ralph instrument onboard NASA's New Horizons spacecraft. The principal component reduced Gaussian mixture model was implemented to investigate the geographic distribution of the surface units across the dwarf planet. We also present the likelihood of each surface unit at the image pixel level. Average I/F spectra of each unit were analyzed -- in terms of the position and strengths of absorption bands of abundant volatiles such as N$_2$, CH$_4$, and CO, and nonvolatile H$_2$O -- to connect the unit to surface composition, geology, and geographic location. The distribution of surface units shows a latitudinal pattern with distinct surface compositions of volatiles -- consistent with the existing literature. However, previous mapping efforts were based primarily on compositional analysis using spectral indices (indicators) or implementation of complex radiative transfer models, which need (prior) expert knowledge, label data, or optical constants of representative endmembers. We prove that an application of unsupervised learning in this instance renders a satisfactory result in mapping the spatial distribution of ice compositions without any prior information or label data. Thus, such an application is specifically advantageous for planetary surface mapping when label data are poorly constrained or completely unknown, because an understanding of surface material distribution is vital for volatile transport modeling at the planetary scale. We emphasize that the unsupervised learning used in this study has wide applicability and can be expanded to other planetary bodies of the Solar System for mapping surface material distribution.
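The per-pixel likelihood of each surface unit mentioned in this abstract corresponds, in a Gaussian mixture model, to the posterior probability of each component given the pixel's value. The sketch below computes it for a one-dimensional mixture whose parameters are already known; the paper works on principal-component-reduced spectra, so the 1-D setting, the parameter values in any usage, and the function names are simplifying assumptions.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def unit_posteriors(x, weights, means, stds):
    """Posterior probability of each mixture component (surface unit) given x.

    Applies Bayes' rule: weight_k * pdf_k(x), normalised over all components.
    Assumes x is close enough to some component that the total is non-zero.
    """
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]
```

Assigning each pixel to the component with the largest posterior gives a hard map of units, while the posteriors themselves give the per-pixel confidence.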

artificial intelligence, machine learning, surface unit, (18 more...)

arXiv.org Artificial Intelligence

2301.06027

Country:

North America > United States > Arkansas (0.28)
North America > United States > California (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Energy > Oil & Gas > Upstream (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.90)
