AITopics

2310.07491

Genre: Research Report (0.40)

Technology:

Information Technology > Modeling & Simulation (0.60)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.60)

Clark, Katharine M., McNicholas, Paul D.

Clustering Three-Way Data with Outliers

arXiv.org Machine Learning, Oct-11-2023

Matrix-variate normal mixture models are a powerful statistical tool used to represent complex data structures that involve matrices, such as multivariate time series, spatial data, and image data. Detecting outliers in matrix-variate normal mixture models is crucial for identifying anomalous observations that deviate significantly from the underlying distribution. Outliers can provide valuable insights into data quality issues, anomalies, or unexpected patterns. Outliers and their treatment are a long-studied topic in the field of applied statistics. The problem of handling outliers in multivariate clustering has been studied in several contexts, including work by García-Escudero et al. (2008), Punzo and McNicholas (2016), Punzo et al. (2020), and Clark and McNicholas (2023).
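For orientation, the standard matrix-variate normal density and the corresponding finite mixture can be written as below. The notation is an assumption made here for illustration and is not taken from the abstract: each observation X is an n × p matrix, M_g is the mean matrix of component g, Σ_g (n × n) and Ψ_g (p × p) are the row and column covariance matrices, and π_g are the mixing proportions over G components.

```latex
f(\mathbf{X} \mid \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})
  = \frac{\exp\left\{-\tfrac{1}{2}\,\mathrm{tr}\left(
      \boldsymbol{\Sigma}^{-1}(\mathbf{X}-\mathbf{M})
      \boldsymbol{\Psi}^{-1}(\mathbf{X}-\mathbf{M})^{\top}\right)\right\}}
    {(2\pi)^{np/2}\,|\boldsymbol{\Sigma}|^{p/2}\,|\boldsymbol{\Psi}|^{n/2}},
\qquad
g(\mathbf{X}) = \sum_{g=1}^{G} \pi_g\,
  f(\mathbf{X} \mid \mathbf{M}_g, \boldsymbol{\Sigma}_g, \boldsymbol{\Psi}_g),
\quad \pi_g > 0,\ \sum_{g=1}^{G} \pi_g = 1.
```

Under this model, an outlier is loosely an observation X that has very low density under every component; how the paper formalizes and detects such observations is not stated in the abstract.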

artificial intelligence, data mining, machine learning, (17 more...)

arXiv.org Machine Learning

2310.05288

Country:

Oceania > New Zealand (0.04)
North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
North America > Canada > Ontario > Hamilton (0.04)
Europe > Italy (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.98)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.94)

Anchor-based Multi-view Subspace Clustering with Hierarchical Feature Descent

Ou, Qiyuan, Wang, Siwei, Zhang, Pei, Zhou, Sihang, Zhu, En

anchor-based multi-view subspace clustering, hierarchical feature descent

Multi-view clustering has attracted growing attention owing to its ability to aggregate information from various sources and its promising prospects in public affairs. Many advanced approaches have been proposed in the recent literature, but several difficulties remain. One common dilemma arises when aligning the features of different views. We uncover and exploit the dependency among views through hierarchical feature descent, which leads to a common latent space (STAGE 1). This latent space is regarded, for the first time, as a 'resemblance space', as it reveals certain correlations and dependencies among the views. To be exact, the one-hot encoding of a category can also be regarded as a resemblance space in its terminal phase. Moreover, because most existing multi-view clustering algorithms stem from k-means clustering and spectral clustering, they incur cubic time complexity w.r.t. the number of objects. We therefore propose Anchor-based Multi-view Subspace Clustering with Hierarchical Feature Descent (MVSC-HFD), which reduces the computational complexity to linear time through a unified sampling strategy in the resemblance space (STAGE 2), followed by subspace clustering to learn the representation collectively (STAGE 3). Extensive experimental results on public benchmark datasets demonstrate that our proposed model consistently outperforms the state-of-the-art techniques.

2310.07166

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.53)

Saha, Jayasree, Mukherjee, Jayanta

IPD: An Incremental Prototype based DBSCAN for large-scale data with cluster representatives

DBSCAN is a fundamental density-based clustering technique that identifies clusters of arbitrary shape. However, it becomes infeasible when handling big data. Centroid-based clustering, on the other hand, is important for detecting patterns in a dataset, since unprocessed data points can be labeled by their nearest centroid. However, it cannot detect non-spherical clusters. For large data, it is not feasible to store and compute the labels of every sample; these can instead be computed as and when the information is required. This can be accomplished when clustering acts as a tool to identify cluster representatives and a query is served by assigning the cluster label of the nearest representative. In this paper, we propose an Incremental Prototype-based DBSCAN (IPD) algorithm, which is designed to identify arbitrary-shaped clusters in large-scale data. Additionally, it chooses a set of representatives for each cluster.
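The query step described above, labeling a point by its nearest cluster representative, can be sketched as follows. This is a minimal illustration and not the IPD algorithm itself: it assumes the representatives and their cluster labels have already been chosen, it uses Euclidean distance, and the function name and sample data are hypothetical.

```python
import math

def nearest_representative_label(query, representatives, rep_labels):
    """Return the cluster label of the representative closest to `query`.

    Assumes Euclidean distance and at least one representative.
    """
    best_index = min(
        range(len(representatives)),
        key=lambda i: math.dist(query, representatives[i]),
    )
    return rep_labels[best_index]

# Hypothetical representatives: two per cluster, so a cluster need not be
# summarized by a single centroid.
representatives = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (11.0, 10.0)]
rep_labels = [0, 0, 1, 1]

label_a = nearest_representative_label((0.4, 0.1), representatives, rep_labels)
label_b = nearest_representative_label((10.6, 9.8), representatives, rep_labels)
print(label_a, label_b)
```

Keeping several representatives per cluster is what lets a nearest-representative lookup follow non-spherical shapes, at the cost of a linear scan over the representatives for each query in this naive form.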

algorithm, dataset, prototype, (17 more...)

2202.0787

Country:

Europe > Netherlands > North Brabant > Eindhoven (0.04)
Asia > India > West Bengal > Kharagpur (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Mosayebi, R., Kia, H., Raki, A. Kianpour

A Supervised Embedding and Clustering Anomaly Detection method for classification of Mobile Network Faults

The paper introduces Supervised Embedding and Clustering Anomaly Detection (SEMC-AD), a method designed to efficiently identify faulty alarm logs in a mobile network and alleviate the challenges of manual monitoring caused by the growing volume of alarm logs. SEMC-AD employs a supervised embedding approach based on deep neural networks, utilizing historical alarm logs and their labels to extract numerical representations for each log, effectively addressing the issue of imbalanced classification due to a small proportion of anomalies in the dataset without employing one-hot encoding. The robustness of the embedding is evaluated by plotting the two most significant principal components of the embedded alarm logs, revealing that anomalies form distinct clusters with similar embeddings. Multivariate Gaussian clustering is then applied to these components, identifying clusters with a high ratio of anomalies to normal alarms (above 90%) and labeling them as the anomaly group. To classify new alarm logs, we check whether the two most significant principal components of their embedded vectors fall within the anomaly-labeled clusters. If so, the log is classified as an anomaly. Performance evaluation demonstrates that SEMC-AD outperforms conventional random forest and gradient boosting methods without embedding. SEMC-AD achieves 99% anomaly detection, whereas random forest and XGBoost only detect 86% and 81% of anomalies, respectively. While supervised classification methods may excel in labeled datasets, the results demonstrate that SEMC-AD is more efficient in classifying anomalies in datasets with numerous categorical features, significantly enhancing anomaly detection, reducing operator burden, and improving network maintenance.
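The cluster-labeling rule in the abstract (mark a cluster as anomalous when more than 90% of its historical logs are anomalies, then classify a new log by the cluster it falls in) can be sketched as below. This is an illustrative sketch only: it assumes cluster assignments have already been produced by the embedding, PCA, and Gaussian clustering steps, and the function names and toy data are hypothetical rather than the authors' code.

```python
from collections import Counter

def find_anomaly_clusters(cluster_ids, is_anomaly, threshold=0.9):
    """Return the set of cluster ids whose anomaly ratio exceeds `threshold`."""
    totals = Counter(cluster_ids)
    anomalies = Counter(c for c, flag in zip(cluster_ids, is_anomaly) if flag)
    return {c for c, n in totals.items() if anomalies[c] / n > threshold}

def classify(cluster_id, anomaly_clusters):
    """A new log is an anomaly if it is assigned to an anomaly-labeled cluster."""
    return cluster_id in anomaly_clusters

# Toy history: cluster 0 holds 19 anomalies out of 20 logs (95%),
# cluster 1 holds 1 anomaly out of 20 logs (5%).
cluster_ids = [0] * 20 + [1] * 20
is_anomaly = [True] * 19 + [False] + [False] * 19 + [True]

anomaly_clusters = find_anomaly_clusters(cluster_ids, is_anomaly)
print(anomaly_clusters)
print(classify(0, anomaly_clusters), classify(1, anomaly_clusters))
```

The threshold is compared strictly, matching the "above 90%" wording; a cluster sitting at exactly 90% would not be labeled as anomalous in this sketch.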

anomaly, anomaly detection, dataset, (13 more...)

2310.06779

Country:

Oceania > Australia > New South Wales > Sydney (0.05)
Asia > Middle East > Iran > Tehran Province > Tehran (0.05)
North America > United States > Michigan > Oakland County > Rochester (0.04)
(2 more...)

Genre: Research Report > New Finding (0.35)

Industry: Telecommunications (0.30)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Remil, Youcef, Bendimerad, Anes, Mathonat, Romain, Raissi, Chedy, Kaytoue, Mehdi

DeepLSH: Deep Locality-Sensitive Hash Learning for Fast and Efficient Near-Duplicate Crash Report Detection

Automatic crash bucketing is a crucial phase in the software development process for efficiently triaging bug reports. It generally consists of grouping similar reports through clustering techniques. However, with real-time streaming bug collection, systems are needed to quickly answer the question "What are the most similar bugs to a new one?", that is, to efficiently find near-duplicates. It is thus natural to consider nearest-neighbor search to tackle this problem, and especially the well-known locality-sensitive hashing (LSH), to deal with large datasets, owing to its sublinear performance and theoretical guarantees on the similarity search accuracy. Surprisingly, LSH has not been considered in the crash bucketing literature. It is indeed not trivial to derive hash functions that satisfy the so-called locality-sensitive property for the most advanced crash bucketing metrics. Consequently, we study in this paper how to leverage LSH for this task. To be able to consider the most relevant metrics used in the literature, we introduce DeepLSH, a Siamese DNN architecture with an original loss function, that perfectly approximates the locality-sensitivity property even for Jaccard and Cosine metrics for which exact LSH solutions exist. We support this claim with a series of experiments on an original dataset, which we make available.
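As background for the Jaccard case mentioned above, where an exact LSH family is known, the classical MinHash construction can be sketched as follows. This is not DeepLSH: it is the standard hand-crafted scheme, shown only to illustrate the locality-sensitive property (the probability that two signature entries agree equals the Jaccard similarity of the sets). Treating a crash report as a non-empty set of stack-frame names, and all names and parameters below, are assumptions made for the example.

```python
import random
import zlib

PRIME = 2_147_483_647  # Mersenne prime 2**31 - 1, used as the hash modulus

def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]

def minhash_signature(tokens, hash_params):
    """Return the MinHash signature of a non-empty set of string tokens."""
    # crc32 gives a stable integer id per token (unlike the built-in hash()).
    ids = [zlib.crc32(t.encode("utf-8")) for t in tokens]
    return [min((a * x + b) % PRIME for x in ids) for a, b in hash_params]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    return sum(1 for x, y in zip(sig_a, sig_b) if x == y) / len(sig_a)

def exact_jaccard(a, b):
    return len(a & b) / len(a | b)

params = make_hash_params(num_hashes=128)
report_a = {"main", "parse_config", "read_file", "open"}    # hypothetical frames
report_b = {"main", "parse_config", "read_file", "close"}
sig_a = minhash_signature(report_a, params)
sig_b = minhash_signature(report_b, params)
print(exact_jaccard(report_a, report_b))    # 3 shared frames out of 5 distinct
print(estimated_jaccard(sig_a, sig_b))      # approximation of the value above
```

In a full LSH index the signatures would additionally be split into bands and bucketed so that candidate near-duplicates are retrieved without comparing against every stored report; that step is omitted here.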

hash function, similarity measure, stack trace, (10 more...)

2310.06703

Country:

Europe > France (0.04)
North America > United States > Nevada > Clark County > Las Vegas (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Software Engineering (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)

Automatic nodule identification and differentiation in ultrasound videos to facilitate per-nodule examination

Jiang, Siyuan, Ding, Yan, Wang, Yuling, Xu, Lei, Dai, Wenli, Chang, Wanru, Zhang, Jianfeng, Yu, Jie, Zhou, Jianqiao, Zhang, Chunquan, Liang, Ping, Kong, Dexing

Ultrasound is a vital diagnostic technique in health screening, with the advantages of being non-invasive, cost-effective, and radiation-free, and is therefore widely applied in the diagnosis of nodules. However, it relies heavily on the expertise and clinical experience of the sonographer. In ultrasound images, a single nodule might present heterogeneous appearances in different cross-sectional views, which makes it hard to perform per-nodule examination. Sonographers usually discriminate between different nodules by examining the nodule features and the surrounding structures, such as glands and ducts, which is cumbersome and time-consuming. To address this problem, we collected hundreds of breast ultrasound videos and built a nodule re-identification system that consists of two parts: an extractor based on a deep learning model that extracts feature vectors from the input video clips, and a real-time clustering algorithm that automatically groups feature vectors by nodule. The system obtains satisfactory results and exhibits the capability to differentiate ultrasound videos. As far as we know, this is the first attempt to apply a re-identification technique in the ultrasonic field.

algorithm, nodule, tracklet, (16 more...)

2310.06339

Country:

Asia > China > Zhejiang Province > Hangzhou (0.04)
Asia > China > Jiangxi Province > Nanchang (0.04)
Asia > China > Shanghai > Shanghai (0.04)
(4 more...)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.90)

Mateo-Gabín, Andrés, Tlales, Kenza, Valero, Eusebio, Ferrer, Esteban, Rubio, Gonzalo

An unsupervised machine-learning-based shock sensor for high-order supersonic flow solvers

arXiv.org Artificial Intelligence, Oct-9-2023

We present a novel unsupervised machine-learning shock sensor based on Gaussian Mixture Models (GMMs). The proposed GMM sensor demonstrates remarkable accuracy in detecting shocks and is robust across diverse test cases with significantly less parameter tuning than other options. We compare the GMM-based sensor with state-of-the-art alternatives. All methods are integrated into a high-order compressible discontinuous Galerkin solver, where two stabilization approaches are coupled to the sensor to provide examples of possible applications. The Sedov blast and double Mach reflection cases demonstrate that our proposed sensor can enhance hybrid sub-cell flux-differencing formulations by providing accurate information about the nodes that require low-order blending. In addition, supersonic test cases at high Reynolds numbers showcase the sensor's performance when used to introduce entropy-stable artificial viscosity to capture shocks, demonstrating the same effectiveness as fine-tuned state-of-the-art sensors. The adaptive nature and ability to function without extensive training datasets make this GMM-based sensor suitable for complex geometries and varied flow configurations. Our study reveals the potential of unsupervised machine-learning methods, exemplified by this GMM sensor, to improve the robustness and efficiency of advanced CFD codes.

artificial intelligence, machine learning, sensor, (19 more...)

2308.00086

Country:

Europe > United Kingdom (0.28)
Europe > Spain (0.14)
South America > Brazil > Rio de Janeiro (0.14)
North America > United States (0.14)

Genre: Research Report > New Finding (0.67)

Industry: Energy > Oil & Gas > Upstream (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

arXiv.org Artificial Intelligence, Oct-9-2023

Find Your Optimal Assignments On-the-fly: A Holistic Framework for Clustered Federated Learning

Guo, Yongxin, Tang, Xiaoying, Lin, Tao

Federated Learning (FL) is an emerging distributed machine learning approach that preserves client privacy by storing data on edge devices. However, data heterogeneity among clients presents challenges in training models that perform well on all local distributions. Recent studies have proposed clustering as a solution to tackle client heterogeneity in FL by grouping clients with distribution shifts into different clusters. However, the diverse learning frameworks used in current clustered FL methods make it challenging to integrate various clustered FL methods, gather their benefits, and make further improvements. To this end, this paper presents a comprehensive investigation into current clustered FL methods and proposes a four-tier framework, namely HCFL, to encompass and extend existing approaches. Based on the HCFL, we identify the remaining challenges associated with current clustering methods in each tier and propose an enhanced clustering method called HCFL+ to address these challenges. Through extensive numerical evaluations, we showcase the effectiveness of our clustering framework and the improved components. Our code will be publicly available.

algorithm, federated learning, probability, (14 more...)

2310.05397

Country:

Asia > China (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.92)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

arXiv.org Artificial Intelligence, Oct-9-2023

FedRC: Tackling Diverse Distribution Shifts Challenge in Federated Learning by Robust Clustering

Guo, Yongxin, Tang, Xiaoying, Lin, Tao

Federated Learning (FL) is a machine learning paradigm that safeguards privacy by retaining client data on edge devices. However, optimizing FL in practice can be challenging due to the diverse and heterogeneous nature of the learning system. Though recent research has focused on improving the optimization of FL when distribution shifts occur among clients, ensuring global performance when multiple types of distribution shifts occur simultaneously among clients -- such as feature distribution shift, label distribution shift, and concept shift -- remains under-explored. In this paper, we identify the learning challenges posed by the simultaneous occurrence of diverse distribution shifts and propose a clustering principle to overcome these challenges. Through our research, we find that existing methods fail to address the clustering principle. Therefore, we propose a novel clustering algorithm framework, dubbed FedRC, which adheres to our proposed clustering principle by incorporating a bi-level optimization problem and a novel objective function. Extensive experiments demonstrate that FedRC significantly outperforms other SOTA cluster-based FL methods. Our code will be publicly available.

algorithm, concept shift, fedrc, (13 more...)

2301.12379

Country:

Asia > China (0.04)
Oceania > Australia > New South Wales (0.04)
Europe > United Kingdom > Wales (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.87)