AITopics

2407.11579

Country:

Asia > Japan > Honshū > Chūbu > Toyama Prefecture > Toyama (0.05)
Europe > Spain > Galicia > Madrid (0.05)
North America > United States > New York (0.04)
(7 more...)

Genre: Research Report > New Finding (0.68)

Industry:

Telecommunications (0.68)
Information Technology (0.68)
Health & Medicine (0.49)

Technology:

Information Technology > Communications (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.52)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)

Ghasemloo, Mohammadmahdi, Eckman, David J.

Agglomerative Clustering of Simulation Output Distributions Using Regularized Wasserstein Distance

arXiv.org Machine LearningJul-16-2024

We investigate the use of clustering methods on data produced by a stochastic simulator, with applications in anomaly detection, pre-optimization, and online monitoring. We introduce an agglomerative clustering algorithm that clusters multivariate empirical distributions using the regularized Wasserstein distance and apply the proposed methodology on a call-center model.

agglomerative clustering, regularized wasserstein distance, simulation output distribution, (1 more...)

arXiv.org Machine Learning

2407.121

Genre: Research Report (0.40)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Cass, Thomas, Gonon, Lukas, Zozoulenko, Nikita

Variance Norms for Kernelized Anomaly Detection

arXiv.org Machine LearningJul-16-2024

We present a unified theory for Mahalanobis-type anomaly detection on Banach spaces, using ideas from Cameron-Martin theory applied to non-Gaussian measures. This approach leads to a basis-free, data-driven notion of anomaly distance through the so-called variance norm of a probability measure, which can be consistently estimated using empirical measures. Our framework generalizes the classical $\mathbb{R}^d$, functional $(L^2[0,1])^d$, and kernelized settings, including the general case of non-injective covariance operator. We prove that the variance norm depends solely on the inner product in a given Hilbert space, and hence that the kernelized Mahalanobis distance can naturally be recovered by working on reproducing kernel Hilbert spaces. Using the variance norm, we introduce the notion of a kernelized nearest-neighbour Mahalanobis distance for semi-supervised anomaly detection. In an empirical study on 12 real-world datasets, we demonstrate that the kernelized nearest-neighbour Mahalanobis distance outperforms the traditional kernelized Mahalanobis distance for multivariate time series anomaly detection, using state-of-the-art time series kernels such as the signature, global alignment, and Volterra reservoir kernels. Moreover, we provide an initial theoretical justification of nearest-neighbour Mahalanobis distances by developing concentration inequalities in the finite-dimensional Gaussian case.

kernel, mahalanobis distance, variance norm, (14 more...)

arXiv.org Machine Learning

2407.11873

Country:

North America > United States > Texas > Dallas County > Dallas (0.04)
North America > United States > New York (0.04)
North America > United States > Nevada > Clark County > Las Vegas (0.04)
(4 more...)

Genre: Research Report (0.81)

Industry:

Information Technology (0.67)
Education (0.46)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)

Singh, Prabhant, Gijsbers, Pieter, Yildirim, Murat Onur, Gok, Elif Ceren, Vanschoren, Joaquin

CLAMS: A System for Zero-Shot Model Selection for Clustering

arXiv.org Artificial IntelligenceJul-15-2024

We propose an AutoML system that enables model selection on clustering problems by leveraging optimal transport-based dataset similarity. Our objective is to establish a comprehensive AutoML pipeline for clustering problems and provide recommendations for selecting the most suitable algorithms, thus opening up a new area of AutoML beyond the traditional supervised learning settings. We compare our results against multiple clustering baselines and find that it outperforms all of them, hence demonstrating the utility of similarity-based automated model selection for solving clustering applications.

algorithm, dataset, international conference, (13 more...)

2407.11286

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > New York > New York County > New York City (0.04)
Europe > Netherlands > North Brabant > Eindhoven (0.04)
Asia > China (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Health & Medicine > Therapeutic Area > Nephrology (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

arXiv.org Artificial IntelligenceJul-14-2024

A3S: A General Active Clustering Method with Pairwise Constraints

Deng, Xun, Liu, Junlong, Zhong, Han, Feng, Fuli, Shen, Chen, He, Xiangnan, Ye, Jieping, Wang, Zheng

Active clustering aims to boost the clustering performance by integrating human-annotated pairwise constraints through strategic querying. Conventional approaches with semi-supervised clustering schemes encounter high query costs when applied to large datasets with numerous classes. To address these limitations, we propose a novel Adaptive Active Aggregation and Splitting (A3S) framework, falling within the cluster-adjustment scheme in active clustering. A3S features strategic active clustering adjustment on the initial cluster result, which is obtained by an adaptive clustering algorithm. In particular, our cluster adjustment is inspired by the quantitative analysis of Normalized mutual information gain under the information theory framework and can provably improve the clustering quality. The proposed A3S framework significantly elevates the performance and scalability of active clustering. In extensive experiments across diverse real-world datasets, A3S achieves desired results with significantly fewer human queries compared with existing methods.

constraint, general active clustering method, query, (12 more...)

2407.10196

Country:

Europe > Austria > Vienna (0.14)
South America > Paraguay > Asunción > Asunción (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

arXiv.org Artificial IntelligenceJul-13-2024

Deep Online Probability Aggregation Clustering

Yan, Yuxuan, Lu, Na, Yan, Ruofan

Combining machine clustering with deep models has shown remarkable superiority in deep clustering. It modifies the data processing pipeline into two alternating phases: feature clustering and model training. However, such alternating schedule may lead to instability and computational burden issues. We propose a centerless clustering algorithm called Probability Aggregation Clustering (PAC) to proactively adapt deep learning technologies, enabling easy deployment in online deep clustering. PAC circumvents the cluster center and aligns the probability space and distribution space by formulating clustering as an optimization problem with a novel objective function. Based on the computation mechanism of the PAC, we propose a general online probability aggregation module to perform stable and flexible feature clustering over mini-batch data and further construct a deep visual clustering framework deep PAC (DPAC). Extensive experiments demonstrate that PAC has superior clustering robustness and performance and DPAC remarkably outperforms the state-of-the-art deep clustering methods.

learning, pac, proceedings, (12 more...)

2407.05246

Country:

North America > United States > Oregon (0.04)
Asia > China > Shaanxi Province > Xi'an (0.04)

Genre: Research Report (0.40)

Industry: Information Technology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

URRL-IMVC: Unified and Robust Representation Learning for Incomplete Multi-View Clustering

Teng, Ge, Mao, Ting, Shen, Chen, Tian, Xiang, Liu, Xuesong, Chen, Yaowu, Ye, Jieping

Incomplete multi-view clustering (IMVC) aims to cluster multi-view data that are only partially available. This poses two main challenges: effectively leveraging multi-view information and mitigating the impact of missing views. Prevailing solutions employ cross-view contrastive learning and missing view recovery techniques. However, they either neglect valuable complementary information by focusing only on consensus between views or provide unreliable recovered views due to the absence of supervision. To address these limitations, we propose a novel Unified and Robust Representation Learning for Incomplete Multi-View Clustering (URRL-IMVC). URRL-IMVC directly learns a unified embedding that is robust to view missing conditions by integrating information from multiple views and neighboring samples. Firstly, to overcome the limitations of cross-view contrastive learning, URRL-IMVC incorporates an attention-based auto-encoder framework to fuse multi-view information and generate unified embeddings. Secondly, URRL-IMVC directly enhances the robustness of the unified embedding against view-missing conditions through KNN imputation and data augmentation techniques, eliminating the need for explicit missing view recovery. Finally, incremental improvements are introduced to further enhance the overall performance, such as the Clustering Module and the customization of the Encoder. We extensively evaluate the proposed URRL-IMVC framework on various benchmark datasets, demonstrating its state-of-the-art performance. Furthermore, comprehensive ablation studies are performed to validate the effectiveness of our design.

clustering, information, representation, (14 more...)

doi: 10.1145/3637528.3671887

2407.0912

Country:

North America > United States > New York > New York County > New York City (0.14)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.05)
Asia > China > Zhejiang Province > Hangzhou (0.05)
Europe > Finland > Uusimaa > Helsinki (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

BoBa: Boosting Backdoor Detection through Data Distribution Inference in Federated Learning

Wang, Ning, Shi, Shanghao, Xiao, Yang, Chen, Yimin, Hou, Y. Thomas, Lou, Wenjing

Federated learning, while being a promising approach for collaborative model training, is susceptible to poisoning attacks due to its decentralized nature. Backdoor attacks, in particular, have shown remarkable stealthiness, as they selectively compromise predictions for inputs containing triggers. Previous endeavors to detect and mitigate such attacks are based on the Independent and Identically Distributed (IID) data assumption where benign model updates exhibit high-level similarity in multiple feature spaces due to IID data. Thus, outliers are detected as backdoor attacks. Nevertheless, non-IID data presents substantial challenges in backdoor attack detection, as the data variety introduces variance among benign models, making outlier detection-based mechanisms less effective. We propose a novel distribution-aware anomaly detection mechanism, BoBa, to address this problem. In order to differentiate outliers arising from data variety versus backdoor attack, we propose to break down the problem into two steps: clustering clients utilizing their data distribution followed by a voting-based detection. Based on the intuition that clustering and subsequent backdoor detection can drastically benefit from knowing client data distributions, we propose a novel data distribution inference mechanism. To improve detection robustness, we introduce an overlapping clustering method, where each client is associated with multiple clusters, ensuring that the trustworthiness of a model update is assessed collectively by multiple clusters rather than a single cluster. Through extensive evaluations, we demonstrate that BoBa can reduce the attack success rate to lower than 0.001 while maintaining high main task accuracy across various attack strategies and experimental settings.

accuracy, data distribution, scenario, (16 more...)

2407.09658

Country:

Oceania > Australia (0.04)
North America > United States > Virginia (0.04)
North America > United States > Massachusetts (0.04)
(4 more...)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Data Science > Data Mining > Anomaly Detection (0.88)

Gupta, Shivam, Tarushi, null, Wangzes, Tsering, Jain, Shweta

Fair Federated Data Clustering through Personalization: Bridging the Gap between Diverse Data Distributions

The rapid growth of data from edge devices has catalyzed the performance of machine learning algorithms. However, the data generated resides at client devices thus there are majorly two challenge faced by traditional machine learning paradigms - centralization of data for training and secondly for most the generated data the class labels are missing and there is very poor incentives to clients to manually label their data owing to high cost and lack of expertise. To overcome these issues, there have been initial attempts to handle unlabelled data in a privacy preserving distributed manner using unsupervised federated data clustering. The goal is partition the data available on clients into $k$ partitions (called clusters) without actual exchange of data. Most of the existing algorithms are highly dependent on data distribution patterns across clients or are computationally expensive. Furthermore, due to presence of skewed nature of data across clients in most of practical scenarios existing models might result in clients suffering high clustering cost making them reluctant to participate in federated process. To this, we are first to introduce the idea of personalization in federated clustering. The goal is achieve balance between achieving lower clustering cost and at same time achieving uniform cost across clients. We propose p-FClus that addresses these goal in a single round of communication between server and clients. We validate the efficacy of p-FClus against variety of federated datasets showcasing it's data independence nature, applicability to any finite $\ell$-norm, while simultaneously achieving lower cost and variance.

algorithm, dataset, federated learning, (13 more...)

2407.04302

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.04)
Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
Oceania > Australia > Australian Capital Territory > Canberra (0.04)
(4 more...)

Genre:

Research Report (1.00)
Overview (0.93)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area (0.93)
Law Enforcement & Public Safety (0.67)
Transportation > Ground > Road (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Goncalves, Jorge da Silva, Manduchi, Laura, Vandenhirtz, Moritz, Vogt, Julia E.

Structured Generations: Using Hierarchical Clusters to guide Diffusion Models

This paper introduces Diffuse-TreeVAE, a deep generative model that integrates hierarchical clustering into the framework of Denoising Diffusion Probabilistic Models (DDPMs). The proposed approach generates new images by sampling from a root embedding of a learned latent tree VAE-based structure, it then propagates through hierarchical paths, and utilizes a second-stage DDPM to refine and generate distinct, high-quality images for each data cluster. The result is a model that not only improves image clarity but also ensures that the generated samples are representative of their respective clusters, addressing the limitations of previous VAE-based methods and advancing the state of clustering-based generative modeling.

diffuse-treevae model, hierarchical cluster, structured generation, (12 more...)

2407.06124

Country:

Europe > Austria > Vienna (0.14)
Europe > Switzerland > Zürich > Zürich (0.05)
North America > United States > California > Santa Clara County > Stanford (0.04)
North America > Canada > Alberta > Census Division No. 15 > Improvement District No. 9 > Banff (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.35)