AITopics

2302.0706

Country:

North America > Canada > Alberta (0.14)
Europe > Poland > Masovia Province > Warsaw (0.04)
Asia > Middle East > Saudi Arabia > Mecca Province > Jeddah (0.04)

Genre:

Research Report (0.64)
Workflow (0.48)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Information Management (0.94)

arXiv.org Artificial Intelligence, Feb-14-2023

Multi-Prototypes Convex Merging Based K-Means Clustering Algorithm

Li, Dong, Zhou, Shuisheng, Zeng, Tieyong, Chan, Raymond H.

The K-Means algorithm is a popular clustering method. However, it has two limitations: 1) it gets stuck easily in spurious local minima, and 2) the number of clusters k has to be given a priori. To solve these two issues, a multi-prototypes convex merging based K-Means clustering algorithm (MCKM) is presented. First, based on the structure of the spurious local minima of the K-Means problem, a multi-prototypes sampling (MPS) is designed to select the appropriate number of multi-prototypes for data with arbitrary shapes. A theoretical proof is given to guarantee that the multi-prototypes selected by MPS can achieve a constant factor approximation to the optimal cost of the K-Means problem. Then, a merging technique, called convex merging (CM), merges the multi-prototypes to get a better local minimum without k being given a priori. Specifically, CM can obtain the optimal merging and estimate the correct k. By integrating these two techniques with the K-Means algorithm, the proposed MCKM is an efficient and explainable clustering algorithm for escaping the undesirable local minima of the K-Means problem without k being given first. Experimental results on synthetic and real-world data sets have verified the effectiveness of the proposed algorithm.
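The first limitation named in the abstract, getting stuck in a spurious local minimum, can be reproduced on four points. The sketch below is plain Lloyd-style K-Means, not the MPS or CM procedures of the paper; the function names `sq_dist`, `assign`, and `lloyd` and the toy data are assumptions made for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def lloyd(points, init_centers, max_iter=100):
    """Plain Lloyd iterations from the given initial centers (illustrative only)."""
    centers = [tuple(float(x) for x in c) for c in init_centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                # New center is the coordinate-wise mean of its members.
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    cost = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return centers, labels, cost


# Four corners of a 4 x 1 rectangle (hypothetical toy data).
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Centers that start on the left and right sides find the low-cost split.
_, _, good_cost = lloyd(points, [(0.0, 0.5), (4.0, 0.5)])

# Centers that start on the bottom and top edges never move: a spurious local minimum.
_, _, bad_cost = lloyd(points, [(2.0, 0.0), (2.0, 1.0)])

print(good_cost, bad_cost)  # 1.0 16.0
```

Both runs use k = 2 on the same data; only the initialization differs, which is the sensitivity that the abstract says MCKM is designed to escape.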

algorithm, artificial intelligence, machine learning, (16 more...)

2302.07045

Country:

Asia > China > Hong Kong (0.04)
North America > United States > California (0.04)
Asia > Middle East > Jordan (0.04)
Asia > China > Shaanxi Province > Xi'an (0.04)

Genre:

Research Report (0.64)
Workflow (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

C, Simo Alami., Kaddah, Rim, Read, Jesse

Transferable Deep Metric Learning for Clustering

arXiv.org Artificial Intelligence, Feb-13-2023

Clustering in high-dimensional spaces is a difficult task; the usual distance metrics may no longer be appropriate under the curse of dimensionality. Indeed, the choice of the metric is crucial, and it is highly dependent on the dataset characteristics. However, a single metric could be used to correctly perform clustering on multiple datasets of different domains. We propose to do so, providing a framework for learning a transferable metric. We show that we can learn a metric on a labelled dataset, then apply it to cluster a different dataset, using an embedding space that characterises a desired clustering in the generic sense. We learn and test such metrics on several datasets of variable complexity (synthetic, MNIST, SVHN, omniglot) and achieve results competitive with the state-of-the-art while using only a small number of labelled training datasets and shallow networks.

artificial intelligence, dataset, machine learning, (15 more...)

2302.06523

Country: Europe > France (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Clare, Mariana C A, Warder, Simon C, Neal, Robert, Bhaskaran, B, Piggott, Matthew D

An unsupervised learning approach for predicting wind farm power and downstream wakes using weather patterns

arXiv.org Artificial Intelligence, Feb-12-2023

Wind energy resource assessment typically requires numerical models, but such models are too computationally intensive to consider multi-year timescales. Increasingly, unsupervised machine learning techniques are used to identify a small number of representative weather patterns to simulate long-term behaviour. Here we develop a novel wind energy workflow that, for the first time, combines weather patterns derived from unsupervised clustering techniques with numerical weather prediction models (here WRF) to obtain efficient and accurate long-term predictions of power and downstream wakes from an entire wind farm. We use ERA5 reanalysis data, clustering not only on low-altitude pressure but also, for the first time, on the more relevant variable of wind velocity. We also compare the use of large-scale and local-scale domains for clustering. A WRF simulation is run at each of the cluster centres and the results are aggregated using a novel post-processing technique. By applying our workflow to two different regions, we show that our long-term predictions agree with those from a year of WRF simulations but require less than 2% of the computational time. The most accurate results are obtained when clustering on wind velocity. Moreover, clustering over the Europe-wide domain is sufficient for predicting wind farm power output, but downstream wake predictions benefit from the use of smaller domains. Finally, we show that these downstream wakes can affect the local weather patterns. Our approach facilitates multi-year predictions of power output and downstream farm wakes, by providing a fast, accurate and flexible methodology that is applicable to any global region. Moreover, these accurate long-term predictions of downstream wakes provide the first tool to help mitigate the effects of wind energy loss downstream of wind farms, since they can be used to determine optimum wind farm locations.
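The abstract does not spell out its post-processing technique. As a rough illustration of the general idea of combining one simulation per cluster centre into a long-term estimate, the sketch below assumes a simple occurrence-frequency weighting, which may differ from the authors' method. The function name `long_term_mean`, the pattern labels, and the power values are all hypothetical.

```python
from collections import Counter


def long_term_mean(pattern_per_day, value_at_centre):
    """Frequency-weighted mean of per-pattern simulation results.

    pattern_per_day : cluster label assigned to each day of the record
    value_at_centre : simulated quantity (e.g. farm power) at each cluster centre
    Assumption: each day is represented by the simulation at its cluster centre.
    """
    counts = Counter(pattern_per_day)
    n_days = len(pattern_per_day)
    return sum(counts[k] * value_at_centre[k] for k in counts) / n_days


# Hypothetical record: eight days assigned to three weather patterns.
days = [0, 0, 1, 2, 2, 2, 1, 0]
# Hypothetical simulated farm power (MW) at each pattern's centre.
power = {0: 10.0, 1: 20.0, 2: 40.0}

print(long_term_mean(days, power))  # 23.75
```

Under this assumption only three simulations are needed instead of eight, which mirrors the source of the computational saving reported in the abstract.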

artificial intelligence, machine learning, prediction, (15 more...)

2302.05886

Country:

Europe > Denmark (0.08)
Europe > United Kingdom > Scotland > Shetland (0.07)
Europe > North Sea (0.04)

Genre: Workflow (1.00)

Industry: Energy > Renewable > Wind (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

arXiv.org Artificial Intelligence, Feb-11-2023

Fairness-aware Multi-view Clustering

Zheng, Lecheng, Zhu, Yada, He, Jingrui

In the era of big data, we often face the challenge of data heterogeneity and the lack of label information simultaneously. In the financial domain (e.g., fraud detection), the heterogeneous data may include not only numerical data (e.g., total debt and yearly income), but also text and images (e.g., financial statement and invoice images). At the same time, the label information (e.g., fraudulent transactions) may be missing for building predictive models. To address these challenges, many state-of-the-art multi-view clustering methods have been proposed and achieved outstanding performance. However, these methods typically do not take into consideration the fairness aspect and are likely to generate biased results using sensitive information such as race and gender. Therefore, in this paper, we propose a fairness-aware multi-view clustering method named FairMVC. It incorporates the group fairness constraint into the soft membership assignment for each cluster to ensure that the fraction of different groups in each cluster is approximately identical to that in the entire data set. Meanwhile, we adopt the idea of both contrastive learning and non-contrastive learning and propose novel regularizers to handle heterogeneous data in complex scenarios with missing data or noisy features. Experimental results on real-world data sets demonstrate the effectiveness and efficiency of the proposed framework. We also derive insights regarding the relative performance of the proposed regularizers in various scenarios.
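The group fairness notion described above, that each cluster should contain the groups in roughly the same proportions as the whole data set, can be checked after the fact on hard assignments. The sketch below is only such a diagnostic under that assumption; it is not FairMVC's soft-membership constraint, and the names `group_fractions` and `balance_gap` are hypothetical.

```python
from collections import Counter


def group_fractions(groups):
    """Fraction of each sensitive group within a list of group labels."""
    counts = Counter(groups)
    return {g: c / len(groups) for g, c in counts.items()}


def balance_gap(cluster_labels, groups):
    """Largest absolute gap between a group's share in any cluster and
    its share in the whole data set (0.0 means perfectly proportional)."""
    overall = group_fractions(groups)
    gap = 0.0
    for c in set(cluster_labels):
        members = [g for l, g in zip(cluster_labels, groups) if l == c]
        local = group_fractions(members)
        for g, frac in overall.items():
            gap = max(gap, abs(local.get(g, 0.0) - frac))
    return gap


# Hypothetical data: eight samples, two sensitive groups of equal size.
groups = ["a", "a", "b", "b", "a", "a", "b", "b"]

proportional = [0, 0, 0, 0, 1, 1, 1, 1]  # each cluster is half "a", half "b"
segregated = [0, 0, 1, 1, 0, 0, 1, 1]    # clusters coincide with the groups

print(balance_gap(proportional, groups))  # 0.0
print(balance_gap(segregated, groups))    # 0.5
```

A gap near zero corresponds to the "approximately identical" condition in the abstract; a clustering that simply recovers the sensitive attribute yields the largest gap.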

artificial intelligence, machine learning, sensitive feature, (14 more...)

2302.05788

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Illinois > Champaign County > Urbana (0.04)
Asia > Taiwan (0.04)

Genre: Research Report (0.64)

Industry:

Information Technology (0.48)
Government (0.46)
Banking & Finance (0.30)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.87)

arXiv.org Artificial Intelligence, Feb-10-2023

Clustered Embedding Learning for Recommender Systems

Chen, Yizhou, Huzhang, Guangda, Zeng, Anxiang, Yu, Qingtao, Sun, Hui, Li, Heng-yi, Li, Jingyi, Ni, Yabo, Yu, Han, Zhou, Zhiming

In recent years, recommender systems have advanced rapidly, with embedding learning for users and items playing a critical role. A standard method learns a unique embedding vector for each user and item. However, such a method has two important limitations in real-world applications: 1) it is hard to learn embeddings that generalize well for users and items with rare interactions on their own; and 2) it may incur unbearably high memory costs when the number of users and items scales up. Existing approaches either address only one of the limitations or have flawed overall performance. In this paper, we propose Clustered Embedding Learning (CEL) as an integrated solution to these two problems. CEL is a plug-and-play embedding learning framework that can be combined with any differentiable feature interaction model. It is capable of achieving improved performance, especially for cold users and items, with reduced memory cost. CEL enables automatic and dynamic clustering of users and items in a top-down fashion, where clustered entities jointly learn a shared embedding. The accelerated version of CEL has an optimal time complexity, which supports efficient online updates. Theoretically, we prove the identifiability and the existence of a unique optimal number of clusters for CEL in the context of nonnegative matrix factorization. Empirically, we validate the effectiveness of CEL on three public datasets and one business dataset, showing its consistently superior performance against current state-of-the-art methods. In particular, when incorporating CEL into the business model, it brings an improvement of $+0.6\%$ in AUC, which translates into a significant revenue gain; meanwhile, the size of the embedding table gets $2650$ times smaller.

artificial intelligence, cel, machine learning, (14 more...)

2302.01478

Country:

Asia > Singapore (0.14)
North America > United States > Texas > Travis County > Austin (0.05)
Asia > China > Shanghai > Shanghai (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Media > Film (1.00)
Leisure & Entertainment (1.00)
Education > Educational Setting > Online (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.85)

Robinson, Michael, Altman, Tate, Lam, Denley, Li, Letitia W.

Unsupervised clustering of file dialects according to monotonic decompositions of mixtures

This paper proposes an unsupervised classification method that partitions a set of files into non-overlapping dialects based upon their behaviors, determined by messages produced by a collection of programs that consume them. The pattern of messages can be used as the signature of a particular kind of behavior, with the understanding that some messages are likely to co-occur, while others are not. Patterns of messages can be used to classify files into dialects. A dialect is defined by a subset of messages, called the required messages. Once files are conditioned upon dialect and its required messages, the remaining messages are statistically independent. With this definition of dialect in hand, we present a greedy algorithm that deduces candidate dialects from a dataset consisting of a matrix of file-message data, demonstrate its performance on several file formats, and prove conditions under which it is optimal. We show that an analyst needs to consider fewer dialects than distinct message patterns, which reduces their cognitive load when studying a complex format.

artificial intelligence, decomposition, machine learning, (19 more...)

2304.09082

Country:

North America > United States > Virginia > Arlington County > Arlington (0.04)
North America > United States > District of Columbia > Washington (0.04)
North America > United States > California (0.04)

Genre: Research Report (0.50)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.40)

Hybridization of K-means with improved firefly algorithm for automatic clustering in high dimension

Alam, Afroj

K-means clustering is the best-known partitioning algorithm among clustering methods; it partitions data objects into multiple clusters with little effort. However, choosing an appropriate number of clusters for K-means without prior domain knowledge about the dataset is challenging, especially for high-dimensional data. Hence, we have implemented the Silhouette and Elbow methods with PCA to find an optimal number of clusters. Many nature-inspired meta-heuristic swarm intelligence algorithms have also been employed previously to handle the automatic data clustering problem. The Firefly algorithm is efficient and robust for automatic clustering. However, in the Firefly algorithm the entire population is automatically subdivided into sub-populations, which slows convergence and leads to trapping in local minima in high-dimensional optimization problems. Thus, our study proposes an enhanced Firefly approach, i.e., a hybridized K-means with an ODFA model for automatic clustering. The experimental part shows the output and graphs of the Silhouette and Elbow methods as well as the Firefly algorithm.
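The Silhouette method mentioned above scores a candidate clustering by comparing, for each point, the mean distance to its own cluster (a) with the smallest mean distance to another cluster (b), giving (b - a) / max(a, b). The sketch below is a minimal version for one-dimensional data with at least two clusters. It is not the paper's pipeline (no PCA, no Firefly search), and the function name `mean_silhouette` and the toy data are assumptions.

```python
def mean_silhouette(points, labels):
    """Mean silhouette coefficient for 1-D points with hard cluster labels.

    Assumes at least two distinct labels; singleton clusters score 0.0.
    """
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)

    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for a cluster of one point
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(abs(p - q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(abs(p - q) for q in other) / len(other)
                for key, other in clusters.items() if key != l)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated pairs on a line.
points = [0.0, 1.0, 10.0, 11.0]

natural = mean_silhouette(points, [0, 0, 1, 1])  # pairs kept together
mixed = mean_silhouette(points, [0, 1, 0, 1])    # pairs split apart

print(natural > mixed)  # True
```

In a model-selection loop, the same score would be computed for the clusterings produced at each candidate number of clusters, and the value with the highest mean silhouette would be preferred.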

algorithm, artificial intelligence, machine learning, (16 more...)

2302.10765

Country:

Asia > India > Uttar Pradesh > Lucknow (0.04)
Asia > Singapore (0.04)
Africa > Mali (0.04)

Genre: Research Report (0.64)

Industry: Health & Medicine > Therapeutic Area (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Tanwisuth, Korawat, Zhang, Shujian, He, Pengcheng, Zhou, Mingyuan

A Prototype-Oriented Clustering for Domain Shift with Source Privacy

Unsupervised clustering under domain shift (UCDS) studies how to transfer the knowledge from abundant unlabeled data from multiple source domains to learn the representation of the unlabeled data in a target domain. In this paper, we introduce Prototype-oriented Clustering with Distillation (PCD) to not only improve the performance and applicability of existing methods for UCDS, but also address concerns about protecting the privacy of both the data and model of the source domains. PCD first constructs a source clustering model by aligning the distributions of prototypes and data. It then distills the knowledge to the target model through cluster labels provided by the source model while simultaneously clustering the target data. Finally, it refines the target model on the target domain data without guidance from the source model. Experiments across multiple benchmarks show the effectiveness and generalizability of our source-private clustering method.

artificial intelligence, arxiv preprint arxiv, machine learning, (14 more...)

2302.03807

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
Europe > France > Hauts-de-France > Nord > Lille (0.04)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (0.46)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Caville, Evan, Lo, Wai Weng, Layeghy, Siamak, Portmann, Marius

Anomal-E: A Self-Supervised Network Intrusion Detection System based on Graph Neural Networks

This paper investigates the application of Graph Neural Networks (GNNs) to self-supervised network intrusion and anomaly detection. GNNs are a deep learning approach for graph-based data that incorporate graph structures into learning to generalise graph representations and output embeddings. As network flows are naturally graph-based, GNNs are a suitable fit for analysing and learning network behaviour. The majority of current implementations of GNN-based Network Intrusion Detection Systems (NIDSs) rely heavily on labelled network traffic, which can restrict not only the amount and structure of input traffic, but also the NIDSs' potential to adapt to unseen attacks. To overcome these restrictions, we present Anomal-E, a GNN approach to intrusion and anomaly detection that leverages edge features and graph topological structure in a self-supervised process. This approach is, to the best of our knowledge, the first successful and practical approach to network intrusion detection that utilises network flows in a self-supervised, edge-leveraging GNN. Experimental results on two modern benchmark NIDS datasets not only clearly display the improvement of using Anomal-E embeddings rather than raw features, but also the potential Anomal-E has for detection on wild network traffic.

data mining, detection, machine learning, (20 more...)

doi: 10.1016/j.knosys.2022.110030

2207.06819

Country:

Oceania > Australia > Queensland > Brisbane (0.04)
North America > United States > Florida > Miami-Dade County > Coral Gables (0.04)
North America > United States > District of Columbia > Washington (0.04)
Asia (0.04)

Genre:

Overview (1.00)
Research Report > New Finding (0.67)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Communications > Networks (1.00)