AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

A local approach to parameter space reduction for regression and classification tasks

Romor, Francesco, Tezzele, Marco, Rozza, Gianluigi

arXiv.org Machine Learning Jul-22-2021

Frequently, the parameter space chosen for shape design or other applications that involve the definition of a surrogate model presents subdomains where the objective function of interest is highly regular or well behaved. The function could therefore be approximated more accurately if restricted to those subdomains and studied separately. The drawback of this approach is the possible scarcity of data in some applications; but where the available data are moderately abundant relative to the parameter space dimension and the complexity of the objective function, partitioned or local studies are beneficial. In this work we propose a new method called local active subspaces (LAS), which explores the synergies of active subspaces with supervised clustering techniques in order to perform a more efficient dimension reduction in the parameter space for the design of accurate response surfaces. We also developed a procedure to exploit the local active subspace information for classification tasks. Using this technique as a preprocessing step on the parameter space, or on the output space in the case of vectorial outputs, brings remarkable results for the purpose of surrogate modelling.
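As a rough illustration of the active subspace idea this abstract builds on, the sketch below estimates a global active subspace from sampled gradients by eigendecomposing the matrix C = E[∇f ∇fᵀ]. It is not the authors' LAS implementation: the function name `active_subspace`, the toy objective, and the sample size are assumptions made for the example. A local variant would, presumably, first cluster the samples and then repeat this computation inside each cluster.

```python
import numpy as np

def active_subspace(gradients, dim):
    """Return the top `dim` eigenvalues and eigenvectors of the
    Monte Carlo estimate of C = E[grad f grad f^T]."""
    grads = np.asarray(gradients, dtype=float)
    cov = grads.T @ grads / grads.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    return eigvals[order][:dim], eigvecs[:, order[:dim]]

# Toy objective (hypothetical): f(x) = (w . x)^2, which varies only along w.
rng = np.random.default_rng(0)
w = np.array([3.0, 4.0, 0.0]) / 5.0
X = rng.uniform(-1.0, 1.0, size=(200, 3))
grads = 2.0 * (X @ w)[:, None] * w[None, :]  # analytic gradient of f at each sample

eigvals, W1 = active_subspace(grads, dim=1)
alignment = abs(float(W1[:, 0] @ w))         # close to 1 when W1 recovers w
```

Projecting inputs onto `W1` (here `X @ W1`) gives the reduced coordinates on which a response surface could then be fitted.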

active subspace, dimension, subspace, (16 more...)

arXiv.org Machine Learning

2107.10867

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report (0.50)

Industry: Health & Medicine > Therapeutic Area (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Neural Ordinary Differential Equation Model for Evolutionary Subspace Clustering and Its Applications

Bai, Mingyuan, Choy, S. T. Boris, Zhang, Junping, Gao, Junbin

arXiv.org Artificial Intelligence Jul-22-2021

The neural ordinary differential equation (neural ODE) model has attracted increasing attention in time series analysis for its capability to process irregular time steps, i.e., data are not observed over equally-spaced time intervals. In multi-dimensional time series analysis, a task is to conduct evolutionary subspace clustering, aiming at clustering temporal data according to their evolving low-dimensional subspace structures. Many existing methods can only process time series with regular time steps while time series are unevenly sampled in many situations such as missing data. In this paper, we propose a neural ODE model for evolutionary subspace clustering to overcome this limitation and a new objective function with subspace self-expressiveness constraint is introduced. We demonstrate that this method can not only interpolate data at any time step for the evolutionary subspace clustering task, but also achieve higher accuracy than other state-of-the-art evolutionary subspace clustering methods. Both synthetic and real-world data are used to illustrate the efficacy of our proposed method.

node-escm, subspace, time step, (15 more...)

arXiv.org Artificial Intelligence

2107.10484

Country:

North America > United States (0.14)
Asia > South Korea (0.05)
Asia > Japan (0.05)
(20 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine (0.93)
Law (0.67)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.98)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.91)

Learning Theorem Proving Components

Chvalovský, Karel, Jakubův, Jan, Olšák, Miroslav, Urban, Josef

arXiv.org Artificial Intelligence Jul-21-2021

Saturation-style automated theorem provers (ATPs) based on the given clause procedure are today the strongest general reasoners for classical first-order logic. The clause selection heuristics in such systems are, however, often evaluating clauses in isolation, ignoring other clauses. This has changed recently by equipping the E/ENIGMA system with a graph neural network (GNN) that chooses the next given clause based on its evaluation in the context of previously selected clauses. In this work, we describe several algorithms and experiments with ENIGMA, advancing the idea of contextual evaluation based on learning important components of the graph of clauses.

algorithm, evaluation, proof search, (17 more...)

arXiv.org Artificial Intelligence

2107.10034

Country:

Europe > Czechia > Prague (0.04)
Europe > Austria > Tyrol > Innsbruck (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(12 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

A Survey on Role-Oriented Network Embedding

Jiao, Pengfei, Guo, Xuan, Pan, Ting, Zhang, Wang, Pei, Yulong

arXiv.org Artificial Intelligence Jul-18-2021

Recently, Network Embedding (NE) has become one of the most attractive research topics in machine learning and data mining. NE approaches have achieved promising performance in various graph mining tasks, including link prediction and node clustering and classification. A wide variety of NE methods focus on the proximity of networks. They learn a community-oriented embedding for each node, where the corresponding representations are similar if two nodes are closer to each other in the network. Meanwhile, there is another type of structural similarity, i.e., role-based similarity, which is usually complementary to and completely different from proximity. In order to preserve the role-based structural similarity, the problem of role-oriented NE is raised. However, compared to the community-oriented NE problem, only a few role-oriented embedding approaches have been proposed recently. Although the problem is less explored, considering the importance of roles in analyzing networks and the many applications that role-oriented NE can shed light on, it is necessary and timely to provide a comprehensive overview of existing role-oriented NE methods. In this review, we first clarify the differences between community-oriented and role-oriented network embedding. Afterwards, we propose a general framework for understanding role-oriented NE and a two-level categorization to better classify existing methods. Then, we select some representative methods according to the proposed categorization and briefly introduce them by discussing their motivation, development and differences. Moreover, we conduct comprehensive experiments to empirically evaluate these methods on a variety of role-related tasks including node classification and clustering (role discovery), top-k similarity search and visualization using some widely used synthetic and real-world datasets...

matrix, node, similarity, (15 more...)

arXiv.org Artificial Intelligence

2107.08379

Country:

North America > United States (0.14)
Asia > China > Tianjin Province > Tianjin (0.05)
South America > Brazil (0.04)
(5 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Education (0.67)
Transportation (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.87)

20 Data Science Interview Questions for a Beginner

#artificialintelligence Jul-17-2021, 09:10:39 GMT

Success is a process not an event. Data Science is growing rapidly in all sectors. With the availability of so many technologies within the Data Science domain, it becomes tricky to crack any Data Science interview. In this article, we have tried to cover the most common Data Science interview questions asked by recruiters. Answer: The question can also be phrased as to why linear regression is not a very effective algorithm.

classification, loss function, model performance, (15 more...)

#artificialintelligence

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.96)
(3 more...)

Genetic CFL: Optimization of Hyper-Parameters in Clustered Federated Learning

Agrawal, Shaashwat, Sarkar, Sagnik, Alazab, Mamoun, Maddikunta, Praveen Kumar Reddy, Gadekallu, Thippa Reddy, Pham, Quoc-Viet

arXiv.org Artificial Intelligence Jul-17-2021

Federated learning (FL) is a distributed model for deep learning that integrates client-server architecture, edge computing, and real-time intelligence. FL has the capability of revolutionizing machine learning (ML) but is impractical to implement due to technological limitations, communication overhead, non-IID (independent and identically distributed) data, and privacy concerns. Training an ML model over heterogeneous non-IID data highly degrades the convergence rate and performance. The existing traditional and clustered FL algorithms exhibit two main limitations: inefficient client training and static hyper-parameter utilization. To overcome these limitations, we propose a novel hybrid algorithm, namely genetic clustered FL (Genetic CFL), that clusters edge devices based on the training hyper-parameters and genetically modifies the parameters cluster-wise. Then, we introduce an algorithm that drastically increases the individual cluster accuracy by integrating density-based clustering and genetic hyper-parameter optimization. The results are benchmarked using the MNIST handwritten digit dataset and the CIFAR-10 dataset. The proposed Genetic CFL shows significant improvements and works well with realistic cases of non-IID and ambiguous data.

algorithm, architecture, federated learning, (9 more...)

arXiv.org Artificial Intelligence

2107.07233

Country:

Asia > India (0.04)
Oceania > Australia (0.04)
Asia > South Korea > Busan > Busan (0.04)

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

A New Robust Multivariate Mode Estimator for Eye-tracking Calibration

Brilhault, Adrien, Neuenschwander, Sergio, Rios, Ricardo Araujo

arXiv.org Artificial Intelligence Jul-16-2021

We propose in this work a new method for estimating the main mode of multivariate distributions, with application to eye-tracking calibrations. When performing eye-tracking experiments with poorly cooperative subjects, such as infants or monkeys, the calibration data generally suffer from high contamination. Outliers are typically organized in clusters, corresponding to the time intervals when subjects were not looking at the calibration points. In this type of multimodal distribution, most central tendency measures fail at estimating the principal fixation coordinates (the first mode), resulting in errors and inaccuracies when mapping the gaze to the screen coordinates. Here, we developed a new algorithm to identify the first mode of multivariate distributions, named BRIL, which relies on recursive depth-based filtering. This novel approach was tested on artificial mixtures of Gaussian and Uniform distributions, and compared to existing methods (conventional depth medians, robust estimators of location and scatter, and clustering-based approaches). We obtained outstanding performance, even for distributions containing very high proportions of outliers, both grouped in clusters and randomly distributed. Finally, we demonstrate the strength of our method in a real-world scenario using experimental data from eye-tracking calibrations with Capuchin monkeys, especially for distributions where other algorithms typically lack accuracy.

main cluster, outlier, procedure, (12 more...)

arXiv.org Artificial Intelligence

2107.0803

Country:

North America > Canada > Ontario > Toronto (0.14)
South America > Brazil > Rio Grande do Norte > Natal (0.04)
South America > Brazil > Bahia > Salvador (0.04)
(6 more...)

Genre:

Research Report > New Finding (0.93)
Research Report > Experimental Study (0.68)

Industry: Health & Medicine > Therapeutic Area (0.45)

Technology:

Information Technology > Human Computer Interaction > Interfaces (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

A multi-schematic classifier-independent oversampling approach for imbalanced datasets

Bej, Saptarshi, Schultz, Kristian, Srivastava, Prashant, Wolfien, Markus, Wolkenhauer, Olaf

arXiv.org Artificial Intelligence Jul-15-2021

Over 85 oversampling algorithms, mostly extensions of the SMOTE algorithm, have been built over the past two decades to solve the problem of imbalanced datasets. However, it has been evident from previous studies that different oversampling algorithms have different degrees of efficiency with different classifiers. With numerous algorithms available, it is difficult to decide on an oversampling algorithm for a chosen classifier. Here, we overcome this problem with a multi-schematic and classifier-independent oversampling approach: ProWRAS (Proximity Weighted Random Affine Shadowsampling). ProWRAS integrates the Localized Random Affine Shadowsampling (LoRAS) algorithm and the Proximity Weighted Synthetic oversampling (ProWSyn) algorithm. By controlling the variance of the synthetic samples, as well as a proximity-weighted clustering system of the minority class data, the ProWRAS algorithm improves performance compared to algorithms that generate synthetic samples through modelling high dimensional convex spaces of the minority class. ProWRAS has four oversampling schemes, each of which has its unique way to model the variance of the generated data. Most importantly, the performance of ProWRAS, with proper choice of oversampling schemes, is independent of the classifier used. We have benchmarked our newly developed ProWRAS algorithm against five state-of-the-art oversampling models and four different classifiers on 20 publicly available datasets. ProWRAS outperforms other oversampling algorithms in a statistically significant way, in terms of both F1-score and Kappa-score. Moreover, we have introduced a novel measure for classifier independence, the I-score, and showed quantitatively that ProWRAS performs better, independent of the classifier used. In practice, ProWRAS customizes synthetic sample generation according to a classifier of choice and thereby reduces benchmarking efforts.
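For readers unfamiliar with the SMOTE family that the abstract takes as its baseline, the following is a minimal sketch of SMOTE-style interpolation: each synthetic point lies on the segment between a minority sample and one of its nearest minority neighbours. It is not ProWRAS, LoRAS, or ProWSyn; the function name `smote_like`, the parameter `k`, and the toy data are assumptions for illustration only.

```python
import numpy as np

def smote_like(minority, n_new, k=3, seed=0):
    """Generate `n_new` synthetic minority samples by interpolating
    between each chosen sample and one of its k nearest neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = minority.shape[0]
    # Pairwise Euclidean distances; a sample is never its own neighbour.
    dist = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]          # requires k <= n - 1
    base = rng.integers(0, n, size=n_new)                 # anchor samples
    picks = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                          # interpolation weights in [0, 1)
    return minority[base] + gap * (minority[picks] - minority[base])

# Toy minority class (hypothetical): the four corners of the unit square.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(points, n_new=10)
```

Because every synthetic point is a convex combination of two existing samples, the output stays inside the convex hull of the minority class, which is the property the abstract contrasts ProWRAS against.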

algorithm, classifier, dataset, (16 more...)

arXiv.org Artificial Intelligence

2107.07349

Country:

North America > United States > New York (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Freising (0.04)
(2 more...)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.47)

Industry: Education (0.67)

Technology:

Information Technology > Information Management (0.93)
Information Technology > Data Science > Data Mining (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Colour Quantization Using K-Means Clustering and OpenCV

#artificialintelligence Jul-13-2021, 01:20:53 GMT

Have you ever wondered how to apply a machine learning algorithm to pixel intensity values using the common K-means clustering algorithm? In this method, we generate a compressed variant of a picture with a reduced set of colours. The image is processed at a lower intensity resolution, while the number of pixels stays the same. The procedure is an interesting one, and the linked article walks through it.
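The idea in this blurb can be sketched in a few lines: treat every pixel as a point in RGB space, run k-means, and repaint each pixel with the centre of its cluster. The sketch below uses plain NumPy so that it runs without OpenCV; it is not the article's code, and the function name `quantize_colours`, the iteration count, and the random test image are assumptions made for the example.

```python
import numpy as np

def quantize_colours(image, k, n_iter=10, seed=0):
    """Reduce an H x W x 3 uint8 image to at most k colours using a
    basic k-means (Lloyd's algorithm) over its pixels."""
    rng = np.random.default_rng(seed)
    pixels = image.reshape(-1, 3).astype(float)
    # Start from k distinct pixel positions chosen at random.
    centres = pixels[rng.choice(len(pixels), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign every pixel to its nearest centre (N x k distance matrix,
        # fine for small images, memory-hungry for large ones).
        dist = np.linalg.norm(pixels[:, None, :] - centres[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Move each non-empty cluster's centre to the mean of its pixels.
        for j in range(k):
            members = pixels[labels == j]
            if len(members):
                centres[j] = members.mean(axis=0)
    # Repaint each pixel with the centre of the cluster it was last assigned to.
    return np.rint(centres[labels]).astype(np.uint8).reshape(image.shape)

# Hypothetical input: a random 8 x 8 RGB image standing in for a real picture.
image = np.random.default_rng(1).integers(0, 256, size=(8, 8, 3), dtype=np.uint8)
quantized = quantize_colours(image, k=4)
n_colours = len(np.unique(quantized.reshape(-1, 3), axis=0))
```

The output has the same height and width as the input; only the number of distinct colours drops, to at most `k`.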

algorithm, colour quantization, k-means clustering and opencv

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Repo2Vec: A Comprehensive Embedding Approach for Determining Repository Similarity

Rokon, Md Omar Faruk, Yan, Pei, Islam, Risul, Faloutsos, Michalis

arXiv.org Artificial Intelligence Jul-11-2021

How can we identify similar repositories and clusters among a large online archive, such as GitHub? Determining repository similarity is an essential building block in studying the dynamics and the evolution of such software ecosystems. The key challenge is to determine the right representation for the diverse repository features in a way that: (a) it captures all aspects of the available information, and (b) it is readily usable by ML algorithms. We propose Repo2Vec, a comprehensive embedding approach to represent a repository as a distributed vector by combining features from three types of information sources. As our key novelty, we consider three types of information: (a) metadata, (b) the structure of the repository, and (c) the source code. We also introduce a series of embedding approaches to represent and combine these information types into a single embedding. We evaluate our method with two real datasets from GitHub for a combined 1013 repositories. First, we show that our method outperforms previous methods in terms of precision (93% vs 78%), with nearly twice as many Strongly Similar repositories and 30% fewer False Positives. Second, we show how Repo2Vec provides a solid basis for: (a) distinguishing between malware and benign repositories, and (b) identifying a meaningful hierarchical clustering. For example, we achieve 98% precision and 96% recall in distinguishing malware and benign repositories. Overall, our work is a fundamental building block for enabling many repository analysis functions such as repository categorization by target platform or intention, detecting code-reuse and clones, and identifying lineage and evolution.

repo2vec, repository, vector, (17 more...)

arXiv.org Artificial Intelligence

2107.05112

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining (1.00)
(5 more...)
