AITopics | Clustering


Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)


The Ultimate Scikit-Learn Machine Learning Cheatsheet - KDnuggets

#artificialintelligence | Mar-11-2021, 00:28:31 GMT

A train-test split is an important part of testing how well a model performs: the model is trained on designated training data and tested on designated testing data. This way, the model's ability to generalize to new data can be measured. In sklearn, lists, pandas DataFrames, and NumPy arrays are all accepted for the X and y parameters. Training a standard supervised learning model takes the form of an import, the creation of an instance, and the fitting of the model.
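As a rough illustration of the split-then-fit workflow described in this excerpt, the sketch below uses only the Python standard library. It is not sklearn's implementation: `train_test_split_sketch` and `MajorityClassifier` are hypothetical names invented here, and the toy model only mimics the create-an-instance, then `fit`, then `predict` pattern. In practice one would use sklearn's own `train_test_split` and estimators.

```python
import random
from collections import Counter


def train_test_split_sketch(X, y, test_size=0.25, seed=0):
    """Shuffle the sample indices once, then cut them into test and train parts."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of samples")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


class MajorityClassifier:
    """Toy model (hypothetical): always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


# Usage: split, create an instance, fit on the training part, score on the test part.
X = [[i] for i in range(8)]
y = [0, 0, 0, 0, 0, 1, 1, 1]
X_train, X_test, y_train, y_test = train_test_split_sketch(X, y, test_size=0.25)
model = MajorityClassifier().fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
```

The held-out accuracy is computed only on samples the model never saw during fitting, which is what makes it an estimate of generalization.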

dependence plot, dimension, information, (11 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.31)


DynACPD Embedding Algorithm for Prediction Tasks in Dynamic Networks

Connell, Chris, Wang, Yang

arXiv.org Artificial Intelligence | Mar-11-2021

Classical network embeddings create a low-dimensional representation of the learned relationships between features across nodes. Such embeddings are important for tasks such as link prediction and node classification. In the current paper, we consider low-dimensional embeddings of dynamic networks, that is, a family of time-varying networks where there exist both temporal and spatial link relationships between nodes. We present novel embedding methods for a dynamic network based on higher-order tensor decompositions for tensorial representations of the dynamic network. In one sense, our embeddings are analogous to spectral embedding methods for static networks. We provide a rationale for our algorithms via a mathematical analysis of some potential reasons for their effectiveness. Finally, we demonstrate the power and efficiency of our approach by comparing our algorithms' performance on the link prediction task against an array of current baseline methods across three distinct real-world dynamic networks.

algorithm, decomposition, matrix, (13 more...)

arXiv.org Artificial Intelligence

2103.0708

Country:

North America > United States > Indiana > Monroe County > Bloomington (0.04)
Africa > Senegal > Kolda Region > Kolda (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
(3 more...)

Genre: Research Report (0.82)

Industry:

Health & Medicine (0.68)
Education (0.67)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)


BIKED: A Dataset and Machine Learning Benchmarks for Data-Driven Bicycle Design

Regenwetter, Lyle, Curry, Brent, Ahmed, Faez

arXiv.org Machine Learning | Mar-9-2021

In this paper, we present "BIKED," a dataset of 4500 individually designed bicycle models sourced from hundreds of designers. We expect BIKED to enable a variety of data-driven design applications for bicycles and generally support the development of data-driven design methods. The dataset comprises a variety of design information including assembly images, component images, numerical design parameters, and class labels. In this paper, we first discuss the processing of the dataset and present the various features provided. We then illustrate the scale, variety, and structure of the data using several unsupervised clustering studies. Next, we explore a variety of data-driven applications. We provide baseline classification performance for 10 algorithms trained on differing amounts of training data. We then contrast classification performance of three deep neural networks using parametric data, image data, and a combination of the two. Using one of the trained classification models, we conduct a Shapley Additive Explanations Analysis to better understand the extent to which certain design parameters impact classification predictions. Next, we test bike reconstruction and design synthesis using two Variational Autoencoders (VAEs) trained on images and parametric data. We furthermore contrast the performance of interpolation and extrapolation tasks in the original parameter space and the latent space of a VAE. Finally, we discuss some exciting possibilities for other applications beyond the few actively explored in this paper and summarize overall strengths and weaknesses of the dataset.

bike, dataset, parametric data, (16 more...)

arXiv.org Machine Learning

2103.05844

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > Canada > Ontario > National Capital Region > Ottawa (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment > Sports > Cycling (0.47)
Health & Medicine > Consumer Health (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)


ALMA: Alternating Minimization Algorithm for Clustering Mixture Multilayer Network

Fan, Xing, Pensky, Marianna, Yu, Feng, Zhang, Teng

arXiv.org Machine Learning | Mar-8-2021

The paper considers a Mixture Multilayer Stochastic Block Model (MMLSBM), where layers can be partitioned into groups of similar networks, and networks in each group are equipped with a distinct Stochastic Block Model. The goal is to partition the multilayer network into clusters of similar layers, and to identify communities in those layers. Jing et al. (2020) introduced the MMLSBM and developed a clustering methodology, TWIST, based on regularized tensor decomposition. The present paper proposes a different technique, an alternating minimization algorithm (ALMA), that aims at simultaneous recovery of the layer partition, together with estimation of the matrices of connection probabilities of the distinct layers. Compared to TWIST, ALMA achieves higher accuracy both theoretically and numerically.

assumption, iter, matrix, (14 more...)

arXiv.org Machine Learning

2102.10226

Country:

North America > United States > Florida > Orange County > Orlando (0.14)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
North America > United States > New York (0.04)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)


Cluster-based Input Weight Initialization for Echo State Networks

Steiner, Peter, Jalalvand, Azarakhsh, Birkholz, Peter

arXiv.org Artificial Intelligence | Mar-8-2021

Echo State Networks (ESNs) are a special type of recurrent neural network (RNN), in which the input and recurrent connections are traditionally generated randomly, and only the output weights are trained. Despite the recent success of ESNs in various tasks of audio, image and radar recognition, we postulate that a purely random initialization is not the ideal way of initializing ESNs. The aim of this work is to propose an unsupervised initialization of the input connections using the K-Means algorithm on the training data. We show that this initialization performs as well as or better than a randomly initialized ESN whilst needing significantly fewer reservoir neurons (2000 vs. 4000 for spoken digit recognition, and 300 vs. 8000 neurons for f0 extraction), thus reducing the amount of training time. Furthermore, we discuss how this approach provides the opportunity to estimate a suitable reservoir size based on prior knowledge about the data.
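To make the idea of a K-Means-based initialization concrete, here is a minimal standard-library sketch. It assumes, purely for illustration, that each reservoir neuron's input weight vector is set to one K-Means centroid of the training inputs; the abstract does not spell out the exact mapping or any scaling, so this should not be read as the authors' method. The function names `kmeans` and `input_weights_from_centroids` are hypothetical.

```python
import random


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns k centroids as lists of floats."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its members.
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                dim = len(members[0])
                centroids[j] = [
                    sum(m[d] for m in members) / len(members) for d in range(dim)
                ]
    return centroids


def input_weights_from_centroids(train_inputs, n_neurons, seed=0):
    """Assumption: one centroid per reservoir neuron, used as its input weight row."""
    return kmeans(train_inputs, n_neurons, seed=seed)


# Usage: six 2-D training inputs, a reservoir of two neurons.
train_inputs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
                (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
W_in = input_weights_from_centroids(train_inputs, n_neurons=2)
```

Under this assumption the number of clusters equals the number of reservoir neurons, which is one way to read the abstract's remark that the reservoir size can be estimated from prior knowledge about the data.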

esn, km-esn, neuron, (12 more...)

arXiv.org Artificial Intelligence

2103.0471

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Florida > Broward County > Fort Lauderdale (0.04)
(3 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)


Nishimori meets Bethe: a spectral method for node classification in sparse weighted graphs

Dall'Amico, Lorenzo, Couillet, Romain, Tremblay, Nicolas

arXiv.org Machine Learning | Mar-5-2021

This article unveils a new relation between the Nishimori temperature parametrizing a distribution P and the Bethe free energy on random Erdős-Rényi graphs with edge weights distributed according to P. Because estimating the Nishimori temperature is a task of major importance in Bayesian inference problems, a numerical method is proposed, as a practical corollary of this new relation, to accurately estimate the Nishimori temperature from the eigenvalues of the Bethe Hessian matrix of the weighted graph. The algorithm, in turn, is used to propose a new spectral method for node classification in weighted (possibly sparse) graphs. The superiority of the method over competing state-of-the-art approaches is demonstrated both through theoretical arguments and real-world data experiments.

eigenvalue, graph, nishimori temperature, (17 more...)

arXiv.org Machine Learning

2103.03561

Country:

Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
Asia > Middle East > Jordan (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)


Riskyishness and Pinocchio's Search for a Comprehensive Taxonomy of Autonomous Entities

Wagner, William P. IV, Źakowska, Anna, Aladi, Clement, Santhosh, Joseph

arXiv.org Artificial Intelligence | Mar-5-2021

This paper documents an exploratory pilot study to define the term Autonomous Entity, and any characteristics that are required to identify or classify an Autonomous Entity. Our solution builds on previous work with regard to philosophical and scientific classification methods but focuses on a novel Design Science Research Methodology (DSRM) and model to help identify those characteristics which make any autonomous entity similar to or different from others. We have solved the problem of not having an existing term to define our lens by creating a new combinatorial term: "Riskyishness". We present a DSRM and instrument for initial investigation, as well as observational and statistical descriptions of their use in the real world to solicit domain expertise and statistical evidence. Further, we demonstrate a specific application of the methodology by creating a second artifact - a tool to score existing and future technologies based on Riskyishness. The first artifact also provides a technique to disentangle miscellaneous existing technologies or add dimensions to the tools to capture future additions and paradigm shifts.

dimension, helpful desc, taxonomy, (13 more...)

arXiv.org Artificial Intelligence

2103.03482

Country:

North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > California > Ventura County > Thousand Oaks (0.04)
Asia > South Korea > Seoul > Seoul (0.04)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (0.67)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Robots (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)


Vicinal and categorical domain adaptation

Tang, Hui, Jia, Kui

arXiv.org Machine Learning | Mar-4-2021

Unsupervised domain adaptation aims to learn a task classifier that performs well on the unlabeled target domain, by utilizing the labeled source domain. Inspiring results have been acquired by learning domain-invariant deep features via domain-adversarial training. However, its parallel design of task and domain classifiers limits the ability to achieve a finer category-level domain alignment. To promote categorical domain adaptation (CatDA), based on a joint category-domain classifier, we propose novel losses of adversarial training at both domain and category levels. Since the joint classifier can be regarded as a concatenation of individual task classifiers respectively for the two domains, our design principle is to enforce consistency of category predictions between the two task classifiers. Moreover, we propose a concept of vicinal domains whose instances are produced by a convex combination of pairs of instances respectively from the two domains. Intuitively, alignment of the possibly infinite number of vicinal domains enhances that of original domains. We propose novel adversarial losses for vicinal domain adaptation (VicDA) based on CatDA, leading to Vicinal and Categorical Domain Adaptation (ViCatDA). We also propose Target Discriminative Structure Recovery (TDSR) to recover the intrinsic target discrimination damaged by adversarial feature alignment. We also analyze the principles underlying the ability of our key designs to align the joint distributions. Extensive experiments on several benchmark datasets demonstrate that we achieve the new state of the art.
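The abstract describes vicinal-domain instances as convex combinations of one instance from each domain. A minimal sketch of that single operation follows; the function name `vicinal_instance` and the mixing coefficient `lam` are hypothetical labels chosen here, and how the paper actually selects the coefficient or pairs the instances is not stated in the abstract.

```python
def vicinal_instance(x_source, x_target, lam):
    """Convex combination lam * x_source + (1 - lam) * x_target, element-wise."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1] for a convex combination")
    if len(x_source) != len(x_target):
        raise ValueError("instances must have the same dimensionality")
    return [lam * s + (1.0 - lam) * t for s, t in zip(x_source, x_target)]


# Usage: each value of lam in [0, 1] defines one intermediate (vicinal) domain.
x_s = [0.0, 0.0]   # a source-domain feature vector (toy values)
x_t = [2.0, 4.0]   # a target-domain feature vector (toy values)
midpoint = vicinal_instance(x_s, x_t, lam=0.5)
```

Because `lam` ranges over a continuum, there are infinitely many such intermediate domains, which matches the abstract's reference to a "possibly infinite number of vicinal domains".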

adaptation, classifier, domain adaptation, (16 more...)

arXiv.org Machine Learning

doi: 10.1016/j.patcog.2021.107907

2103.0346

Country:

Asia > Middle East > Jordan (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
Asia > Singapore (0.04)
Asia > China > Guangdong Province > Guangzhou (0.04)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)


Clustering multilayer graphs with missing nodes

Braun, Guillaume, Tyagi, Hemant, Biernacki, Christophe

arXiv.org Machine Learning | Mar-4-2021

Relationships between agents can be conveniently represented by graphs. When these relationships have different modalities, they are better modelled by multilayer graphs where each layer is associated with one modality. Such graphs arise naturally in many contexts including biological and social networks. Clustering is a fundamental problem in network analysis where the goal is to group nodes with similar connectivity profiles. In the past decade, various clustering methods have been extended from the unilayer setting to multilayer graphs in order to incorporate the information provided by each layer. While most existing works assume - rather restrictively - that all layers share the same set of nodes, we propose a new framework that allows for layers to be defined on different sets of nodes. In particular, the nodes not recorded in a layer are treated as missing. Within this paradigm, we investigate several generalizations of well-known clustering methods in the complete setting to the incomplete one and prove some consistency results under the Multi-Layer Stochastic Block Model assumption. Our theoretical results are complemented by thorough numerical comparisons between our proposed algorithms on synthetic data, and also on real datasets, thus highlighting the promising behaviour of our methods in various settings.

matrix, node, probability, (14 more...)

arXiv.org Machine Learning

2103.03235

Country:

North America > United States (0.28)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine (0.94)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.86)


Chemistry-informed Macromolecule Graph Representation for Similarity Computation and Supervised Learning

Mohapatra, Somesh, An, Joyce, Gómez-Bombarelli, Rafael

arXiv.org Machine Learning | Mar-3-2021

Macromolecules are large, complex molecules composed of covalently bonded monomer units, existing in different stereochemical configurations and topologies. As a result of such chemical diversity, representing, comparing, and learning over macromolecules emerge as critical challenges. To address this, we developed a macromolecule graph representation, with monomers and bonds as nodes and edges, respectively. We captured the inherent chemistry of the macromolecule by using molecular fingerprints for node and edge attributes. For the first time, we demonstrated computation of chemical similarity between two macromolecules of varying chemistry and topology, using exact graph edit distances and graph kernels. We also trained graph neural networks for a variety of glycan classification tasks, achieving state-of-the-art results. Our work has two-fold implications: it provides a general framework for representation, comparison, and learning of macromolecules, and it enables quantitative chemistry-informed decision-making and iterative design in the macromolecular chemical space. Macromolecules are ubiquitous and indispensable, from constituting what we are made up of to being present in almost everything we use. As biological macromolecules, they form the basis of life, serving as drivers of survival and growth functions. As synthetic macromolecules, humans have engineered the composition and topology to design structural components, sensors, shape-memory materials, and drugs, to encode messages, and much more (Lutz et al., 2016; Romio et al., 2020; Boydston et al., 2020; Thompson & Korley, 2020).

glycan, model architecture, roc-auc curve, (15 more...)

arXiv.org Machine Learning

2103.02565

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > California > Los Angeles County > Pasadena (0.04)
Europe > France (0.04)

Genre: Research Report (0.40)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)
