 Clustering


Nice Generalization of the K-NN Clustering Algorithm -- Also Useful for Data Reduction

@machinelearnbot

You don't need to know K-NN to understand this article, and you don't need a background in statistical science either. Let's describe this new algorithm and its various components in simple English. We are dealing here with a supervised learning problem, and more specifically, clustering (also called supervised classification). In particular, we want to assign a class label to a new observation that does not belong to the training set. Instead of checking individual points (the nearest neighbors) and using a majority (voting) rule to assign the new observation to a cluster based on nearest-neighbor counts, we check cliques of points and focus on the nearest cliques rather than on the nearest points. The cliques considered here are defined by circles (in two dimensions) or spheres (in three dimensions).
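For reference, the baseline that the article generalizes (nearest neighbors plus a majority vote) can be sketched as follows. This is only a minimal illustration of plain K-NN, not the clique-based algorithm itself; the function name `knn_classify`, the Euclidean distance, and the toy data are assumptions made for the example.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to `query`."""
    # Rank training indices by Euclidean distance to the query (assumed metric).
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    # Majority (voting) rule over the k nearest neighbors.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy training set, made up for illustration: two well-separated groups.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(points, labels, (0.3, 0.2)))
print(knn_classify(points, labels, (4.9, 5.3)))
```

A clique-based variant would replace the ranking over individual points with a ranking over groups of points; the snippet above does not attempt that.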


Comparing Distance Measurements with Python and SciPy

@machinelearnbot

Clustering, or cluster analysis, is used to analyze data that does not include pre-labeled classes. Data instances are grouped by maximizing intraclass similarity and minimizing the similarity between differing classes. In practice, the clustering algorithm identifies and groups instances that are very similar, as opposed to ungrouped instances, which are much less similar to one another. Because clustering does not require pre-labeled classes, it is a form of unsupervised learning. At the core of cluster analysis is the concept of measuring distances between data points across their various dimensions.
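As a small illustration of how distance measurements can differ for the same pair of points, the sketch below uses three functions from `scipy.spatial.distance`. The two vectors are made up for the example, and the choice of these three metrics is an assumption rather than the article's own selection.

```python
import numpy as np
from scipy.spatial import distance

# Two made-up data points with three dimensions each.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Per-dimension absolute differences are 3, 4 and 0.
d_euclid = distance.euclidean(a, b)     # sqrt(3**2 + 4**2 + 0**2) = 5
d_manhattan = distance.cityblock(a, b)  # 3 + 4 + 0 = 7
d_chebyshev = distance.chebyshev(a, b)  # max(3, 4, 0) = 4

print("Euclidean:", d_euclid)
print("Manhattan:", d_manhattan)
print("Chebyshev:", d_chebyshev)
```

The same pair of points is 5, 7, or 4 units apart depending on the metric, which is why the choice of distance can change which instances a clustering algorithm treats as similar.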


Machine Learning Applications in Credit Risk

#artificialintelligence

Typical decisions:
• Grant or deny credit to new applicants
• Increase or decrease spending limits
• Increase or decrease lending rates
• What new products can be given to existing applicants?

Step 2: Assign every entity to its closest medoid (using the distance matrix we have calculated).
Step 3: For each cluster, identify the observation that would yield the lowest average distance if it were to be re-assigned as the medoid. If such an observation exists, make it the new medoid.
Step 4: If at least one medoid has changed, return to step 2. Otherwise, end the algorithm.
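Steps 2 to 4 above can be sketched in a few lines of Python. The excerpt does not include the initialization step, so this sketch simply assumes the first k observations as starting medoids; the Euclidean distance, the function name `k_medoids`, and the toy data are likewise assumptions for illustration.

```python
import math


def k_medoids(points, k, max_iter=100):
    """Iterate steps 2-4 of the medoid procedure; returns (medoid indices, assignment)."""
    n = len(points)
    # Pairwise distance matrix (Euclidean is an assumption; any metric would do).
    dist = [[math.dist(p, q) for q in points] for p in points]
    # ASSUMPTION: the initialization step is not given in the text,
    # so the first k observations are used as starting medoids.
    medoids = list(range(k))

    def assign(current):
        # Step 2: assign every entity to its closest medoid.
        return [min(current, key=lambda m: dist[i][m]) for i in range(n)]

    for _ in range(max_iter):
        assignment = assign(medoids)
        # Step 3: in each cluster, find the member with the lowest total
        # (equivalently, lowest average) distance to the cluster's members.
        new_medoids = []
        for m in medoids:
            members = [i for i in range(n) if assignment[i] == m]
            if not members:  # guard for duplicate points sharing a location
                new_medoids.append(m)
                continue
            new_medoids.append(
                min(members, key=lambda c: sum(dist[c][j] for j in members)))
        # Step 4: stop once no medoid has changed.
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


# Toy data, made up for illustration: two compact groups.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, assignment = k_medoids(data, k=2)
print(medoids, assignment)
```

Because medoids are always actual observations, the procedure needs only the distance matrix, which makes it usable with any dissimilarity measure between entities.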


Mahalanobis Distance Informed by Clustering

arXiv.org Machine Learning

A fundamental question in data analysis, machine learning and signal processing is how to compare data points. The choice of the distance metric is especially challenging for high-dimensional data sets, where the problem of meaningfulness is more prominent (e.g. the Euclidean distance between images). In this paper, we propose to exploit a property of high-dimensional data that is usually ignored: the structure stemming from the relationships between the coordinates. Specifically, we show that organizing similar coordinates in clusters can be exploited for the construction of the Mahalanobis distance between samples. When the observable samples are generated by a nonlinear transformation of hidden variables, the Mahalanobis distance allows the recovery of the Euclidean distances in the hidden space. We illustrate the advantage of our approach on a synthetic example where the discovery of clusters of correlated coordinates improves the estimation of the principal directions of the samples. Our method was applied to real gene expression data for lung adenocarcinomas (lung cancer). Using the proposed metric, we found a partition of subjects into risk groups with good separation between their Kaplan-Meier survival plots.


Model-Based Multiple Instance Learning

arXiv.org Machine Learning

While Multiple Instance (MI) data are point patterns -- sets or multi-sets of unordered points -- appropriate statistical point pattern models have not been used in MI learning. This article proposes a framework for model-based MI learning using point process theory. Likelihood functions for point pattern data derived from point process theory enable principled yet conceptually transparent extensions of learning tasks, such as classification, novelty detection and clustering, to point pattern data. Furthermore, tractable point pattern models as well as solutions for learning and decision making from point pattern data are developed.


Co-Clustering Can Provide Industrial Data Pattern Discovery

#artificialintelligence

In spite of the rapid development in data acquisition technology resulting in the explosive collection of acquired datasets, techniques such as data organization and classification, manipulation, and analysis of very large, diverse, heterogeneous datasets have only evolved modestly. This has hindered the effective use and better understanding of the acquired, large-scale data for knowledge discovery. In an industrial setting, an interesting visual from McKinsey illustrates that despite collecting data from tens of thousands of sensors, less than 1% is actually utilized. Data clustering is the classification of data objects into different groups (clusters) such that data objects in one group are similar to one another and dissimilar from those in other groups. Typically, homogeneous data objects, i.e. data objects having the same data type, are grouped together using some of the well-known clustering algorithms.


Machine Learning: An In-Depth Guide – Unsupervised Learning, Related Fields, and Machine Learning in Practice

#artificialintelligence

Welcome to the fifth and final article in a five-part series about machine learning. In this final article, we will revisit unsupervised learning in greater depth, briefly discuss other fields related to machine learning, and finish the series with some examples of real-world machine learning applications. Recall that unsupervised learning involves learning from data, but without the goal of prediction. This is because the data is either not given with a target response variable (label), or one chooses not to designate a response. It can also be used as a pre-processing step for supervised learning.


Multilayer Spectral Graph Clustering via Convex Layer Aggregation: Theory and Algorithms

arXiv.org Machine Learning

Multilayer graphs provide a framework for representing multiple types of relations between entities, represented as nodes. In a multilayer graph, each layer describes a specific type of relation among pairs of nodes that are shared across layers. For example, in multi-relational social networks, two layers might correspond to friendship relations and business relations, respectively. In temporal networks, each layer might correspond to a snapshot of the entire network at a sampled time instant. Multilayer graphs can be incorporated into many signal processing and data mining techniques, including inference of mixture models [1], [2], tensor decomposition [3], information extraction [4], multi-view learning and processing [5], graph wavelet transforms [6], principal component analysis and dictionary learning [7], [8], anomaly detection [9], and community detection [10], [11], among others. The objective of multilayer graph clustering is to find a consensus cluster assignment on each node in the common node set by combining connectivity patterns in each layer.


[P] KMin - Clustering algorithm • r/MachineLearning

@machinelearnbot

In cases where an L1-norm or L-infinity norm better describes distance, this could be useful. For example, dealing with a square-grid pattern in city streets may yield better results when using scaled geographic coordinates. K-means is effectively an algorithm that considers all points around each cluster center to be distributed around that point according to an N-dimensional normal distribution with a constant diagonal covariance and no correlations. This works well when your clusters are roughly circular in shape (which corresponds to the L2 norm of Euclidean space). If your cluster patterns were squares, cubes or hypercubes, an L-infinity norm would work better, and likewise an L1-norm would work better for diamond shapes.
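To make the effect of the norm concrete, the sketch below assigns one point to its nearest cluster center under the L1, L2 and L-infinity norms. It is not the KMin algorithm from the post, only an illustration of the assignment step; the function names and the coordinates are made up so that each norm picks a different center.

```python
def norm_distance(p, q, norm):
    """Distance between p and q under the L1, L2 or L-infinity norm."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if norm == "l1":
        return sum(diffs)                      # diamond-shaped unit ball
    if norm == "l2":
        return sum(d * d for d in diffs) ** 0.5  # circular unit ball
    if norm == "linf":
        return max(diffs)                      # square unit ball
    raise ValueError(f"unknown norm: {norm}")


def nearest_center(point, centers, norm):
    """Index of the center closest to `point` under the given norm."""
    return min(range(len(centers)),
               key=lambda i: norm_distance(point, centers[i], norm))


# Made-up centers: one on the diagonal, one on an axis, one in between.
centers = [(3.0, 3.0), (0.0, 4.1), (1.5, 3.5)]
point = (0.0, 0.0)

for norm in ("l1", "l2", "linf"):
    print(norm, "->", nearest_center(point, centers, norm))
```

The axis-aligned center wins under L1, the diagonal center wins under L-infinity, and the intermediate one wins under L2, so the cluster a point lands in can depend entirely on the norm chosen.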