 Clustering


EC3: Combining Clustering and Classification for Ensemble Learning

arXiv.org Machine Learning

Classification and clustering algorithms have each proved successful in different contexts, and each has its own advantages and limitations. For instance, although classification algorithms are more powerful than clustering methods in predicting class labels of objects, they do not perform well when there is a lack of sufficient reliable, manually labeled data. On the other hand, although clustering algorithms do not produce label information for objects, they provide supplementary constraints (e.g., if two objects are clustered together, it is more likely that the same label is assigned to both of them) that one can leverage for label prediction of a set of unknown objects. Therefore, systematic utilization of both types of algorithms together can lead to better prediction performance. In this paper, we propose a novel algorithm, called EC3, that merges classification and clustering in order to support both binary and multi-class classification. EC3 is based on a principled combination of multiple classification and multiple clustering methods using an optimization function. We theoretically show the convexity and optimality of the problem and solve it by the block coordinate descent method. We additionally propose iEC3, a variant of EC3 that handles imbalanced training data. We perform an extensive experimental analysis by comparing EC3 and iEC3 with 14 baseline methods (7 well-known standalone classifiers, 5 ensemble classifiers, and 2 existing methods that merge classification and clustering) on 13 standard benchmark datasets. We show that our methods outperform the baselines on every dataset, achieving up to 10% higher AUC. Moreover, our methods are faster (1.21 times faster than the best baseline) and more resilient to noise and class imbalance than the best baseline method.
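The constraint mentioned in the abstract (objects clustered together are more likely to share a label) can be illustrated with a toy sketch. This is not the EC3 objective or its block coordinate descent solver; it simply blends each object's classifier probabilities with the mean of its cluster. The function name `smooth_with_clusters`, the blending weight `alpha`, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict


def smooth_with_clusters(probs, cluster_ids, alpha=0.5):
    """Blend each object's class probabilities with its cluster's mean.

    probs       -- list of per-object class-probability lists (from any classifier)
    cluster_ids -- list of cluster assignments (from any clustering method)
    alpha       -- hypothetical weight given to the cluster-level evidence
    """
    groups = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        groups[cluster].append(index)

    smoothed = [None] * len(probs)
    for members in groups.values():
        n_classes = len(probs[members[0]])
        # Mean class distribution over all objects in this cluster.
        mean = [
            sum(probs[i][k] for i in members) / len(members)
            for k in range(n_classes)
        ]
        for i in members:
            smoothed[i] = [
                (1 - alpha) * probs[i][k] + alpha * mean[k]
                for k in range(n_classes)
            ]
    return smoothed


# Object 2 is weakly predicted as class 1, but its cluster-mates are
# confidently class 0, so the cluster evidence pulls it toward class 0.
probs = [[0.9, 0.1], [0.8, 0.2], [0.45, 0.55], [0.2, 0.8], [0.1, 0.9]]
cluster_ids = [0, 0, 0, 1, 1]
smoothed = smooth_with_clusters(probs, cluster_ids)
labels = [max(range(2), key=lambda k: p[k]) for p in smoothed]
```

In this toy data the third object's label moves from class 1 to class 0 once cluster membership is taken into account, which is the kind of supplementary signal the abstract describes.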


Faster Clustering via Non-Backtracking Random Walks

arXiv.org Machine Learning

This paper presents VEC-NBT, a variation on the unsupervised graph clustering technique VEC, which improves upon the performance of the original algorithm significantly for sparse graphs. VEC employs a novel application of the state-of-the-art word2vec model to embed a graph in Euclidean space via random walks on the nodes of the graph. In VEC-NBT, we modify the original algorithm to use a non-backtracking random walk instead of the normal backtracking random walk used in VEC. We introduce a modification to a non-backtracking random walk, which we call a begrudgingly-backtracking random walk, and show empirically that using this model of random walks for VEC-NBT requires shorter walks on the graph to obtain results with comparable or greater accuracy than VEC, especially for sparser graphs.
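As a rough sketch of the random-walk idea, the snippet below generates a walk that never immediately returns to the node it just left, except when it has no other choice. The abstract does not define the begrudgingly-backtracking walk precisely, so the dead-end rule here is an assumption, and the function name `nbt_walk` and the adjacency-dict graph format are hypothetical. The word2vec embedding step of VEC is not shown.

```python
import random


def nbt_walk(adj, start, length, rng):
    """Return a walk of up to `length` nodes that avoids immediate backtracking.

    adj -- dict mapping each node to a list of its neighbors
    rng -- a random.Random instance, passed in for reproducibility
    """
    walk = [start]
    prev = None
    while len(walk) < length:
        current = walk[-1]
        neighbors = adj[current]
        if not neighbors:
            break  # isolated node: the walk cannot continue
        # Exclude the node we just came from.
        candidates = [n for n in neighbors if n != prev]
        if not candidates:
            # Assumed "begrudging" rule: backtrack only at a dead end.
            candidates = neighbors
        prev = current
        walk.append(rng.choice(candidates))
    return walk


# Path graph 0 - 1 - 2 - 3: the walk can only turn around at the ends.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
walk = nbt_walk(adj, start=0, length=8, rng=random.Random(0))
```

On the path graph every step has exactly one allowed move, so the walk sweeps from one end to the other and reverses only at the endpoints, covering the graph in fewer steps than a walk that is free to bounce back and forth.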


Robust Task Clustering for Deep Many-Task Learning

arXiv.org Machine Learning

We investigate task clustering for deep-learning based multi-task and few-shot learning in a many-task setting. We propose a new method to measure task similarities with cross-task transfer performance matrix for the deep learning scenario. Although this matrix provides us critical information regarding similarity between tasks, its asymmetric property and unreliable performance scores can affect conventional clustering methods adversely. Additionally, the uncertain task-pairs, i.e., the ones with extremely asymmetric transfer scores, may collectively mislead clustering algorithms to output an inaccurate task-partition. To overcome these limitations, we propose a novel task-clustering algorithm by using the matrix completion technique. The proposed algorithm constructs a partially-observed similarity matrix based on the certainty of cluster membership of the task-pairs. We then use a matrix completion algorithm to complete the similarity matrix. Our theoretical analysis shows that under mild constraints, the proposed algorithm will perfectly recover the underlying "true" similarity matrix with a high probability. Our results show that the new task clustering method can discover task clusters for training flexible and superior neural network models in a multi-task learning setup for sentiment classification and dialog intent classification tasks. Our task clustering approach also extends metric-based few-shot learning methods to adapt multiple metrics, which demonstrates empirical advantages when the tasks are diverse.


All the news – Andrew Thompson – Medium

@machinelearnbot

I recently curated this dataset to explore some algorithmic approximation of the categories that make up our news, a thing that at different times I have both read and created. If you had tens of thousands of articles from a spread of outlets that seem more or less representative of our national news landscape and you turned them into structured data, and you put a gun to that data's head and coerced it into groups, what would those groups be? I decided the best balance of simplicity and efficacy would be to use unsupervised clustering methods and let the data sort itself, however crudely (and categories, no matter what algorithm they're derived from, will almost always be crude, as there's no reason the media can't be infinitesimally taxonomized). For a variety of reasons (local memory constraints, ability, recommendations from those more learned), I chose to run a bag-of-words through KMeans -- in other words, if every word becomes its own dimension and each article a single datapoint, what clusters of articles will form? If those words already bore you and you're itching to skip to the "so what" and/or don't care about code, scroll down until you see bold letters telling you not to. The code is here if anyone wants to peer-review this and tell me if/where I screwed up and/or give me suggestions.
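To make the "every word becomes its own dimension and each article a single datapoint" idea concrete, here is a minimal pure-Python sketch. It is not the author's code: the four sample headlines, the helper names `bag_of_words` and `kmeans`, and the fixed choice of initial centers are assumptions made for illustration, and a real run over tens of thousands of articles would use a proper library implementation.

```python
from collections import Counter


def bag_of_words(docs):
    """Turn each document into a vector of word counts over a shared vocabulary."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, init_indices, n_iter=10):
    """Plain Lloyd's algorithm, seeded with the vectors at `init_indices`."""
    centers = [[float(x) for x in vectors[i]] for i in init_indices]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: sq_dist(v, centers[k]))
            for v in vectors
        ]
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


docs = [
    "senate votes on budget bill",
    "house passes budget bill",
    "team wins championship game",
    "coach praises team after game",
]
vocab, vectors = bag_of_words(docs)
labels = kmeans(vectors, init_indices=[0, 2])
```

With these toy headlines the two politics items and the two sports items end up in separate clusters because each pair shares words the other pair lacks.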


Dynamic Tensor Clustering

arXiv.org Machine Learning

Dynamic tensor data are becoming prevalent in numerous applications. Existing tensor clustering methods either fail to account for the dynamic nature of the data, or are inapplicable to a general-order tensor. Also there is often a gap between statistical guarantee and computational efficiency for existing tensor clustering solutions. In this article, we aim to bridge this gap by proposing a new dynamic tensor clustering method, which takes into account both sparsity and fusion structures, and enjoys strong statistical guarantees as well as high computational efficiency. Our proposal is based upon a new structured tensor factorization that encourages both sparsity and smoothness in parameters along the specified tensor modes. Computationally, we develop a highly efficient optimization algorithm that benefits from substantial dimension reduction. In theory, we first establish a non-asymptotic error bound for the estimator from the structured tensor factorization. Built upon this error bound, we then derive the rate of convergence of the estimated cluster centers, and show that the estimated clusters recover the true cluster structures with a high probability. Moreover, our proposed method can be naturally extended to co-clustering of multiple modes of the tensor data. The efficacy of our approach is illustrated via simulations and a brain dynamic functional connectivity analysis from an Autism spectrum disorder study.


GALILEO: A Generalized Low-Entropy Mixture Model

arXiv.org Machine Learning

We present a new method of generating mixture models for data with categorical attributes. The keys to this approach are an entropy-based density metric in categorical space and annealing of high-entropy/low-density components from an initial state with many components. Pruning of low-density components using the entropy-based density allows GALILEO to consistently find high-quality clusters and the same optimal number of clusters. GALILEO has shown promising results on a range of test datasets commonly used for categorical clustering benchmarks. We demonstrate that the scaling of GALILEO is linear in the number of records in the dataset, making this method suitable for very large categorical datasets.


Applications of Trajectory Data in Transportation: Literature Review and Maryland Case Study

arXiv.org Machine Learning

This paper considers applications of trajectory data in transportation, and makes two primary contributions. First, it provides a comprehensive literature review detailing ways in which trajectory data has been used for transportation systems analysis, distilling existing research into the following six areas: demand estimation, modeling human behavior, designing public transit, measuring and predicting traffic performance, quantifying environmental impact, and safety analysis. Additionally, it presents innovative applications of trajectory data for the state of Maryland, employing visualization and machine learning techniques to extract value from 20 million GPS traces. These visual analytics will be implemented in the Regional Integrated Transportation Information System (RITIS), which provides free data sharing and visual analytics tools to help transportation agencies attain situational awareness, evaluate performance, and share insights with the public.


Introduction to Clustering and Unsupervised Learning PACKT Books

#artificialintelligence

The act of clustering, or spotting patterns in data, is not much different from spotting patterns in groups of people. Before jumping into action, we'll begin by taking an in-depth look at exactly what clustering entails. Clustering is an unsupervised machine learning task that automatically divides the data into clusters, or groups of similar items. It does this without having been told how the groups should look ahead of time. As we may not even know what we're looking for, clustering is used for knowledge discovery rather than prediction. It provides an insight into the natural groupings found within data.


Predict the future with Machine Learning

#artificialintelligence

Machine Learning (ML) has some hefty gravitational force in the software development world at the moment. But what exactly is it? In this post I'll take a top-down approach, attempting to make crystal clear what it is and what it can be used for in the real world. Machine Learning is a branch of Artificial Intelligence. Fundamentally, it is software that works like our brain, learning from information (data), then applying it to make smart decisions. Machine Learning algorithms can improve software (a robot) and its ability to solve problems through gaining experience.


What is Machine Learning?

#artificialintelligence

Machine learning is perhaps the principal technology behind two emerging domains: data science and artificial intelligence. The rise of machine learning is coming about through the availability of data and computation, but machine learning methodologies are fundamentally dependent on models. The emergence of machine learning is closely tied to the emergence of widely available data. Large amounts of data and high interconnection bandwidth mean that we receive much of our information about the world around us through computers. Economists try to measure productivity; one of the ways we can become more productive is by becoming more efficient, for example by moving from gathering food to settled agriculture. In the modern era, one approach to becoming more efficient is automation of processes like manufacturing production lines. The manufacturing process is decomposed into a series of mechanical or manual processes, each of which is applied sequentially. Manufacturing processes consist of production lines and robotic automation. Logistics can also be decomposed into the supply chain processes. Whether it's manufacturing or logistics, efficiency can be improved by automating components of the processes to improve the flow of goods. An interesting challenge for modern society is the management of both the flow of goods and the flow of information. The flow of information is also highly automated. Processing of data is decomposed into stages in computer code. In these processing pipelines, whether manufacturing, logistics or data management, the overall pipeline normally also requires human intervention from an operator. These interventions can create bottlenecks and slow the process of automation. Machine learning is the key technology in automating these manual stages. The human interventions that were easy to replicate with technology have already been replaced. The components that still require human intervention are the knottier problems. Often they represent components that are difficult, or impossible, to decompose into stages which could then be further automated. In that sense these components are process-atoms. In manufacturing or logistics settings these atoms involve the sort of flexible manual skills that we cannot replicate with current robotic technology.