 Clustering


Cluster-based trajectory segmentation with local noise

arXiv.org Artificial Intelligence

We present a framework for partitioning a spatial trajectory into a sequence of segments based on spatial density and temporal criteria. The result is a set of temporally separated clusters interleaved with sub-sequences of unclustered points. A major novelty is the proposal of an outlier or noise model based on the distinction between intra-cluster noise (local noise) and inter-cluster noise (transition): the local noise models the temporary absence from a residence, while the transition models the definitive departure towards the next residence. We analyze in detail the properties of the model and present a comprehensive solution for the extraction of temporally ordered clusters. The effectiveness of the solution is evaluated first qualitatively and then quantitatively by contrasting the segmentation with ground truth. The ground truth consists of a set of trajectories of labeled points simulating animal movement. Moreover, we show that the approach can streamline the discovery of additional derived patterns, by presenting a novel technique for the analysis of periodic movement. From a methodological perspective, a valuable aspect of this research is that it combines the theoretical investigation with the application and external validation of the segmentation framework. This paves the way to an effective deployment of the solution in broad and challenging fields such as e-science.


Using Quantum Mechanics to Cluster Time Series

arXiv.org Machine Learning

In this article we present a method by which we can reduce a time series into a single point in $\mathbb{R}^{13}$. We have chosen 13 dimensions so as to prevent too many points from being labeled as "noise." When using a Euclidean (or Mahalanobis) metric, a simple clustering algorithm will with near certainty label the majority of points as "noise." On pure physical considerations, this is not possible. Included in our 13 dimensions are four parameters which describe the coefficients of a cubic polynomial attached to a Gaussian picking up a general trend, four parameters picking up periodicity in a time series, two each for amplitude of a wave and period of a wave, and the final five report the "leftover" noise of the detrended and aperiodic time series. Of the final five parameters, four are the centralized probabilistic moments, and the last is for the relative size of the series. The first main contribution of this work is to apply a theorem of quantum mechanics about the completeness of the solutions to the quantum harmonic oscillator on $L^2(\mathbb{R})$ to estimating trends in time series. The second main contribution is the method of fitting parameters. After many numerical trials, we realized that methods such as Newton-Raphson and Levenberg-Marquardt converge extremely fast if the initial guess is good. Thus we guessed many initial points in our parameter space and computed only a few iterations, a technique common in Keogh's work on time series clustering. Finally, we have produced a model which gives incredibly accurate results quickly. We acknowledge that there are faster methods as well as more accurate methods, but this work shows that we can still increase computation speed with little, if any, cost to accuracy in the sense of data clustering.
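The "many initial guesses, few iterations" idea in the abstract above can be illustrated on a toy problem. The following is a minimal sketch, not the authors' method: it fits a single frequency parameter of a noise-free sine wave with a few Gauss-Newton updates (a swapped-in, undamped relative of Levenberg-Marquardt) from each point of a coarse grid and keeps the best candidate. The model, the data, the grid, and all function names are assumptions made for illustration.

```python
import math

# Synthetic series (assumed for illustration): y = sin(true_w * t), no noise.
true_w = 2.1
ts = [0.1 * i for i in range(50)]
ys = [math.sin(true_w * t) for t in ts]


def sse(w):
    """Sum of squared errors of the model sin(w * t) against the data."""
    return sum((y - math.sin(w * t)) ** 2 for t, y in zip(ts, ys))


def gauss_newton(w, n_iter=3):
    """Run only a few Gauss-Newton updates on the single parameter w."""
    for _ in range(n_iter):
        residuals = [y - math.sin(w * t) for t, y in zip(ts, ys)]
        jac = [t * math.cos(w * t) for t in ts]  # d/dw of sin(w * t)
        denom = sum(j * j for j in jac)
        if denom == 0.0:
            break
        w += sum(r * j for r, j in zip(residuals, jac)) / denom
    return w


# Many cheap starts on a coarse grid; keep the candidate with the lowest error.
starts = [0.25 * i for i in range(1, 21)]  # 0.25, 0.5, ..., 5.0
candidates = [gauss_newton(w0) for w0 in starts]
best_w = min(candidates, key=sse)
print(round(best_w, 4))
```

Starts far from the true frequency may wander to poor values, but that does not matter here: only the candidate with the smallest error is kept, and a start close to the true value converges within the three iterations allowed.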


Residential Transformer Overloading Risk Assessment Using Clustering Analysis

arXiv.org Artificial Intelligence

The residential transformer population is a critical type of asset that many electric utility companies have been attempting to manage proactively and effectively to reduce unexpected failures and life losses that are often caused by transformer overloading. Within the typical power asset portfolio, the residential transformer asset is often large in population, has the lowest-reliability design, lacks transformer loading data, and is susceptible to customer loading behaviors such as the adoption of distributed energy resources and electric vehicles. On the bright side, the availability of more residential operation data, along with the advancement of data analytics techniques, has provided a new path to further our statistical understanding of local residential transformer overloading risks. This research developed a new data-driven method that combines clustering analysis with the simulation of transformer temperature rise and insulation life loss to quantitatively and statistically assess the overloading risk of the residential transformer population in one area and to suggest proper risk management measures according to the assessment results. Case studies from an actual Canadian utility company are presented and discussed in detail to demonstrate the applicability and usefulness of the proposed method.


Spectral clustering algorithms for the detection of clusters in block-cyclic and block-acyclic graphs

arXiv.org Machine Learning

We propose two spectral algorithms for partitioning nodes in directed graphs with, respectively, a cyclic and an acyclic pattern of connection between groups of nodes. Our methods are based on the computation of extremal eigenvalues of the transition matrix associated with the directed graph. The two algorithms outperform state-of-the-art methods for directed graph clustering on synthetic datasets, including methods based on blockmodels, bibliometric symmetrization and random walks. Our algorithms have the same space complexity as classical spectral clustering algorithms for undirected graphs and their time complexity is also linear in the number of edges in the graph. One of our methods is applied to a trophic network based on predator-prey relationships. It successfully extracts common categories of prey and predators encountered in food chains. The same method is also applied to highlight the hierarchical structure of a worldwide network of Autonomous Systems depicting business agreements between Internet Service Providers.


COBRAS-TS: A new approach to Semi-Supervised Clustering of Time Series

arXiv.org Machine Learning

Clustering is ubiquitous in data analysis, including analysis of time series. It is inherently subjective: different users may prefer different clusterings for a particular dataset. Semi-supervised clustering addresses this by allowing the user to provide examples of instances that should (not) be in the same cluster. This paper studies semi-supervised clustering in the context of time series. We show that COBRAS, a state-of-the-art semi-supervised clustering method, can be adapted to this setting. We refer to this approach as COBRAS-TS. An extensive experimental evaluation supports the following claims: (1) COBRAS-TS far outperforms the current state of the art in semi-supervised clustering for time series, and thus presents a new baseline for the field; (2) COBRAS-TS can identify clusters with separated components; (3) COBRAS-TS can identify clusters that are characterized by small local patterns; (4) a small amount of semi-supervision can greatly improve clustering quality for time series; (5) the choice of the clustering algorithm matters (contrary to earlier claims in the literature).
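The kind of supervision described above, examples of instances that should or should not share a cluster, is commonly expressed as pairwise must-link and cannot-link constraints. The sketch below is not COBRAS or COBRAS-TS; it only shows, under assumed toy data and a hypothetical helper name, how such constraints can be checked against a candidate clustering.

```python
def count_violations(labels, must_link, cannot_link):
    """Count pairwise constraints that a clustering fails to satisfy.

    labels[i] is the cluster id of instance i; each constraint is a pair
    of instance indices.
    """
    violated_ml = sum(1 for a, b in must_link if labels[a] != labels[b])
    violated_cl = sum(1 for a, b in cannot_link if labels[a] == labels[b])
    return violated_ml + violated_cl


# Toy clustering of five time series (indices 0..4) into three clusters.
labels = [0, 0, 1, 1, 2]
must_link = [(0, 1), (2, 4)]    # pairs the user wants together
cannot_link = [(0, 2), (2, 3)]  # pairs the user wants apart

n_violations = count_violations(labels, must_link, cannot_link)
print(n_violations)  # (2, 4) is split and (2, 3) is merged, so this prints 2
```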


80. Grouping unlabelled data with k-means clustering

#artificialintelligence

Sometimes we may have prior knowledge that we want to group the data into a given number of clusters. Other times we may wish to investigate what may be a good number of clusters. In the example below we look at changing the number of clusters between 1 and 100 and measure the average distance of points from their closest cluster centre (kmeans.transform). Looking at the results we may decide that up to about 10 clusters may be useful, but after that there are diminishing returns from adding further clusters.
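The original example is not reproduced in this excerpt. As a rough stand-in, the sketch below runs the same kind of experiment on a much smaller scale: it fits k-means for several values of k and records the average distance of points from their closest cluster centre. It assumes synthetic two-dimensional data and uses a small hand-written k-means with only the standard library, rather than the library call the text refers to (presumably scikit-learn's `KMeans.transform`); all function names are illustrative.

```python
import math
import random


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns the list of k cluster centres."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # initialise from k data points
    for _ in range(n_iter):
        # Assignment step: group points by their closest centre.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centres[j]))
            groups[nearest].append(p)
        # Update step: move each centre to the mean of its group
        # (an empty group keeps its previous centre).
        centres = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centres[j]
            for j, g in enumerate(groups)
        ]
    return centres


def mean_distance_to_closest_centre(points, centres):
    """Average distance of each point from its nearest cluster centre."""
    return sum(min(math.dist(p, c) for c in centres) for p in points) / len(points)


# Synthetic data (assumed): three well-separated blobs of 30 points each.
rng = random.Random(42)
blob_centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [
    (cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
    for cx, cy in blob_centres
    for _ in range(30)
]

scores = {}
for k in range(1, 7):
    centres = kmeans(points, k)
    scores[k] = mean_distance_to_closest_centre(points, centres)
    print(k, round(scores[k], 3))
```

Plotting `scores` against k and looking for the point where the curve flattens is the usual way to read off a reasonable number of clusters.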



Non-Intrusive Signature Extraction for Major Residential Loads

arXiv.org Artificial Intelligence

The data collected by smart meters contain a lot of useful information. One potential use of the data is to track the energy consumption and operating status of major home appliances. The results will enable homeowners to make sound decisions on how to save energy and how to participate in demand response programs. This paper presents a new method to break down the total power demand measured by a smart meter into the demand of individual appliances. A unique feature of the proposed method is that it utilizes diverse signatures associated with the entire operating window of an appliance for identification. As a result, appliances with complicated intermediate processes can be tracked. A novel appliance registration device and scheme is also proposed to automate the creation of an appliance signature database and to eliminate the need for massive training before identification. The software and system have been developed and deployed in real houses in order to verify the proposed method.


Clustering Meets Implicit Generative Models

arXiv.org Machine Learning

Clustering is a cornerstone of unsupervised learning which can be thought of as disentangling multiple generative mechanisms underlying the data. In this paper we introduce an algorithmic framework to train mixtures of implicit generative models which we particularize for variational autoencoders. Relying on an additional set of discriminators, we propose a competitive procedure in which the models only need to approximate the portion of the data distribution from which they can produce realistic samples. As a byproduct, each model is simpler to train, and a clustering interpretation arises naturally from the partitioning of the training points among the models. We empirically show that our approach splits the training distribution in a reasonable way and increases the quality of the generated samples.


Clustrophile 2: Guided Visual Clustering Analysis

arXiv.org Artificial Intelligence

Data clustering is a common unsupervised learning method frequently used in exploratory data analysis. However, identifying relevant structures in unlabeled, high-dimensional data is nontrivial, requiring iterative experimentation with clustering parameters as well as data features and instances. The space of possible clusterings for a typical dataset is vast, and navigating in this vast space is also challenging. The absence of ground-truth labels makes it impossible to define an optimal solution, thus requiring user judgment to establish what can be considered a satisfiable clustering result. Data scientists need adequate interactive tools to effectively explore and navigate the large space of clusterings so as to improve the effectiveness of exploratory clustering analysis. We introduce Clustrophile 2, a new interactive tool for guided clustering analysis. Clustrophile 2 guides users in clustering-based exploratory analysis, adapts user feedback to improve user guidance, facilitates the interpretation of clusters, and helps quickly reason about differences between clusterings. To this end, Clustrophile 2 contributes a novel feature, the clustering tour, to help users choose clustering parameters and assess the quality of different clustering results in relation to current analysis goals and user expectations. We evaluate Clustrophile 2 through a user study with 12 data scientists, who used our tool to explore and interpret sub-cohorts in a dataset of Parkinson's disease patients. Results suggest that Clustrophile 2 improves the speed and effectiveness of exploratory clustering analysis for both experts and non-experts.