Goto

Collaborating Authors

 Clustering


Hierarchical Clustering for Smart Meter Electricity Loads based on Quantile Autocovariances

arXiv.org Machine Learning

In order to improve the efficiency and sustainability of electricity systems, most countries worldwide are deploying advanced metering infrastructures, and in particular household smart meters, in the residential sector. This technology is able to record electricity load time series at a very high frequency rates, information that can be exploited to develop new clustering models to group individual households by similar consumptions patterns. To this end, in this work we propose three hierarchical clustering methodologies that allow capturing different characteristics of the time series. These are based on a set of "dissimilarity" measures computed over different features: quantile auto-covariances, and simple and partial autocorrelations. The main advantage is that they allow summarizing each time series in a few representative features so that they are computationally efficient, robust against outliers, easy to automatize, and scalable to hundreds of thousands of smart meters series. We evaluate the performance of each clustering model in a real-world smart meter dataset with thousands of half-hourly time series. The results show how the obtained clusters identify relevant consumption behaviors of households and capture part of their geo-demographic segmentation. Moreover, we apply a supervised classification procedure to explore which features are more relevant to define each cluster.


Unsupervised Learning Algorithms in One Picture

#artificialintelligence

Unsupervised learning algorithms are "unsupervised" because you let them run without direct supervision. You feed the data into the algorithm, and the algorithm figures out the patterns. The following picture shows the differences between three of the most popular unsupervised learning algorithms: Principal Component Analysis, k-Means clustering and Hierarchical clustering. The three are closely related, because data clustering is a type of data reduction; PCA can be viewed as a continuous counterpart of K-Means (see Ding & He, 2004).


Customer Segmentation Using K Means Clustering - KDnuggets

#artificialintelligence

Customer Segmentation is the subdivision of a market into discrete customer groups that share similar characteristics. Customer Segmentation can be a powerful means to identify unsatisfied customer needs. Using the above data companies can then outperform the competition by developing uniquely appealing products and services. You are owing a supermarket mall and through membership cards, you have some basic data about your customers like Customer ID, age, gender, annual income and spending score. You want to understand the customers like who are the target customers so that the sense can be given to marketing team and plan the strategy accordingly.


Hyper-SAGNN: a self-attention based graph neural network for hypergraphs

arXiv.org Machine Learning

Graph representation learning for hypergraphs can be used to extract patterns among higher-order interactions that are critically important in many real world problems. Current approaches designed for hypergraphs, however, are unable to handle different types of hypergraphs and are typically not generic for various learning tasks. Indeed, models that can predict variable-sized heterogeneous hyperedges have not been available. Here we develop a new self-attention based graph neural network called Hyper-SAGNN applicable to homogeneous and heterogeneous hypergraphs with variable hyperedge sizes. We perform extensive evaluations on multiple datasets, including four benchmark network datasets and two single-cell Hi-C datasets in genomics. We demonstrate that Hyper-SAGNN significantly outperforms the state-of-the-art methods on traditional tasks while also achieving great performance on a new task called outsider identification. Hyper-SAGNN will be useful for graph representation learning to uncover complex higher-order interactions in different applications.


Iterative Algorithm for Discrete Structure Recovery

arXiv.org Machine Learning

We propose a general modeling and algorithmic framework for discrete structure recovery that can be applied to a wide range of problems. Under this framework, we are able to study the recovery of clustering labels, ranks of players, and signs of regression coefficients from a unified perspective. A simple iterative algorithm is proposed for discrete structure recovery, which generalizes methods including Lloyd's algorithm and the iterative feature matching algorithm. A linear convergence result for the proposed algorithm is established in this paper under appropriate abstract conditions on stochastic errors and initialization. We illustrate our general theory by applying it on three representative problems: clustering in Gaussian mixture model, approximate ranking, and sign recovery in compressed sensing, and show that minimax rate is achieved in each case.


Novel semi-metrics for multivariate change point analysis and anomaly detection

arXiv.org Machine Learning

This paper proposes a new method for determining similarity and anomalies between time series, most practically effective in large collections of (likely related) time series, with a particular focus on measuring distances between structural breaks within such a collection. We consolidate and generalise a class of semi-metric distance measures, which we term MJ distances. Experiments on simulated data demonstrate that our proposed family of distances uncover similarity within collections of time series more effectively than measures such as the Hausdorff and Wasserstein metrics. Although our class of distances do not necessarily satisfy the triangle inequality requirement of a metric, we analyse the transitivity properties of respective distance matrices in various contextual scenarios. There, we demonstrate a trade-off between robust performance in the presence of outliers, and the triangle inequality property. We show in experiments using real data that the contrived scenarios that severely violate the transitivity property rarely exhibit themselves in real data; instead, our family of measures satisfies all the properties of a metric most of the time. We illustrate three ways of analysing the distance and similarity matrices, via eigenvalue analysis, hierarchical clustering, and spectral clustering. The results from our hierarchical and spectral clustering experiments on simulated data demonstrate that the Hausdorff and Wasserstein metrics may lead to erroneous inference as to which time series are most similar with respect to their structural breaks, while our semi-metrics provide an improvement.


Clustering in Partially Labeled Stochastic Block Models via Total Variation Minimization

arXiv.org Machine Learning

A main task in data analysis is to organize data points into coherent groups or clusters. The stochastic block model is a probabilistic model for the cluster structure. This model prescribes different probabilities for the presence of edges within a cluster and between different clusters. We assume that the cluster assignments are known for at least one data point in each cluster. In such a partially labeled stochastic block model, clustering amounts to estimating the cluster assignments of the remaining data points. We study total variation minimization as a method for this clustering task. We implement the resulting clustering algorithm as a highly scalable message passing protocol. We also provide a condition on the model parameters such that total variation minimization allows for accurate clustering.


Measuring Similarity of Interactive Driving Behaviors Using Matrix Profile

arXiv.org Artificial Intelligence

-- Understanding multi-vehicle interactive behaviors with temporal sequential observations is crucial for autonomous vehicles to make appropriate decisions in an uncertain traffic environment. On-demand similarity measures are significant for autonomous vehicles to deal with massive interactive driving behaviors by clustering and classifying diverse scenarios. This paper proposes a general approach for measuring spatiotemporal similarity of interactive behaviors using a multivariate matrix profile technique. The key attractive features of the approach are its superior space and time complexity, real-time online computing for streaming traffic data, and possible capability of leveraging hardware for parallel computation. The proposed approach is validated through automatically discovering similar interactive driving behaviors at intersections from sequential data. One of the biggest challenges for deploying autonomous vehicles (A Vs) in real life is the requirement of the A Vs' capability to interact with surrounding road users. Classifying diverse scenarios and separately designing appropriate decisions using on-hand prior knowledge is unfortunately not realistic [1] because of the diversity of scenarios that are far larger and messier than human beings can cope with [2].


24 Best Data Science Certification & Courses 2019 Digital Learning Land

#artificialintelligence

Are you looking for Best Data Science Certification? With these best data science online courses, Degree, Training, Classes, and Tutorial 2019 you can improve your precise skills and become a Data Scientist. Data science introduces the incorporation of programming, statistical skills, machine learning, and algorithms. These best Data Science tutorials will make you skilled in all insights of Data Science. In this modernized time, most organizations and companies are opening their opportunity to Data Science. Companies are now concentrating on Data Science to increase their business. So there is a huge demand for data scientist and people who are interested to build their career in this field there is a tremendous chance for them. Data Science is a method that combines numerous segments. In these following courses, you will gain in-depth knowledge of Data Science. Python is one of the high-level programming languages. Those who are highly interested in machine learning this course is suggested to them. This course is an overview of machine learning both in python and R. This course is the BESTSELLER course of Machine Learning. Anyone who is not satisfied with his job to want to become a data scientist and want to start a career in data science highly recommended to do this course. This course will explore all the different fields of machine learning. The purpose of courses to teach the learner how to create machine learning algorithms in Python and R from to data science experts. This is the BESTSELLER course. If you want to learn how you will be the master in machine learning on Python and R this course is for you. Super Data science team and super data science support also instructed this course. This instructors doing their job creatively for covering all the gaps of the learner also provides helps for the better of the learning process. About 380,693 students enrolled in this course and the rating is 4.5.


UrbanRhythm: Revealing Urban Dynamics Hidden in Mobility Data

arXiv.org Machine Learning

Understanding urban dynamics, i.e., how the types and intensity of urban residents' activities in the city change along with time, is of urgent demand for building an efficient and livable city. Nonetheless, this is challenging due to the expanding urban population and the complicated spatial distribution of residents. In this paper, to reveal urban dynamics, we propose a novel system UrbanRhythm to reveal the urban dynamics hidden in human mobility data. UrbanRhythm addresses three questions: 1) What mobility feature should be used to present residents' high-dimensional activities in the city? 2) What are basic components of urban dynamics? 3) What are the long-term periodicity and short-term regularity of urban dynamics? In UrbanRhythm, we extract staying, leaving, arriving three attributes of mobility and use a image processing method Saak transform to calculate the mobility distribution feature. For the second question, several city states are identified by hierarchy clustering as the basic components of urban dynamics, such as sleeping states and working states. We further characterize the urban dynamics as the transform of city states along time axis. For the third question, we directly observe the long-term periodicity of urban dynamics from visualization. Then for the short-term regularity, we design a novel motif analysis method to discovery motifs as well as their hierarchy relationships. We evaluate our proposed system on two real-life datesets and validate the results according to App usage records. This study sheds light on urban dynamics hidden in human mobility and can further pave the way for more complicated mobility behavior modeling and deeper urban understanding.