Clustering
Clustering Residential Electricity Consumption Data to Create Archetypes that Capture Variability in Customer Behaviour
Toussaint, Wiebke, Moodley, Deshendran
Clustering is frequently used in the energy domain to identify dominant electricity consumption patterns of households, which can be used to construct customer archetypes for long-term energy planning. Selecting a useful set of clusters, however, requires extensive experimentation and domain knowledge. While internal clustering validation measures are well established in the electricity domain, limited research is available for external measures. We present a method that distills expert knowledge into competency questions, which we operationalised as external evaluation measures to specify the clustering objective for our application. This approach supported a structured and formal cluster validation process that combined internal and external measures to select a cluster set that is useful for creating residential electricity customer archetypes from electricity meter data in South Africa. We validated the approach in a case study application where we successfully reconstructed customer archetypes previously developed by experts. Our approach enables transparent and repeatable cluster ranking and selection by data scientists, even if they have limited domain knowledge.
Node Embeddings and Exact Low-Rank Representations of Complex Networks
Chanpuriya, Sudhanshu, Musco, Cameron, Sotiropoulos, Konstantinos, Tsourakakis, Charalampos E.
Low-dimensional embeddings, from classical spectral embeddings to modern neural-net-inspired methods, are a cornerstone in the modeling and analysis of complex networks. Recent work by Seshadhri et al. (PNAS 2020) suggests that such embeddings cannot capture local structure arising in complex networks. In particular, they show that any network generated from a natural low-dimensional model cannot be both sparse and have high triangle density (high clustering coefficient), two hallmark properties of many real-world networks. In this work we show that the results of Seshadhri et al. are intimately connected to the model they use rather than the low-dimensional structure of complex networks. Specifically, we prove that a minor relaxation of their model can generate sparse graphs with high triangle density. Surprisingly, we show that this same model leads to exact low-dimensional factorizations of many real-world networks. We give a simple algorithm based on logistic principal component analysis (LPCA) that succeeds in finding such exact embeddings. Finally, we perform a large number of experiments that verify the ability of very low-dimensional embeddings to capture local structure in real-world networks.
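The idea of a logistic low-rank factorization mentioned in the abstract above can be illustrated with a small sketch. This is not the authors' algorithm: it is plain gradient descent on a logistic loss, and the function names, toy graph, rank, step count, and learning rate are all assumptions chosen for illustration. A factorization counts as exact when the sign of every entry of the product of the two factor matrices matches the adjacency matrix.

```python
import numpy as np


def logistic_loss(adjacency, left, right):
    """Sum of logistic losses between sigmoid(left @ right.T) and the adjacency."""
    logits = left @ right.T
    # Numerically stable form of -[A*log(p) + (1-A)*log(1-p)] with p = sigmoid(logits).
    return float(np.sum(np.logaddexp(0.0, logits) - adjacency * logits))


def fit_logistic_factorization(adjacency, rank, steps=3000, learning_rate=0.1, seed=0):
    """Fit two n-by-rank factors by gradient descent (illustrative, hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = adjacency.shape[0]
    left = 0.1 * rng.standard_normal((n, rank))
    right = 0.1 * rng.standard_normal((n, rank))
    initial_loss = logistic_loss(adjacency, left, right)
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(left @ right.T)))
        residual = probs - adjacency  # gradient of the loss with respect to the logits
        grad_left = residual @ right
        grad_right = residual.T @ left
        left -= learning_rate * grad_left
        right -= learning_rate * grad_right
    return left, right, initial_loss


# Toy graph (an assumption): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adjacency = np.zeros((6, 6))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1.0

left, right, initial_loss = fit_logistic_factorization(adjacency, rank=3)
final_loss = logistic_loss(adjacency, left, right)
reconstructed = (left @ right.T > 0).astype(float)

print("loss before / after:", initial_loss, final_loss)
print("mismatched entries:", int(np.sum(reconstructed != adjacency)))
```

The mismatch count shows whether the chosen rank was enough for an exact reconstruction on this toy graph; a larger rank or more steps may be needed on other inputs.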
Cascade of Phase Transitions for Multi-Scale Clustering
Bonnaire, T., Decelle, A., Aghanim, N.
Many optimisation and inference problems have been shown to have an equivalent formulation in statistical physics [1, 2] that allowed a brand-new look at some longstanding problems and improved the understanding of complex systems [3, 4]. In particular, the identification of the phase diagram of a model can bring interesting new insights such as knowing if a given information can be retrieved depending on the model's parameters and the [...] Following these steps, we aim at showing how the latter formulation can be useful to understand and analyse the outcome of GMMs. In particular, we exploit the cascade of phase transitions occurring during annealing procedures of the EM algorithm to build a hierarchical multi-scale description of a dataset. By defining an overlap between the ground truth and the inferred partitions, we show on artificial datasets how it can be interpreted as an order parameter whose value follows the sequence of phase transitions.
Aerodynamic Data Predictions Based on Multi-task Learning
Hu, Liwei, Xiang, Yu, Zhan, Jun, Shi, Zifang, Wang, Wenzheng
The quality of datasets is one of the key factors that affect the accuracy of aerodynamic data models. For example, in the uniformly sampled Burgers' dataset, the scarce high-speed data are overwhelmed by massive low-speed data. Predicting high-speed data is more difficult than predicting low-speed data because the amount of high-speed data is limited, i.e. the quality of the Burgers' dataset is not satisfactory. To improve the quality of datasets, traditional methods usually employ data resampling to produce enough data for the under-represented parts of the original datasets before modeling, which increases computational costs. Recently, mixtures of experts have been used in natural language processing to deal with different parts of sentences, which suggests a way to eliminate the need for data resampling in aerodynamic data modeling. Motivated by this, we propose multi-task learning (MTL), a dataset-quality-adaptive learning scheme that combines task allocation and aerodynamic characteristics learning to spread the load of the entire learning task. The task allocation divides a whole learning task into several independent subtasks, while the aerodynamic characteristics learning learns these subtasks simultaneously to achieve better precision. Two experiments with poor-quality datasets are conducted to verify that the MTL adapts to dataset quality. The results show that the MTL is more accurate than FCNs and GANs on poor-quality datasets.
Refining Similarity Matrices to Cluster Attributed Networks Accurately
Yajima, Yuta, Inokuchi, Akihiro
As a result of the recent popularity of social networks and the increase in the number of research papers published across all fields, attributed networks, which consist of relationships between objects with attributes such as people and papers, are becoming increasingly large. Therefore, various studies on clustering attributed networks into sub-networks are being actively conducted. When clustering attributed networks using spectral clustering, the clustering accuracy is strongly affected by the quality of the similarity matrices, which are input into spectral clustering and represent the similarities between pairs of objects. In this paper, we aim to increase the accuracy by refining the matrices before applying spectral clustering to them. We verify the practicability of our proposed method by comparing the accuracy of spectral clustering with similarity matrices before and after refining them.
Hierarchy Clustering
As with KMeans clustering, our intention is to create clusters within our dataset, grouping related data so that we can identify different classes or groupings and use them to make predictions in a wide array of applications. However, in the areas where KMeans fails, Hierarchy Clustering attempts to ease the burden somewhat by offering several choices of technique, such as single-link clustering or Ward clustering, which the user picks depending on the layout of their dataset. At the end of the day, Hierarchy Clustering is just a regular clustering algorithm, with its own advantages and disadvantages, and is by no means the successor of KMeans. Generally, Hierarchy Clustering works as follows: you start off with your dataset, which may be spaced out strangely or have a weird layout with uneven densities, where the clusters are not easily differentiable and you honestly have no idea where they lie. That's alright; that is usually the case in real-world problems.
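To make the choice between single-link and Ward clustering concrete, here is a minimal sketch using SciPy's `scipy.cluster.hierarchy` module. The toy points, the helper name `hierarchical_labels`, and the choice of two clusters are assumptions for illustration only; on real data the two linkage methods can disagree.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def hierarchical_labels(points, method="ward", n_clusters=2):
    """Build the merge tree with the given linkage, then cut it into n_clusters."""
    # linkage starts with every point as its own cluster and repeatedly
    # merges the two closest clusters according to the chosen method.
    merge_tree = linkage(points, method=method)
    # Cut the tree so that at most n_clusters flat clusters remain.
    return fcluster(merge_tree, t=n_clusters, criterion="maxclust")


# Toy data (an assumption): two well-separated groups of four points each.
points = np.array([
    [0.0, 0.0], [0.5, 0.2], [0.1, 0.6], [0.4, 0.5],
    [10.0, 10.0], [10.3, 9.8], [9.7, 10.4], [10.2, 10.1],
])

single_labels = hierarchical_labels(points, method="single")
ward_labels = hierarchical_labels(points, method="ward")

print("single-link labels:", single_labels)
print("Ward labels:", ward_labels)
```

Single-link merges clusters by their closest pair of points, which tends to follow chains and irregular shapes, while Ward merges the pair that least increases within-cluster variance, which tends to favour compact groups.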
Mixed data Deep Gaussian Mixture Model: A clustering model for mixed datasets
Fuchs, Robin, Pommeret, Denys, Viroli, Cinzia
Clustering mixed data presents numerous challenges inherent to the very heterogeneous nature of the variables. Two major difficulties lie in the initialisation of the algorithms and in making variables comparable between types. This work is concerned with these two problems. We introduce a two-head model-based clustering method called Mixed data Deep Gaussian Mixture Model (MDGMM) that can be viewed as an automatic way to merge the clusterings performed separately on continuous and non-continuous data. We also design a new initialisation strategy and a data-driven method that selects "on the fly" the best specification of the model and the optimal number of clusters for a given dataset. In addition, our model provides continuous low-dimensional representations of the data, which can be a useful tool to visualize mixed datasets. Finally, we validate the performance of our approach by comparing its results with state-of-the-art mixed data clustering models on several commonly used datasets.
Penalized model-based clustering of fMRI data
DiLernia, Andrew, Quevedo, Karina, Camchong, Jazmin, Lim, Kelvin, Pan, Wei, Zhang, Lin
Functional magnetic resonance imaging (fMRI) data have become increasingly available and are useful for describing functional connectivity (FC), the relatedness of neuronal activity in regions of the brain. This FC of the brain provides insight into certain neurodegenerative diseases and psychiatric disorders, and thus is of clinical importance. To help inform physicians regarding patient diagnoses, unsupervised clustering of subjects based on FC is desired, allowing the data to inform us of groupings of patients based on shared features of connectivity. Since heterogeneity in FC is present even between patients within the same group, it is important to allow subject-level differences in connectivity, while still pooling information across patients within each group to describe group-level FC. To this end, we propose a random covariance clustering model (RCCM) to concurrently cluster subjects based on their FC networks, estimate the unique FC networks of each subject, and infer shared network features. Although methods exist for estimating FC or clustering subjects using fMRI data, our novel contribution is to cluster or group subjects based on similar FC of the brain while simultaneously providing group- and subject-level FC network estimates. The competitive performance of RCCM relative to other methods is demonstrated through simulations in various settings, achieving both improved clustering of subjects and estimation of FC networks. Utility of the proposed method is demonstrated with application to a resting-state fMRI data set collected on 43 healthy controls and 61 participants diagnosed with schizophrenia.
The Impact of Isolation Kernel on Agglomerative Hierarchical Clustering Algorithms
Han, Xin, Zhu, Ye, Ting, Kai Ming, Li, Gang
Agglomerative hierarchical clustering (AHC) is one of the popular clustering approaches. Existing AHC methods, which are based on a distance measure, have one key issue: they have difficulty identifying adjacent clusters with varied densities, regardless of the cluster extraction method applied to the resultant dendrogram. In this paper, we identify the root cause of this issue and show that the use of a data-dependent kernel (instead of distance or an existing kernel) provides an effective means to address it. We analyse the condition under which existing AHC methods fail to extract clusters effectively, and the reason why the data-dependent kernel is an effective remedy. This leads to a new approach to kernelise existing hierarchical clustering algorithms such as traditional AHC algorithms, HDBSCAN, GDL and PHA. In each of these algorithms, our empirical evaluation shows that a recently introduced Isolation Kernel produces a higher quality or purer dendrogram than distance, Gaussian Kernel and adaptive Gaussian Kernel.
Local Connectivity in Centroid Clustering
Clustering is a fundamental task in unsupervised learning, one that aims to group a dataset into clusters of similar objects. There has been recent interest in embedding normative considerations around fairness within clustering formulations. In this paper, we propose 'local connectivity' as a crucial factor in assessing membership desert in centroid clustering. We use local connectivity to refer to the support that the local neighborhood of an object offers for its membership of the cluster in question. We motivate the need to consider local connectivity of objects in cluster assignment, and provide ways to quantify local connectivity in a given clustering. We then exploit concepts from density-based clustering and devise LOFKM, a clustering method that seeks to deepen local connectivity in clustering outputs, while staying within the framework of centroid clustering. Through an empirical evaluation over real-world datasets, we show that LOFKM achieves notable improvements in local connectivity at reasonable costs to clustering quality, illustrating the effectiveness of the method.