Clustering
Unsupervised clustering of disturbances in power systems via deep convolutional autoencoders
Islam, Md Maidul, Faruque, Md Omar, Butterfield, Joshua, Singh, Gaurav, Cooke, Thomas A.
Power quality (PQ) events are recorded by PQ meters whenever anomalous events are detected on the power grid. Using neural networks with machine learning can aid in accurately classifying the recorded waveforms and help power system engineers diagnose and rectify the root causes of problems. However, many of the waveforms captured during a disturbance in the power system need to be labeled for supervised learning, leaving a large number of data recordings for engineers to process manually or go unseen. This paper presents an autoencoder and K-means clustering-based unsupervised technique that can be used to cluster PQ events into categories like sag, interruption, transients, normal, and harmonic distortion to enable filtering of anomalous waveforms from recurring or normal waveforms. The method is demonstrated using three-phase, field-obtained voltage waveforms recorded in a distribution grid. First, a convolutional autoencoder compresses the input signals into a set of lower feature dimensions which, after further processing, is passed to the K-means algorithm to identify data clusters. Using a small, labeled dataset, numerical labels are then assigned to events based on a cosine similarity analysis. Finally, the study analyzes the clusters using the t-distributed stochastic neighbor embedding (t-SNE) visualization tool, demonstrating that the technique can help investigate a large number of captured events in a quick manner.
Subject clustering by IF-PCA and several recent methods
Chen, Dieyi, Jin, Jiashun, Ke, Zheng Tracy
Subject clustering (i.e., the use of measured features to cluster subjects, such as patients or cells, into multiple groups) is a problem of great interest. In recent years, many approaches were proposed, among which unsupervised deep learning (UDL) has received a great deal of attention. Two interesting questions are (a) how to combine the strengths of UDL and other approaches, and (b) how these approaches compare to one other. We combine Variational Auto-Encoder (VAE), a popular UDL approach, with the recent idea of Influential Feature PCA (IF-PCA), and propose IF-VAE as a new method for subject clustering. We study IF-VAE and compare it with several other methods (including IF-PCA, VAE, Seurat, and SC3) on $10$ gene microarray data sets and $8$ single-cell RNA-seq data sets. We find that IF-VAE significantly improves over VAE, but still underperforms IF-PCA. We also find that IF-PCA is quite competitive, which slightly outperforms Seurat and SC3 over the $8$ single-cell data sets. IF-PCA is conceptually simple and permits delicate analysis. We demonstrate that IF-PCA is capable of achieving the phase transition in a Rare/Weak model. Comparatively, Seurat and SC3 are more complex and theoretically difficult to analyze (for these reasons, their optimality remains unclear).
Actively Supervised Clustering for Open Relation Extraction
Zhao, Jun, Zhang, Yongxin, Zhang, Qi, Gui, Tao, Wei, Zhongyu, Peng, Minlong, Sun, Mingming
Current clustering-based Open Relation Extraction (OpenRE) methods usually adopt a two-stage pipeline. The first stage simultaneously learns relation representations and assignments. The second stage manually labels several instances and thus names the relation for each cluster. However, unsupervised objectives struggle to optimize the model to derive accurate clustering assignments, and the number of clusters has to be supplied in advance. In this paper, we present a novel setting, named actively supervised clustering for OpenRE. Our insight lies in that clustering learning and relation labeling can be alternately performed, providing the necessary guidance for clustering without a significant increase in human effort. The key to the setting is selecting which instances to label. Instead of using classical active labeling strategies designed for fixed known classes, we propose a new strategy, which is applicable to dynamically discover clusters of unknown relations. Experimental results show that our method is able to discover almost all relational clusters in the data and improve the SOTA methods by 10.3\% and 5.2\%, on two datasets respectively.
arXiv4TGC: Large-Scale Datasets for Temporal Graph Clustering
Liu, Meng, Liang, Ke, Liu, Yue, Wang, Siwei, Zhou, Sihang, Liu, Xinwang
Temporal graph clustering (TGC) is a crucial task in temporal graph learning. Its focus is on node clustering on temporal graphs, and it offers greater flexibility for large-scale graph structures due to the mechanism of temporal graph methods. However, the development of TGC is currently constrained by a significant problem: the lack of suitable and reliable large-scale temporal graph datasets to evaluate clustering performance. In other words, most existing temporal graph datasets are in small sizes, and even large-scale datasets contain only a limited number of available node labels. It makes evaluating models for large-scale temporal graph clustering challenging. To address this challenge, we build arXiv4TGC, a set of novel academic datasets (including arXivAI, arXivCS, arXivMath, arXivPhy, and arXivLarge) for large-scale temporal graph clustering. In particular, the largest dataset, arXivLarge, contains 1.3 million labeled available nodes and 10 million temporal edges. We further compare the clustering performance with typical temporal graph learning models on both previous classic temporal graph datasets and the new datasets proposed in this paper. The clustering performance on arXiv4TGC can be more apparent for evaluating different models, resulting in higher clustering confidence and more suitable for large-scale temporal graph clustering. The arXiv4TGC datasets are publicly available at: https://github.com/MGitHubL/arXiv4TGC.
When to Pre-Train Graph Neural Networks? From Data Generation Perspective!
Cao, Yuxuan, Xu, Jiarong, Yang, Carl, Wang, Jiaan, Zhang, Yunchao, Wang, Chunping, Chen, Lei, Yang, Yang
In recent years, graph pre-training has gained significant attention, focusing on acquiring transferable knowledge from unlabeled graph data to improve downstream performance. Despite these recent endeavors, the problem of negative transfer remains a major concern when utilizing graph pre-trained models to downstream tasks. Previous studies made great efforts on the issue of what to pre-train and how to pre-train by designing a variety of graph pre-training and fine-tuning strategies. However, there are cases where even the most advanced "pre-train and fine-tune" paradigms fail to yield distinct benefits. This paper introduces a generic framework W2PGNN to answer the crucial question of when to pre-train (i.e., in what situations could we take advantage of graph pre-training) before performing effortful pre-training or fine-tuning. We start from a new perspective to explore the complex generative mechanisms from the pre-training data to downstream data. In particular, W2PGNN first fits the pre-training data into graphon bases, each element of graphon basis (i.e., a graphon) identifies a fundamental transferable pattern shared by a collection of pre-training graphs. All convex combinations of graphon bases give rise to a generator space, from which graphs generated form the solution space for those downstream data that can benefit from pre-training. In this manner, the feasibility of pre-training can be quantified as the generation probability of the downstream data from any generator in the generator space. W2PGNN offers three broad applications: providing the application scope of graph pre-trained models, quantifying the feasibility of pre-training, and assistance in selecting pre-training data to enhance downstream performance. We provide a theoretically sound solution for the first application and extensive empirical justifications for the latter two applications.
Scalable Set Encoding with Universal Mini-Batch Consistency and Unbiased Full Set Gradient Approximation
Willette, Jeffrey, Lee, Seanie, Andreis, Bruno, Kawaguchi, Kenji, Lee, Juho, Hwang, Sung Ju
Recent work on mini-batch consistency (MBC) for set functions has brought attention to the need for sequentially processing and aggregating chunks of a partitioned set while guaranteeing the same output for all partitions. However, existing constraints on MBC architectures lead to models with limited expressive power. Additionally, prior work has not addressed how to deal with large sets during training when the full set gradient is required. To address these issues, we propose a Universally MBC (UMBC) class of set functions which can be used in conjunction with arbitrary non-MBC components while still satisfying MBC, enabling a wider range of function classes to be used in MBC settings. Furthermore, we propose an efficient MBC training algorithm which gives an unbiased approximation of the full set gradient and has a constant memory overhead for any set size for both train- and test-time. We conduct extensive experiments including image completion, text classification, unsupervised clustering, and cancer detection on high-resolution images to verify the efficiency and efficacy of our scalable set encoding framework. Our code is available at github.com/jeffwillette/umbc
Faster Approximation Algorithms for Parameterized Graph Clustering and Edge Labeling
Graph clustering is a fundamental task in network analysis where the goal is to detect sets of nodes that are well-connected to each other but sparsely connected to the rest of the graph. We present faster approximation algorithms for an NP-hard parameterized clustering framework called LambdaCC, which is governed by a tunable resolution parameter and generalizes many other clustering objectives such as modularity, sparsest cut, and cluster deletion. Previous LambdaCC algorithms are either heuristics with no approximation guarantees, or computationally expensive approximation algorithms. We provide fast new approximation algorithms that can be made purely combinatorial. These rely on a new parameterized edge labeling problem we introduce that generalizes previous edge labeling problems that are based on the principle of strong triadic closure and are of independent interest in social network analysis. Our methods are orders of magnitude more scalable than previous approximation algorithms and our lower bounds allow us to obtain a posteriori approximation guarantees for previous heuristics that have no approximation guarantees of their own.
Towards High-Performance Exploratory Data Analysis (EDA) Via Stable Equilibrium Point
Exploratory data analysis (EDA) is a vital procedure for data science projects. In this work, we introduce a stable equilibrium point (SEP) - based framework for improving the efficiency and solution quality of EDA. By exploiting the SEPs to be the representative points, our approach aims to generate high-quality clustering and data visualization for large-scale data sets. A very unique property of the proposed method is that the SEPs will directly encode the clustering properties of data sets. Compared with prior state-of-the-art clustering and data visualization methods, the proposed methods allow substantially improving computing efficiency and solution quality for large-scale data analysis tasks.
Efficient Recruitment Strategy for Collaborative Mobile Crowd Sensing Based on GCN Trustworthiness Prediction
Zhan, Zhongwei, Wang, Yingjie, Duan, Peiyong, Sai, Akshita Maradapu Vera Venkata, Liu, Zhaowei, Xiang, Chaocan, Tong, Xiangrong, Wang, Weilong, Cai, Zhipeng
Collaborative Mobile Crowd Sensing (CMCS) enhances data quality and coverage by promoting teamwork in task sensing, with worker recruitment representing a complex multi-objective optimization problem. Existing strategies mainly focus on the characteristics of workers themselves, neglecting the asymmetric trust relationships between them, which affects the rationality of task utility evaluation. To address this, this paper first employs the Mini-Batch K-Means clustering algorithm and deploys edge servers to enable efficient distributed worker recruitment. Historical data and task requirements are utilized to obtain workers' ability types and distances. A trust-directed graph in the worker's social network is input into the Graph Convolutional Network (GCN) framework for training, capturing asymmetric trustworthiness between worker pairs. Privacy leakage is prevented in CMCS scenarios through high trust values between workers. Ultimately, an undirected recruitment graph is constructed using workers' abilities, trust values, and distance weights, transforming the worker recruitment problem into a Maximum Weight Average Subgraph Problem (MWASP). A Tabu Search Recruitment (TSR) algorithm is proposed to rationally recruit a balanced multi-objective optimal task utility worker set for each task. Extensive simulation experiments on four real-world datasets demonstrate the effectiveness of the proposed strategy, outperforming other strategies.
Early Discovery of Emerging Entities in Persian Twitter with Semantic Similarity
Yousefi, Shahin, Hooshmand, Mohsen, Afsharchi, Mohsen
Discovering emerging entities (EEs) is the problem of finding entities before their establishment. These entities can be critical for individuals, companies, and governments. Many of these entities can be discovered on social media platforms, e.g. Twitter. These identities have been the spot of research in academia and industry in recent years. Similar to any machine learning problem, data availability is one of the major challenges in this problem. This paper proposes EEPT. That is an online clustering method able to discover EEs without any need for training on a dataset. Additionally, due to the lack of a proper evaluation metric, this paper uses a new metric to evaluate the results. The results show that EEPT is promising and finds significant entities before their establishment.