AITopics

Power quality (PQ) events are recorded by PQ meters whenever anomalous events are detected on the power grid. Using neural networks with machine learning can aid in accurately classifying the recorded waveforms and help power system engineers diagnose and rectify the root causes of problems. However, many of the waveforms captured during a disturbance in the power system need to be labeled for supervised learning, leaving a large number of data recordings for engineers to process manually or go unseen. This paper presents an autoencoder and K-means clustering-based unsupervised technique that can be used to cluster PQ events into categories like sag, interruption, transients, normal, and harmonic distortion to enable filtering of anomalous waveforms from recurring or normal waveforms. The method is demonstrated using three-phase, field-obtained voltage waveforms recorded in a distribution grid. First, a convolutional autoencoder compresses the input signals into a set of lower feature dimensions which, after further processing, is passed to the K-means algorithm to identify data clusters. Using a small, labeled dataset, numerical labels are then assigned to events based on a cosine similarity analysis. Finally, the study analyzes the clusters using the t-distributed stochastic neighbor embedding (t-SNE) visualization tool, demonstrating that the technique can help investigate a large number of captured events in a quick manner.

artificial intelligence, machine learning, waveform, (16 more...)

2306.06124

Country:

North America > United States > Florida > Leon County > Tallahassee (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre: Research Report (0.40)

Industry: Energy > Power Industry (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Chen, Dieyi, Jin, Jiashun, Ke, Zheng Tracy

Subject clustering by IF-PCA and several recent methods

Subject clustering (i.e., the use of measured features to cluster subjects, such as patients or cells, into multiple groups) is a problem of great interest. In recent years, many approaches were proposed, among which unsupervised deep learning (UDL) has received a great deal of attention. Two interesting questions are (a) how to combine the strengths of UDL and other approaches, and (b) how these approaches compare to one other. We combine Variational Auto-Encoder (VAE), a popular UDL approach, with the recent idea of Influential Feature PCA (IF-PCA), and propose IF-VAE as a new method for subject clustering. We study IF-VAE and compare it with several other methods (including IF-PCA, VAE, Seurat, and SC3) on $10$ gene microarray data sets and $8$ single-cell RNA-seq data sets. We find that IF-VAE significantly improves over VAE, but still underperforms IF-PCA. We also find that IF-PCA is quite competitive, which slightly outperforms Seurat and SC3 over the $8$ single-cell data sets. IF-PCA is conceptually simple and permits delicate analysis. We demonstrate that IF-PCA is capable of achieving the phase transition in a Rare/Weak model. Comparatively, Seurat and SC3 are more complex and theoretically difficult to analyze (for these reasons, their optimality remains unclear).

bioinformatics, data mining, machine learning, (17 more...)

2306.05363

Country: North America > United States > California > Los Angeles County > Los Angeles (0.14)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Biomedical Informatics (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.94)
(2 more...)

Actively Supervised Clustering for Open Relation Extraction

Zhao, Jun, Zhang, Yongxin, Zhang, Qi, Gui, Tao, Wei, Zhongyu, Peng, Minlong, Sun, Mingming

Current clustering-based Open Relation Extraction (OpenRE) methods usually adopt a two-stage pipeline. The first stage simultaneously learns relation representations and assignments. The second stage manually labels several instances and thus names the relation for each cluster. However, unsupervised objectives struggle to optimize the model to derive accurate clustering assignments, and the number of clusters has to be supplied in advance. In this paper, we present a novel setting, named actively supervised clustering for OpenRE. Our insight lies in that clustering learning and relation labeling can be alternately performed, providing the necessary guidance for clustering without a significant increase in human effort. The key to the setting is selecting which instances to label. Instead of using classical active labeling strategies designed for fixed known classes, we propose a new strategy, which is applicable to dynamically discover clusters of unknown relations. Experimental results show that our method is able to discover almost all relational clusters in the data and improve the SOTA methods by 10.3\% and 5.2\%, on two datasets respectively.

machine learning, natural language, relation, (18 more...)

2306.04968

Country:

Europe > Italy > Tuscany > Florence (0.04)
Asia > Middle East > Jordan (0.04)
Asia > China > Shanghai > Shanghai (0.04)
(3 more...)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv4TGC: Large-Scale Datasets for Temporal Graph Clustering

Liu, Meng, Liang, Ke, Liu, Yue, Wang, Siwei, Zhou, Sihang, Liu, Xinwang

Temporal graph clustering (TGC) is a crucial task in temporal graph learning. Its focus is on node clustering on temporal graphs, and it offers greater flexibility for large-scale graph structures due to the mechanism of temporal graph methods. However, the development of TGC is currently constrained by a significant problem: the lack of suitable and reliable large-scale temporal graph datasets to evaluate clustering performance. In other words, most existing temporal graph datasets are in small sizes, and even large-scale datasets contain only a limited number of available node labels. It makes evaluating models for large-scale temporal graph clustering challenging. To address this challenge, we build arXiv4TGC, a set of novel academic datasets (including arXivAI, arXivCS, arXivMath, arXivPhy, and arXivLarge) for large-scale temporal graph clustering. In particular, the largest dataset, arXivLarge, contains 1.3 million labeled available nodes and 10 million temporal edges. We further compare the clustering performance with typical temporal graph learning models on both previous classic temporal graph datasets and the new datasets proposed in this paper. The clustering performance on arXiv4TGC can be more apparent for evaluating different models, resulting in higher clustering confidence and more suitable for large-scale temporal graph clustering. The arXiv4TGC datasets are publicly available at: https://github.com/MGitHubL/arXiv4TGC.

artificial intelligence, dataset, machine learning, (16 more...)

2306.04962

Country:

North America > United States > California > Orange County > Irvine (0.04)
Asia > China > Hunan Province > Changsha (0.04)

Genre: Research Report (0.50)

Industry:

Information Technology (0.69)
Education (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

When to Pre-Train Graph Neural Networks? From Data Generation Perspective!

Cao, Yuxuan, Xu, Jiarong, Yang, Carl, Wang, Jiaan, Zhang, Yunchao, Wang, Chunping, Chen, Lei, Yang, Yang

In recent years, graph pre-training has gained significant attention, focusing on acquiring transferable knowledge from unlabeled graph data to improve downstream performance. Despite these recent endeavors, the problem of negative transfer remains a major concern when utilizing graph pre-trained models to downstream tasks. Previous studies made great efforts on the issue of what to pre-train and how to pre-train by designing a variety of graph pre-training and fine-tuning strategies. However, there are cases where even the most advanced "pre-train and fine-tune" paradigms fail to yield distinct benefits. This paper introduces a generic framework W2PGNN to answer the crucial question of when to pre-train (i.e., in what situations could we take advantage of graph pre-training) before performing effortful pre-training or fine-tuning. We start from a new perspective to explore the complex generative mechanisms from the pre-training data to downstream data. In particular, W2PGNN first fits the pre-training data into graphon bases, each element of graphon basis (i.e., a graphon) identifies a fundamental transferable pattern shared by a collection of pre-training graphs. All convex combinations of graphon bases give rise to a generator space, from which graphs generated form the solution space for those downstream data that can benefit from pre-training. In this manner, the feasibility of pre-training can be quantified as the generation probability of the downstream data from any generator in the generator space. W2PGNN offers three broad applications: providing the application scope of graph pre-trained models, quantifying the feasibility of pre-training, and assistance in selecting pre-training data to enhance downstream performance. We provide a theoretically sound solution for the first application and extensive empirical justifications for the latter two applications.

artificial intelligence, graph, machine learning, (17 more...)

2303.16458

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > California > Los Angeles County > Long Beach (0.05)
Europe (0.05)
(3 more...)

Genre: Research Report (0.82)

Industry: Health & Medicine > Therapeutic Area (0.31)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Scalable Set Encoding with Universal Mini-Batch Consistency and Unbiased Full Set Gradient Approximation

Willette, Jeffrey, Lee, Seanie, Andreis, Bruno, Kawaguchi, Kenji, Lee, Juho, Hwang, Sung Ju

Recent work on mini-batch consistency (MBC) for set functions has brought attention to the need for sequentially processing and aggregating chunks of a partitioned set while guaranteeing the same output for all partitions. However, existing constraints on MBC architectures lead to models with limited expressive power. Additionally, prior work has not addressed how to deal with large sets during training when the full set gradient is required. To address these issues, we propose a Universally MBC (UMBC) class of set functions which can be used in conjunction with arbitrary non-MBC components while still satisfying MBC, enabling a wider range of function classes to be used in MBC settings. Furthermore, we propose an efficient MBC training algorithm which gives an unbiased approximation of the full set gradient and has a constant memory overhead for any set size for both train- and test-time. We conduct extensive experiments including image completion, text classification, unsupervised clustering, and cancer detection on high-resolution images to verify the efficiency and efficacy of our scalable set encoding framework. Our code is available at github.com/jeffwillette/umbc

machine learning, natural language, universal mini-batch consistency, (12 more...)

2208.12401

Country:

North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
Asia > Singapore (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.63)

Industry: Health & Medicine > Diagnostic Medicine (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Bengali, Vedangi, Veldt, Nate

Faster Approximation Algorithms for Parameterized Graph Clustering and Edge Labeling

Graph clustering is a fundamental task in network analysis where the goal is to detect sets of nodes that are well-connected to each other but sparsely connected to the rest of the graph. We present faster approximation algorithms for an NP-hard parameterized clustering framework called LambdaCC, which is governed by a tunable resolution parameter and generalizes many other clustering objectives such as modularity, sparsest cut, and cluster deletion. Previous LambdaCC algorithms are either heuristics with no approximation guarantees, or computationally expensive approximation algorithms. We provide fast new approximation algorithms that can be made purely combinatorial. These rely on a new parameterized edge labeling problem we introduce that generalizes previous edge labeling problems that are based on the principle of strong triadic closure and are of independent interest in social network analysis. Our methods are orders of magnitude more scalable than previous approximation algorithms and our lower bounds allow us to obtain a posteriori approximation guarantees for previous heuristics that have no approximation guarantees of their own.

artificial intelligence, machine learning, social media, (13 more...)

2306.04884

Country:

North America > United States > Texas > Brazos County > College Station (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > Afghanistan > Parwan Province > Charikar (0.04)
(7 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Song, Yuxuan, Wang, Yongyu

Towards High-Performance Exploratory Data Analysis (EDA) Via Stable Equilibrium Point

Exploratory data analysis (EDA) is a vital procedure for data science projects. In this work, we introduce a stable equilibrium point (SEP) - based framework for improving the efficiency and solution quality of EDA. By exploiting the SEPs to be the representative points, our approach aims to generate high-quality clustering and data visualization for large-scale data sets. A very unique property of the proposed method is that the SEPs will directly encode the clustering properties of data sets. Compared with prior state-of-the-art clustering and data visualization methods, the proposed methods allow substantially improving computing efficiency and solution quality for large-scale data analysis tasks.

data mining, machine learning, spectral, (15 more...)

2306.04425

Country:

North America > United States > Michigan (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.50)

Industry: Government > Regional Government > North America Government > United States Government (0.30)

Technology:

Information Technology > Data Science > Data Mining (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.31)

Efficient Recruitment Strategy for Collaborative Mobile Crowd Sensing Based on GCN Trustworthiness Prediction

Zhan, Zhongwei, Wang, Yingjie, Duan, Peiyong, Sai, Akshita Maradapu Vera Venkata, Liu, Zhaowei, Xiang, Chaocan, Tong, Xiangrong, Wang, Weilong, Cai, Zhipeng

Collaborative Mobile Crowd Sensing (CMCS) enhances data quality and coverage by promoting teamwork in task sensing, with worker recruitment representing a complex multi-objective optimization problem. Existing strategies mainly focus on the characteristics of workers themselves, neglecting the asymmetric trust relationships between them, which affects the rationality of task utility evaluation. To address this, this paper first employs the Mini-Batch K-Means clustering algorithm and deploys edge servers to enable efficient distributed worker recruitment. Historical data and task requirements are utilized to obtain workers' ability types and distances. A trust-directed graph in the worker's social network is input into the Graph Convolutional Network (GCN) framework for training, capturing asymmetric trustworthiness between worker pairs. Privacy leakage is prevented in CMCS scenarios through high trust values between workers. Ultimately, an undirected recruitment graph is constructed using workers' abilities, trust values, and distance weights, transforming the worker recruitment problem into a Maximum Weight Average Subgraph Problem (MWASP). A Tabu Search Recruitment (TSR) algorithm is proposed to rationally recruit a balanced multi-objective optimal task utility worker set for each task. Extensive simulation experiments on four real-world datasets demonstrate the effectiveness of the proposed strategy, outperforming other strategies.

artificial intelligence, data mining, machine learning, (17 more...)

2306.04366

Country:

Asia > China > Shandong Province > Yantai (0.06)
Asia > China > Chongqing Province > Chongqing (0.04)
North America > United States > Georgia > Fulton County > Atlanta (0.04)
Asia > Nepal (0.04)

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (0.93)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.54)

Yousefi, Shahin, Hooshmand, Mohsen, Afsharchi, Mohsen

Early Discovery of Emerging Entities in Persian Twitter with Semantic Similarity

Discovering emerging entities (EEs) is the problem of finding entities before their establishment. These entities can be critical for individuals, companies, and governments. Many of these entities can be discovered on social media platforms, e.g. Twitter. These identities have been the spot of research in academia and industry in recent years. Similar to any machine learning problem, data availability is one of the major challenges in this problem. This paper proposes EEPT. That is an online clustering method able to discover EEs without any need for training on a dataset. Additionally, due to the lack of a proper evaluation metric, this paper uses a new metric to evaluate the results. The results show that EEPT is promising and finds significant entities before their establishment.

artificial intelligence, machine learning, natural language, (18 more...)

doi: 10.1109/ICWR57742.2023.10139028

2207.02434

Country:

Asia > Middle East > Iran > Zanjan Province > Zanjan (0.05)
Europe > Spain > Valencian Community > Valencia Province > Valencia (0.04)
Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)