Goto

Collaborating Authors

 Clustering


Detecting Suspicious Commenter Mob Behaviors on YouTube Using Graph2Vec

arXiv.org Artificial Intelligence

YouTube, a widely popular online platform, has transformed the dynamics of con-tent consumption and interaction for users worldwide. With its extensive range of content crea-tors and viewers, YouTube serves as a hub for video sharing, entertainment, and information dissemination. However, the exponential growth of users and their active engagement on the platform has raised concerns regarding suspicious commenter behaviors, particularly in the com-ment section. This paper presents a social network analysis-based methodology for detecting suspicious commenter mob-like behaviors among YouTube channels and the similarities therein. The method aims to characterize channels based on the level of such behavior and identify com-mon patterns across them. To evaluate the effectiveness of the proposed model, we conducted an analysis of 20 YouTube channels, consisting of 7,782 videos, 294,199 commenters, and 596,982 comments. These channels were specifically selected for propagating false views about the U.S. Military. The analysis revealed significant similarities among the channels, shedding light on the prevalence of suspicious commenter behavior. By understanding these similarities, we contribute to a better understanding of the dynamics of suspicious behavior on YouTube channels, which can inform strategies for addressing and mitigating such behavior.


Reliable and Efficient Data Collection in UAV-based IoT Networks

arXiv.org Artificial Intelligence

Internet of Things (IoT) involves sensors for monitoring and wireless networks for efficient communication. However, resource-constrained IoT devices and limitations in existing wireless technologies hinder its full potential. Integrating Unmanned Aerial Vehicles (UAVs) into IoT networks can address some challenges by expanding its' coverage, providing security, and bringing computing closer to IoT devices. Nevertheless, effective data collection in UAV-assisted IoT networks is hampered by factors, including dynamic UAV behavior, environmental variables, connectivity instability, and security considerations. In this survey, we first explore UAV-based IoT networks, focusing on communication and networking aspects. Next, we cover various UAV-based data collection methods their advantages and disadvantages, followed by a discussion on performance metrics for data collection. As this article primarily emphasizes reliable and efficient data collection in UAV-assisted IoT networks, we briefly discuss existing research on data accuracy and consistency, network connectivity, and data security and privacy to provide insights into reliable data collection. Additionally, we discuss efficient data collection strategies in UAV-based IoT networks, covering trajectory and path planning, collision avoidance, sensor network clustering, data aggregation, UAV swarm formations, and artificial intelligence for optimization. We also present two use cases of UAVs as a service for enhancing data collection reliability and efficiency. Finally, we discuss future challenges in data collection for UAV-assisted IoT networks.


Wasserstein Auto-Encoders of Merge Trees (and Persistence Diagrams)

arXiv.org Artificial Intelligence

This paper presents a computational framework for the Wasserstein auto-encoding of merge trees (MT-WAE), a novel extension of the classical auto-encoder neural network architecture to the Wasserstein metric space of merge trees. In contrast to traditional auto-encoders which operate on vectorized data, our formulation explicitly manipulates merge trees on their associated metric space at each layer of the network, resulting in superior accuracy and interpretability. Our novel neural network approach can be interpreted as a non-linear generalization of previous linear attempts [79] at merge tree encoding. It also trivially extends to persistence diagrams. Extensive experiments on public ensembles demonstrate the efficiency of our algorithms, with MT-WAE computations in the orders of minutes on average. We show the utility of our contributions in two applications adapted from previous work on merge tree encoding [79]. First, we apply MT-WAE to merge tree compression, by concisely representing them with their coordinates in the final layer of our auto-encoder. Second, we document an application to dimensionality reduction, by exploiting the latent space of our auto-encoder, for the visual analysis of ensemble data. We illustrate the versatility of our framework by introducing two penalty terms, to help preserve in the latent space both the Wasserstein distances between merge trees, as well as their clusters. In both applications, quantitative experiments assess the relevance of our framework. Finally, we provide a C++ implementation that can be used for reproducibility.


Object-Centric Learning with Slot Mixture Module

arXiv.org Artificial Intelligence

Object-centric architectures usually apply a differentiable module to the entire feature map to decompose it into sets of entity representations called slots. Some of these methods structurally resemble clustering algorithms, where the cluster's center in latent space serves as a slot representation. Slot Attention is an example of such a method, acting as a learnable analog of the soft k-means algorithm. Our work employs a learnable clustering method based on the Gaussian Mixture Model. Unlike other approaches, we represent slots not only as centers of clusters but also incorporate information about the distance between clusters and assigned vectors, leading to more expressive slot representations. Our experiments demonstrate that using this approach instead of Slot Attention improves performance in object-centric scenarios, achieving state-of-the-art results in the set property prediction task.


Robust Graph Clustering via Meta Weighting for Noisy Graphs

arXiv.org Artificial Intelligence

How can we find meaningful clusters in a graph robustly against noise edges? Graph clustering (i.e., dividing nodes into groups of similar ones) is a fundamental problem in graph analysis with applications in various fields. Recent studies have demonstrated that graph neural network (GNN) based approaches yield promising results for graph clustering. However, we observe that their performance degenerates significantly on graphs with noise edges, which are prevalent in practice. In this work, we propose MetaGC for robust GNN-based graph clustering. MetaGC employs a decomposable clustering loss function, which can be rephrased as a sum of losses over node pairs. We add a learnable weight to each node pair, and MetaGC adaptively adjusts the weights of node pairs using meta-weighting so that the weights of meaningful node pairs increase and the weights of less-meaningful ones (e.g., noise edges) decrease. We show empirically that MetaGC learns weights as intended and consequently outperforms the state-of-the-art GNN-based competitors, even when they are equipped with separate denoising schemes, on five real-world graphs under varying levels of noise. Our code and datasets are available at https://github.com/HyeonsooJo/MetaGC.


Large Language Models are Fixated by Red Herrings: Exploring Creative Problem Solving and Einstellung Effect using the Only Connect Wall Dataset

arXiv.org Artificial Intelligence

The quest for human imitative AI has been an enduring topic in AI research since its inception. The technical evolution and emerging capabilities of the latest cohort of large language models (LLMs) have reinvigorated the subject beyond academia to the cultural zeitgeist. While recent NLP evaluation benchmark tasks test some aspects of human-imitative behaviour (e.g., BIG-bench's 'human-like behavior' tasks), few, if not none, examine creative problem solving abilities. Creative problem solving in humans is a well-studied topic in cognitive neuroscience with standardized tests that predominantly use the ability to associate (heterogeneous) connections among clue words as a metric for creativity. Exposure to misleading stimuli - distractors dubbed red herrings - impede human performance in such tasks via the fixation effect and Einstellung paradigm. In cognitive neuroscience studies, such fixations are experimentally induced by pre-exposing participants to orthographically similar incorrect words to subsequent word-fragments or clues. The popular British quiz show Only Connect's Connecting Wall segment essentially mimics Mednick's Remote Associates Test (RAT) formulation with built-in, deliberate red herrings, which makes it an ideal proxy dataset to explore and study fixation effect and Einstellung paradigm from cognitive neuroscience in LLMs. In this paper we present the novel Only Connect Wall (OCW) dataset and report results from our evaluation of selected pre-trained language models and LLMs on creative problem solving tasks like grouping clue words by heterogeneous connections, and identifying correct open knowledge domain connections in respective groups. We synthetically generate two additional datasets: OCW-Randomized, OCW-WordNet to further analyze our red-herrings hypothesis in language models. The code and link to the dataset are available at https://github.com/TaatiTeam/OCW.


Differentiable Random Partition Models

arXiv.org Artificial Intelligence

Partitioning a set of elements into an unknown number of mutually exclusive subsets is essential in many machine learning problems. However, assigning elements, such as samples in a dataset or neurons in a network layer, to an unknown and discrete number of subsets is inherently non-differentiable, prohibiting end-to-end gradient-based optimization of parameters. We overcome this limitation by proposing a novel two-step method for inferring partitions, which allows its usage in variational inference tasks. This new approach enables reparameterized gradients with respect to the parameters of the new random partition model. Our method works by inferring the number of elements per subset and, second, by filling these subsets in a learned order. We highlight the versatility of our general-purpose approach on three different challenging experiments: variational clustering, inference of shared and independent generative factors under weak supervision, and multitask learning.


Hierarchical clustering with dot products recovers hidden tree structure

arXiv.org Machine Learning

In this paper we offer a new perspective on the well established agglomerative clustering algorithm, focusing on recovery of hierarchical structure. We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance. We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model. The key technical innovations are to understand how hierarchical information in this model translates into tree geometry which can be recovered from data, and to characterise the benefits of simultaneously growing sample size and data dimension. We demonstrate superior tree recovery performance with real data over existing approaches such as UPGMA, Ward's method, and HDBSCAN.


Unveiling Safety Vulnerabilities of Large Language Models

arXiv.org Artificial Intelligence

As large language models become more prevalent, their possible harmful or inappropriate responses are a cause for concern. This paper introduces a unique dataset containing adversarial examples in the form of questions, which we call AttaQ, designed to provoke such harmful or inappropriate responses. We assess the efficacy of our dataset by analyzing the vulnerabilities of various models when subjected to it. Additionally, we introduce a novel automatic approach for identifying and naming vulnerable semantic regions - input semantic areas for which the model is likely to produce harmful outputs. This is achieved through the application of specialized clustering techniques that consider both the semantic similarity of the input attacks and the harmfulness of the model's responses. Automatically identifying vulnerable semantic regions enhances the evaluation of model weaknesses, facilitating targeted improvements to its safety mechanisms and overall reliability.


Kernel-based Joint Multiple Graph Learning and Clustering of Graph Signals

arXiv.org Artificial Intelligence

Within the context of Graph Signal Processing (GSP), Graph Learning (GL) is concerned with the inference of the graph's underlying structure from nodal observations. However, real-world data often contains diverse information, necessitating the simultaneous clustering and learning of multiple graphs. In practical applications, valuable node-specific covariates, represented as kernels, have been underutilized by existing graph signal clustering methods. In this letter, we propose a new framework, named Kernel-based joint Multiple GL and clustering of graph signals (KMGL), that leverages a multi-convex optimization approach. This allows us to integrate node-side information, construct low-pass filters, and efficiently solve the optimization problem. The experiments demonstrate that KMGL significantly enhances the robustness of GL and clustering, particularly in scenarios with high noise levels and a substantial number of clusters. These findings underscore the potential of KMGL for improving the performance of GSP methods in diverse, real-world applications.