Clustering
Rethinking cluster-conditioned diffusion models
Adaloglou, Nikolas, Kaiser, Tim, Michels, Felix, Kollmann, Markus
We present a comprehensive experimental study on image-level conditioning for diffusion models using cluster assignments. We elucidate how individual components regarding image clustering impact image synthesis across three datasets. By combining recent advancements from image clustering and diffusion models, we show that, given the optimal cluster granularity with respect to image synthesis (visual groups), cluster-conditioning can achieve state-of-the-art FID (i.e. 1.67, 2.17 on CIFAR10 and CIFAR100 respectively), while attaining a strong training sample efficiency. Finally, we propose a novel method to derive an upper cluster bound that reduces the search space of the visual groups using solely feature-based clustering. Unlike existing approaches, we find no significant connection between clustering and cluster-conditional image generation. The code and cluster assignments will be released.
Attacks Against Mobility Prediction in 5G Networks
Atiiq, Syafiq Al, Yuan, Yachao, Gehrmann, Christian, Sternby, Jakob, Barriga, Luis
The $5^{th}$ generation of mobile networks introduces a new Network Function (NF) that was not present in previous generations, namely the Network Data Analytics Function (NWDAF). Its primary objective is to provide advanced analytics services to various entities within the network and also towards external application services in the 5G ecosystem. One of the key use cases of NWDAF is mobility trajectory prediction, which aims to accurately support efficient mobility management of User Equipment (UE) in the network by allocating ``just in time'' necessary network resources. In this paper, we show that there are potential mobility attacks that can compromise the accuracy of these predictions. In a semi-realistic scenario with 10,000 subscribers, we demonstrate that an adversary equipped with the ability to hijack cellular mobile devices and clone them can significantly reduce the prediction accuracy from 75\% to 40\% using just 100 adversarial UEs. While a defense mechanism largely depends on the attack and the mobility types in a particular area, we prove that a basic KMeans clustering is effective in distinguishing legitimate and adversarial UEs.
Graph Construction with Flexible Nodes for Traffic Demand Prediction
Hou, Jinyan, Liu, Shan, Zhang, Ya, Qin, Haotong
Graph neural networks (GNNs) have been widely applied in traffic demand prediction, and transportation modes can be divided into station-based mode and free-floating traffic mode. Existing research in traffic graph construction primarily relies on map matching to construct graphs based on the road network. However, the complexity and inhomogeneity of data distribution in free-floating traffic demand forecasting make road network matching inflexible. To tackle these challenges, this paper introduces a novel graph construction method tailored to free-floating traffic mode. We propose a novel density-based clustering algorithm (HDPC-L) to determine the flexible positioning of nodes in the graph, overcoming the computational bottlenecks of traditional clustering algorithms and enabling effective handling of large-scale datasets. Furthermore, we extract valuable information from ridership data to initialize the edge weights of GNNs. Comprehensive experiments on two real-world datasets, the Shenzhen bike-sharing dataset and the Haikou ride-hailing dataset, show that the method significantly improves the performance of the model. On average, our models show an improvement in accuracy of around 25\% and 19.5\% on the two datasets. Additionally, it significantly enhances computational efficiency, reducing training time by approximately 12% and 32.5% on the two datasets. We make our code available at https://github.com/houjinyan/HDPC-L-ODInit.
Impact of network topology on the performance of Decentralized Federated Learning
Palmieri, Luigi, Boldrini, Chiara, Valerio, Lorenzo, Passarella, Andrea, Conti, Marco
Fully decentralized learning is gaining momentum for training AI models at the Internet's edge, addressing infrastructure challenges and privacy concerns. In a decentralized machine learning system, data is distributed across multiple nodes, with each node training a local model based on its respective dataset. The local models are then shared and combined to form a global model capable of making accurate predictions on new data. Our exploration focuses on how different types of network structures influence the spreading of knowledge - the process by which nodes incorporate insights gained from learning patterns in data available on other nodes across the network. Specifically, this study investigates the intricate interplay between network structure and learning performance using three network topologies and six data distribution methods. These methods consider different vertex properties, including degree centrality, betweenness centrality, and clustering coefficient, along with whether nodes exhibit high or low values of these metrics. Our findings underscore the significance of global centrality metrics (degree, betweenness) in correlating with learning performance, while local clustering proves less predictive. We highlight the challenges in transferring knowledge from peripheral to central nodes, attributed to a dilution effect during model aggregation. Additionally, we observe that central nodes exert a pull effect, facilitating the spread of knowledge. In examining degree distribution, hubs in Barabasi-Albert networks positively impact learning for central nodes but exacerbate dilution when knowledge originates from peripheral nodes. Finally, we demonstrate the formidable challenge of knowledge circulation outside of segregated communities.
Spot the bot: Coarse-Grained Partition of Semantic Paths for Bots and Humans
Gromov, Vasilii A., Kogan, Alexandra S.
Nowadays, technology is rapidly advancing: bots are writing comments, articles, and reviews. Due to this fact, it is crucial to know if the text was written by a human or by a bot. This paper focuses on comparing structures of the coarse-grained partitions of semantic paths for human-written and bot-generated texts. We compare the clusterizations of datasets of n-grams from literary texts and texts generated by several bots. The hypothesis is that the structures and clusterizations are different. Our research supports the hypothesis. As the semantic structure may be different for different languages, we investigate Russian, English, German, and Vietnamese languages.
SKILL: Similarity-aware Knowledge distILLation for Speech Self-Supervised Learning
Zampierin, Luca, Hacene, Ghouthi Boukli, Nguyen, Bac, Ravanelli, Mirco
Self-supervised learning (SSL) has achieved remarkable success across various speech-processing tasks. To enhance its efficiency, previous works often leverage the use of compression techniques. A notable recent attempt is DPHuBERT, which applies joint knowledge distillation (KD) and structured pruning to learn a significantly smaller SSL model. In this paper, we contribute to this research domain by introducing SKILL, a novel method that conducts distillation across groups of layers instead of distilling individual arbitrarily selected layers within the teacher network. The identification of the layers to distill is achieved through a hierarchical clustering procedure applied to layer similarity measures. Extensive experiments demonstrate that our distilled version of WavLM Base+ not only outperforms DPHuBERT but also achieves state-of-the-art results in the 30M parameters model class across several SUPERB tasks.
Self Supervised Correlation-based Permutations for Multi-View Clustering
Eisenberg, Ran, Svirsky, Jonathan, Lindenbaum, Ofir
Fusing information from different modalities can enhance data analysis tasks, including clustering. However, existing multi-view clustering (MVC) solutions are limited to specific domains or rely on a suboptimal and computationally demanding two-stage procedure of representation and clustering. We propose an end-to-end deep learning-based MVC framework for general data (image, tabular, etc.). Our approach involves learning meaningful fused data representations with a novel permutation-based canonical correlation objective. Concurrently, we learn cluster assignments by identifying consistent pseudo-labels across multiple views. We demonstrate the effectiveness of our model using ten MVC benchmark datasets. Theoretically, we show that our model approximates the supervised linear discrimination analysis (LDA) representation. Additionally, we provide an error bound induced by false-pseudo label annotations.
Deep Contrastive Graph Learning with Clustering-Oriented Guidance
Chen, Mulin, Wang, Bocheng, Li, Xuelong
Graph Convolutional Network (GCN) has exhibited remarkable potential in improving graph-based clustering. To handle the general clustering scenario without a prior graph, these models estimate an initial graph beforehand to apply GCN. Throughout the literature, we have witnessed that 1) most models focus on the initial graph while neglecting the original features. Therefore, the discriminability of the learned representation may be corrupted by a low-quality initial graph; 2) the training procedure lacks effective clustering guidance, which may lead to the incorporation of clustering-irrelevant information into the learned graph. To tackle these problems, the Deep Contrastive Graph Learning (DCGL) model is proposed for general data clustering. Specifically, we establish a pseudo-siamese network, which incorporates auto-encoder with GCN to emphasize both the graph structure and the original features. On this basis, feature-level contrastive learning is introduced to enhance the discriminative capacity, and the relationship between samples and centroids is employed as the clustering-oriented guidance. Afterward, a two-branch graph learning mechanism is designed to extract the local and global structural relationships, which are further embedded into a unified graph under the cluster-level contrastive guidance. Experimental results on several benchmark datasets demonstrate the superiority of DCGL against state-of-the-art algorithms.
From Large to Small Datasets: Size Generalization for Clustering Algorithm Selection
Chatziafratis, Vaggos, Karmarkar, Ishani, Vitercik, Ellen
In clustering algorithm selection, we are given a massive dataset and must efficiently select which clustering algorithm to use. We study this problem in a semi-supervised setting, with an unknown ground-truth clustering that we can only access through expensive oracle queries. Ideally, the clustering algorithm's output will be structurally close to the ground truth. We approach this problem by introducing a notion of size generalization for clustering algorithm accuracy. We identify conditions under which we can (1) subsample the massive clustering instance, (2) evaluate a set of candidate algorithms on the smaller instance, and (3) guarantee that the algorithm with the best accuracy on the small instance will have the best accuracy on the original big instance. We provide theoretical size generalization guarantees for three classic clustering algorithms: single-linkage, k-means++, and (a smoothed variant of) Gonzalez's k-centers heuristic. We validate our theoretical analysis with empirical results, observing that on real-world clustering instances, we can use a subsample of as little as 5% of the data to identify which algorithm is best on the full dataset.
Convergence Analysis of Blurring Mean Shift
Yamasaki, Ryoya, Tanaka, Toshiyuki
Blurring mean shift (BMS) algorithm, a variant of the mean shift algorithm, is a kernel-based iterative method for data clustering, where data points are clustered according to their convergent points via iterative blurring. In this paper, we analyze convergence properties of the BMS algorithm by leveraging its interpretation as an optimization procedure, which is known but has been underutilized in existing convergence studies. Whereas existing results on convergence properties applicable to multi-dimensional data only cover the case where all the blurred data point sequences converge to a single point, this study provides a convergence guarantee even when those sequences can converge to multiple points, yielding multiple clusters. This study also shows that the convergence of the BMS algorithm is fast by further leveraging geometrical characterization of the convergent points.