Goto

Collaborating Authors

 Clustering


Machine Unlearning of Federated Clusters

arXiv.org Artificial Intelligence

Federated clustering (FC) is an unsupervised learning problem that arises in a number of practical applications, including personalized recommender and healthcare systems. With the adoption of recent laws ensuring the "right to be forgotten", the problem of machine unlearning for FC methods has become of significant importance. We introduce, for the first time, the problem of machine unlearning for FC, and propose an efficient unlearning mechanism for a customized secure FC framework. Our FC framework utilizes special initialization procedures that we show are well-suited for unlearning. To protect client data privacy, we develop the secure compressed multiset aggregation (SCMA) framework that addresses sparse secure federated learning (FL) problems encountered during clustering as well as more general problems. To simultaneously facilitate low communication complexity and secret sharing protocols, we integrate Reed-Solomon encoding with special evaluation points into our SCMA pipeline, and prove that the client communication cost is logarithmic in the vector dimension. Additionally, to demonstrate the benefits of our unlearning mechanism over complete retraining, we provide a theoretical analysis for the unlearning performance of our approach. Simulation results show that the new FC framework exhibits superior clustering performance compared to previously reported FC baselines when the cluster sizes are highly imbalanced. Compared to completely retraining K-means++ locally and globally for each removal request, our unlearning procedure offers an average speed-up of roughly 84x across seven datasets. Our implementation for the proposed method is available at https://github.com/thupchnsky/mufc. The availability of large volumes of user training data has contributed to the success of modern machine learning models. For example, most state-of-the-art computer vision models are trained on large-scale image datasets including Flickr (Thomee et al., 2016) and ImageNet (Deng et al., 2009).


Multi-Dialectal Representation Learning of Sinitic Phonology

arXiv.org Artificial Intelligence

Machine learning techniques have shown their competence for representing and reasoning in symbolic systems such as language and phonology. In Sinitic Historical Phonology, notable tasks that could benefit from machine learning include the comparison of dialects and reconstruction of proto-languages systems. Motivated by this, this paper provides an approach for obtaining multi-dialectal representations of Sinitic syllables, by constructing a knowledge graph from structured phonological data, then applying the BoxE technique from knowledge base learning. We applied unsupervised clustering techniques to the obtained representations to observe that the representations capture phonemic contrast from the input dialects. Furthermore, we trained classifiers to perform inference of unobserved Middle Chinese labels, showing the representations' potential for indicating archaic, proto-language features. The representations can be used for performing completion of fragmented Sinitic phonological knowledge bases, estimating divergences between different characters, or aiding the exploration and reconstruction of archaic features.


Are Neurons Actually Collapsed? On the Fine-Grained Structure in Neural Representations

arXiv.org Artificial Intelligence

Recent work has observed an intriguing ''Neural Collapse'' phenomenon in well-trained neural networks, where the last-layer representations of training samples with the same label collapse into each other. This appears to suggest that the last-layer representations are completely determined by the labels, and do not depend on the intrinsic structure of input distribution. We provide evidence that this is not a complete description, and that the apparent collapse hides important fine-grained structure in the representations. Specifically, even when representations apparently collapse, the small amount of remaining variation can still faithfully and accurately captures the intrinsic structure of input distribution. As an example, if we train on CIFAR-10 using only 5 coarse-grained labels (by combining two classes into one super-class) until convergence, we can reconstruct the original 10-class labels from the learned representations via unsupervised clustering. The reconstructed labels achieve $93\%$ accuracy on the CIFAR-10 test set, nearly matching the normal CIFAR-10 accuracy for the same architecture. We also provide an initial theoretical result showing the fine-grained representation structure in a simplified synthetic setting. Our results show concretely how the structure of input data can play a significant role in determining the fine-grained structure of neural representations, going beyond what Neural Collapse predicts.


On the Relationship Between RNN Hidden State Vectors and Semantic Ground Truth

arXiv.org Artificial Intelligence

We examine the assumption that the hidden-state vectors of recurrent neural networks (RNNs) tend to form clusters of semantically similar vectors, which we dub the clustering hypothesis. While this hypothesis has been assumed in the analysis of RNNs in recent years, its validity has not been studied thoroughly on modern neural network architectures. We examine the clustering hypothesis in the context of RNNs that were trained to recognize regular languages. This enables us to draw on perfect ground-truth automata in our evaluation, against which we can compare the RNN's accuracy and the distribution of the hidden-state vectors. We start with examining the (piecewise linear) separability of an RNN's hidden-state vectors into semantically different classes. We continue the analysis by computing clusters over the hidden-state vector space with multiple state-of-the-art unsupervised clustering approaches. We formally analyze the accuracy of computed clustering functions and the validity of the clustering hypothesis by determining whether clusters group semantically similar vectors to the same state in the ground-truth model. Our evaluation supports the validity of the clustering hypothesis in the majority of examined cases. We observed that the hidden-state vectors of well-trained RNNs are separable, and that the unsupervised clustering techniques succeed in finding clusters of similar state vectors.


A Restarted Large-Scale Spectral Clustering with Self-Guiding and Block Diagonal Representation

arXiv.org Artificial Intelligence

Spectral clustering is one of the most popular unsupervised machine learning methods. Constructing similarity matrix is crucial to this type of method. In most existing works, the similarity matrix is computed once for all or is updated alternatively. However, the former is difficult to reflect comprehensive relationships among data points, and the latter is time-consuming and is even infeasible for large-scale problems. In this work, we propose a restarted clustering framework with self-guiding and block diagonal representation. An advantage of the strategy is that some useful clustering information obtained from previous cycles could be preserved as much as possible. To the best of our knowledge, this is the first work that applies restarting strategy to spectral clustering. The key difference is that we reclassify the samples in each cycle of our method, while they are classified only once in existing methods. To further release the overhead, we introduce a block diagonal representation with Nystr\"{o}m approximation for constructing the similarity matrix. Theoretical results are established to show the rationality of inexact computations in spectral clustering. Comprehensive experiments are performed on some benchmark databases, which show the superiority of our proposed algorithms over many state-of-the-art algorithms for large-scale problems. Specifically, our framework has a potential boost for clustering algorithms and works well even using an initial guess chosen randomly.


Event Detection from Social Media Stream: Methods, Datasets and Opportunities

arXiv.org Artificial Intelligence

Social media streams contain large and diverse amount of information, ranging from daily-life stories to the latest global and local events and news. Twitter, especially, allows a fast spread of events happening real time, and enables individuals and organizations to stay informed of the events happening now. Event detection from social media data poses different challenges from traditional text and is a research area that has attracted much attention in recent years. In this paper, we survey a wide range of event detection methods for Twitter data stream, helping readers understand the recent development in this area. We present the datasets available to the public. Furthermore, a few research opportunities


Building Trust Profiles in Conditionally Automated Driving

arXiv.org Artificial Intelligence

Trust is crucial for ensuring the safety, security, and widespread adoption of automated vehicles (AVs), and if trust is lacking, drivers and the public may not be willing to use them. This research seeks to investigate trust profiles in order to create personalized experiences for drivers in AVs. This technique helps in better understanding drivers' dynamic trust from a persona's perspective. The study was conducted in a driving simulator where participants were requested to take over control from automated driving in three conditions that included a control condition, a false alarm condition, and a miss condition with eight takeover requests (TORs) in different scenarios. Drivers' dispositional trust, initial learned trust, dynamic trust, personality, and emotions were measured. We identified three trust profiles (i.e., believers, oscillators, and disbelievers) using a K-means clustering model. In order to validate this model, we built a multinomial logistic regression model based on SHAP explainer that selected the most important features to predict the trust profiles with an F1-score of 0.90 and accuracy of 0.89. We also discussed how different individual factors influenced trust profiles which helped us understand trust dynamics better from a persona's perspective. Our findings have important implications for designing a personalized in-vehicle trust monitoring and calibrating system to adjust drivers' trust levels in order to improve safety and experience in automated driving.


cuSLINK: Single-linkage Agglomerative Clustering on the GPU

arXiv.org Artificial Intelligence

In this paper, we propose cuSLINK, a novel and state-of-the-art reformulation of the SLINK algorithm on the GPU which requires only $O(Nk)$ space and uses a parameter $k$ to trade off space and time. We also propose a set of novel and reusable building blocks that compose cuSLINK. These building blocks include highly optimized computational patterns for $k$-NN graph construction, spanning trees, and dendrogram cluster extraction. We show how we used our primitives to implement cuSLINK end-to-end on the GPU, further enabling a wide range of real-world data mining and machine learning applications that were once intractable. In addition to being a primary computational bottleneck in the popular HDBSCAN algorithm, the impact of our end-to-end cuSLINK algorithm spans a large range of important applications, including cluster analysis in social and computer networks, natural language processing, and computer vision. Users can obtain cuSLINK at https://docs.rapids.ai/api/cuml/latest/api/#agglomerative-clustering


Non-parametric online market regime detection and regime clustering for multidimensional and path-dependent data structures

arXiv.org Machine Learning

In this work we present a non-parametric online market regime detection method for multidimensional data structures using a path-wise two-sample test derived from a maximum mean discrepancy-based similarity metric on path space that uses rough path signatures as a feature map. The latter similarity metric has been developed and applied as a discriminator in recent generative models for small data environments, and has been optimised here to the setting where the size of new incoming data is particularly small, for faster reactivity. On the same principles, we also present a path-wise method for regime clustering which extends our previous work [HIM21]. The presented regime clustering techniques, as in [HIM21], were designed as ex-ante market analysis tools that can identify periods of approximatively similar market activity, but the new results also apply to path-wise, high dimensional-, and to non-Markovian settings as well as to data structures that exhibit autocorrelation. We demonstrate our clustering tools on easily verifiable synthetic datasets of increasing complexity, and also show how the outlined regime detection techniques can be used as fast on-line automatic regime change detectors or as outlier detection tools, including a fully automated pipeline. Finally, we apply the fine-tuned algorithms to real-world historical data including high-dimensional baskets of equities and the recent price evolution of crypto assets, and we show that our methodology swiftly and accurately indicated historical periods of market turmoil.


COMPASS: Unsupervised and Online Clustering of Complex Human Activities from Smartphone Sensors

arXiv.org Artificial Intelligence

Modern mobile devices are able to provide context-aware and personalized services to the users, by leveraging on their sensing capabilities to infer the activity and situation in which a person is currently involved. Current solutions for context-recognition rely on annotated data and experts’ knowledge to predict the user context. In addition, their prediction ability is strongly limited to the set of situations considered during the model training or definition. However, in a mobile environment, the user context continuously evolves, and it cannot be merely restricted to a set of predefined classes. To overcome these limitations, we propose COMPASS, a novel unsupervised and online clustering algorithm aimed at identifying the user context in mobile environments based on the stream of high-dimensional data generated by smartphone sensors. COMPASScan distinguish an arbitrary number of user’s contexts from the sensors’ data, without defining a priori the collection of expected situations. This key feature makes it a general-purpose solution to provide context-aware features to mobile devices, supporting a broad set of applications. Experimental results on 18 synthetic and 2 real-world datasets show that COMPASS correctly identifies the user context from the sensors’ data stream, and outperforms the state-of-the-art solutions in terms of both clusters configuration and purity. Eventually, we evaluate its performances in terms of execution time and the results show that COMPASS can process 1000 high-dimensional samples in less than 20 seconds, while the reference solutions require about 60 minutes to evaluate the entire dataset. Keywords: Context-awareness, Unsupervised Machine Learning, Online Clustering, Mobile Computing