Clustering
New bootstrap tests for categorical time series. A comparative study
López-Oriona, Ángel, Fernández, José Antonio Vilar, D'Urso, Pierpaolo
The problem of testing the equality of the generating processes of two categorical time series is addressed in this work. To this aim, we propose three tests relying on a dissimilarity measure between categorical processes. Particular versions of these tests are constructed by considering three specific distances evaluating discrepancy between the marginal distributions and the serial dependence patterns of both processes. Proper estimates of these dissimilarities are an essential element of the constructed tests, which are based on the bootstrap. Specifically, a parametric bootstrap method assuming the true generating models and extensions of the moving blocks bootstrap and the stationary bootstrap are considered. The approaches are assessed in a broad simulation study including several types of categorical models with different degrees of complexity. Advantages and disadvantages of each one of the methods are properly discussed according to their behavior under the null and the alternative hypothesis. The impact that some important input parameters have on the results of the tests is also analyzed. An application involving biological sequences highlights the usefulness of the proposed techniques.
A supervised active learning method for identifying critical nodes in Wireless Sensor Network
Ojaghi, Behnam, Dehshibi, Mohammad Mahdi
Energy Efficiency of a wireless sensor network (WSN) relies on its main characteristics, including hop-number, user's location, allocated power, and relay. Identifying nodes, which have more impact on these characteristics, is, however, subject to a substantial computational overhead and energy consumption. In this paper, we proposed an active learning approach to address the computational overhead of identifying critical nodes in a WSN. The proposed approach can overcome biasing in identifying non-critical nodes and needs much less effort in fine-tuning to adapt to the dynamic nature of WSN. This method benefits from the cooperation of clustering and classification modules to iteratively decrease the required number of data in a typical supervised learning scenario and to increase the accuracy in the presence of uninformative examples, i.e., non-critical nodes. Experiments show that the proposed method has more flexibility, compared to the state-of-the-art, to be employed in large scale WSN environments, the fifth-generation mobile networks (5G), and massively distributed IoT (i.e., sensor networks), where it can prolong the network lifetime.
Cluster Flow: how a hierarchical clustering layer make allows deep-NNs more resilient to hacking, more human-like and easily implements relational reasoning
Despite the huge recent breakthroughs in neural networks (NNs) for artificial intelligence (specifically deep convolutional networks) such NNs do not achieve human-level performance: they can be hacked by images that would fool no human and lack `common sense'. It has been argued that a basis of human-level intelligence is mankind's ability to perform relational reasoning: the comparison of different objects, measuring similarity, grasping of relations between objects and the converse, figuring out the odd one out in a set of objects. Mankind can even do this with objects they have never seen before. Here we show how ClusterFlow, a semi-supervised hierarchical clustering framework can operate on trained NNs utilising the rich multi-dimensional class and feature data found at the pre-SoftMax layer to build a hyperspacial map of classes/features and this adds more human-like functionality to modern deep convolutional neural networks. We demonstrate this with 3 tasks. 1. the statistical learning based `mistakes' made by infants when attending to images of cats and dogs. 2. improving both the resilience to hacking images and the accurate measure of certainty in deep-NNs. 3. Relational reasoning over sets of images, including those not known to the NN nor seen before. We also demonstrate that ClusterFlow can work on non-NN data and deal with missing data by testing it on a Chemistry dataset. This work suggests that modern deep NNs can be made more human-like without re-training of the NNs. As it is known that some methods used in deep and convolutional NNs are not biologically plausible or perhaps even the best approach: the ClusterFlow framework can sit on top of any NN and will be a useful tool to add as NNs are improved in this regard.
Semantic Frame Induction with Deep Metric Learning
Yamada, Kosuke, Sasano, Ryohei, Takeda, Koichi
Recent studies have demonstrated the usefulness of contextualized word embeddings in unsupervised semantic frame induction. However, they have also revealed that generic contextualized embeddings are not always consistent with human intuitions about semantic frames, which causes unsatisfactory performance for frame induction based on contextualized embeddings. In this paper, we address supervised semantic frame induction, which assumes the existence of frame-annotated data for a subset of predicates in a corpus and aims to build a frame induction model that leverages the annotated data. We propose a model that uses deep metric learning to fine-tune a contextualized embedding model, and we apply the fine-tuned contextualized embeddings to perform semantic frame induction. Our experiments on FrameNet show that fine-tuning with deep metric learning considerably improves the clustering evaluation scores, namely, the B-cubed F-score and Purity F-score, by about 8 points or more. We also demonstrate that our approach is effective even when the number of training instances is small.
Total Variation Graph Neural Networks
Hansen, Jonas Berg, Bianchi, Filippo Maria
Recently proposed Graph Neural Networks (GNNs) for vertex clustering are trained with an unsupervised minimum cut objective, approximated by a Spectral Clustering (SC) relaxation. However, the SC relaxation is loose and, while it offers a closed-form solution, it also yields overly smooth cluster assignments that poorly separate the vertices. In this paper, we propose a GNN model that computes cluster assignments by optimizing a tighter relaxation of the minimum cut based on graph total variation (GTV). The cluster assignments can be used directly to perform vertex clustering or to implement graph pooling in a graph classification framework. Our model consists of two core components: i) a message-passing layer that minimizes the $\ell_1$ distance in the features of adjacent vertices, which is key to achieving sharp transitions between clusters; ii) an unsupervised loss function that minimizes the GTV of the cluster assignments while ensuring balanced partitions. Experimental results show that our model outperforms other GNNs for vertex clustering and graph classification.
Extreme Classification for Answer Type Prediction in Question Answering
Semantic answer type prediction (SMART) is known to be a useful step towards effective question answering (QA) systems. The SMART task involves predicting the top-$k$ knowledge graph (KG) types for a given natural language question. This is challenging due to the large number of types in KGs. In this paper, we propose use of extreme multi-label classification using Transformer models (XBERT) by clustering KG types using structural and semantic features based on question text. We specifically improve the clustering stage of the XBERT pipeline using textual and structural features derived from KGs. We show that these features can improve end-to-end performance for the SMART task, and yield state-of-the-art results.
highway2vec -- representing OpenStreetMap microregions with respect to their road network characteristics
Leśniara, Kacper, Szymański, Piotr
Recent years brought advancements in using neural networks for representation learning of various language or visual phenomena. New methods freed data scientists from hand-crafting features for common tasks. Similarly, problems that require considering the spatial variable can benefit from pretrained map region representations instead of manually creating feature tables that one needs to prepare to solve a task. However, very few methods for map area representation exist, especially with respect to road network characteristics. In this paper, we propose a method for generating microregions' embeddings with respect to their road infrastructure characteristics. We base our representations on OpenStreetMap road networks in a selection of cities and use the H3 spatial index to allow reproducible and scalable representation learning. We obtained vector representations that detect how similar map hexagons are in the road networks they contain. Additionally, we observe that embeddings yield a latent space with meaningful arithmetic operations. Finally, clustering methods allowed us to draft a high-level typology of obtained representations. We are confident that this contribution will aid data scientists working on infrastructure-related prediction tasks with spatial variables.
Unsupervised classification of fully kinetic simulations of plasmoid instability using Self-Organizing Maps (SOMs)
Köhne, Sophia, Boella, Elisabetta, Innocenti, Maria Elena
The growing amount of data produced by simulations and observations of space physics processes encourages the use of methods rooted in Machine Learning for data analysis and physical discovery. We apply a clustering method based on Self-Organizing Maps (SOM) to fully kinetic simulations of plasmoid instability, with the aim of assessing its suitability as a reliable analysis tool for both simulated and observed data. We obtain clusters that map well, a posteriori, to our knowledge of the process: the clusters clearly identify the inflow region, the inner plasmoid region, the separatrices, and regions associated with plasmoid merging. SOM-specific analysis tools, such as feature maps and Unified Distance Matrix, provide one with valuable insights into both the physics at work and specific spatial regions of interest. The method appears as a promising option for the analysis of data, both from simulations and from observations, and could also potentially be used to trigger the switch to different simulation models or resolution in coupled codes for space simulations.
Improving the Utility of Differentially Private Clustering through Dynamical Processing
Byun, Junyoung, Choi, Yujin, Lee, Jaewook
This study aims to alleviate the trade-off between utility and privacy in the task of differentially private clustering. Existing works focus on simple clustering methods, which show poor clustering performance for non-convex clusters. By utilizing Morse theory, we hierarchically connect the Gaussian sub-clusters to fit complex cluster distributions. Because differentially private sub-clusters are obtained through the existing methods, the proposed method causes little or no additional privacy loss. We provide a theoretical background that implies that the proposed method is inductive and can achieve any desired number of clusters. Experiments on various datasets show that our framework achieves better clustering performance at the same privacy level, compared to the existing methods.
A Review and Evaluation of Elastic Distance Functions for Time Series Clustering
Holder, Chris, Middlehurst, Matthew, Bagnall, Anthony
Time series clustering is the act of grouping time series data without recourse to a label. Algorithms that cluster time series can be classified into two groups: those that employ a time series specific distance measure; and those that derive features from time series. Both approaches usually rely on traditional clustering algorithms such as $k$-means. Our focus is on distance based time series that employ elastic distance measures, i.e. distances that perform some kind of realignment whilst measuring distance. We describe nine commonly used elastic distance measures and compare their performance with k-means and k-medoids clustering. Our findings are surprising. The most popular technique, dynamic time warping (DTW), performs worse than Euclidean distance with k-means, and even when tuned, is no better. Using k-medoids rather than k-means improved the clusterings for all nine distance measures. DTW is not significantly better than Euclidean distance with k-medoids. Generally, distance measures that employ editing in conjunction with warping perform better, and one distance measure, the move-split-merge (MSM) method, is the best performing measure of this study. We also compare to clustering with DTW using barycentre averaging (DBA). We find that DBA does improve DTW k-means, but that the standard DBA is still worse than using MSM. Our conclusion is to recommend MSM with k-medoids as the benchmark algorithm for clustering time series with elastic distance measures. We provide implementations in the aeon toolkit, results and guidance on reproducing results on the associated GitHub repository.