Goto

Collaborating Authors

 Clustering


Categorical anomaly detection in heterogeneous data using minimum description length clustering

arXiv.org Artificial Intelligence

Two examples of anomaly detection based on MDL have been been proposed for categorical data based on the minimum description studied and shown to perform well: the OC3 algorithm [21] based length (MDL) principle. However, they can be ineffective when on an itemset mining technique called Krimp [26], and the CompreX detecting anomalies in heterogeneous datasets representing a mixture algorithm [2]. Broadly speaking, both take a similar approach: of different sources, such as security scenarios in which system first, a model H of the data that compresses it well is found using a and user processes have distinct behavior patterns. We propose a heuristic search, balancing the model complexity L(H) (number of meta-algorithm for enhancing any MDL-based anomaly detection bits required to compress the model structure/parameters) against model to deal with heterogeneous data by fitting a mixture model the data complexity L(X H) (number of bits required to compress to the data, via a variant of k-means clustering. Our experimental the data given the model). Once such a model H is found, we assign results show that using a discrete mixture model provides competitive to each object x X a score corresponding to the object's performance relative to two previous anomaly detection compressed size L(x H) given the selected model. Intuitively, if the algorithms, while mixtures of more sophisticated models yield further model accurately characterizes the data as a whole, records that are gains, on both synthetic datasets and realistic datasets from a representative will compress well, yielding a low anomaly score, security scenario.


Consistent Estimation of Identifiable Nonparametric Mixture Models from Grouped Observations

arXiv.org Machine Learning

Recent research has established sufficient conditions for finite mixture models to be identifiable from grouped observations. These conditions allow the mixture components to be nonparametric and have substantial (or even total) overlap. This work proposes an algorithm that consistently estimates any identifiable mixture model from grouped observations. Our analysis leverages an oracle inequality for weighted kernel density estimators of the distribution on groups, together with a general result showing that consistent estimation of the distribution on groups implies consistent estimation of mixture components. A practical implementation is provided for paired observations, and the approach is shown to outperform existing methods, especially when mixture components overlap significantly.


Markov Random Geometric Graph (MRGG): A Growth Model for Temporal Dynamic Networks

arXiv.org Machine Learning

We introduce Markov Random Geometric Graphs (MRGGs), a growth model for temporal dynamic networks. It is based on a Markovian latent space dynamic: consecutive latent points are sampled on the Euclidean Sphere using an unknown Markov kernel; and two nodes are connected with a probability depending on a unknown function of their latent geodesic distance. More precisely, at each stamp-time k we add a latent point X k sampled by jumping from the previous one X k--1 in a direction chosen uniformly Y k and with a length r k drawn from an unknown distribution called the latitude function. The connection probabilities between each pair of nodes are equal to the envelope function of the distance between these two latent points. We provide theoretical guarantees for the non-parametric estimation of the latitude and the envelope functions. We propose an efficient algorithm that achieves those non-parametric estimation tasks based on an ad-hoc Hierarchical Agglomerative Clustering approach, and we deploy this analysis on a real data-set given by exchange of messages on a social network.


A general framework for label-efficient online evaluation with asymptotic guarantees

arXiv.org Machine Learning

Achieving statistically significant evaluation with passive sampling of test data is challenging in settings such as extreme classification and record linkage, where significant class imbalance is prevalent. Adaptive importance sampling focuses labeling on informative regions of the instance space, however it breaks data independence assumptions - commonly required for asymptotic guarantees that assure estimates approximate population performance and provide practical confidence intervals. In this paper we develop an adaptive importance sampling framework for supervised evaluation that defines a sequence of proposal distributions given a user-defined discriminative model of p(y|x) and a generalized performance measure to evaluate. Under verifiable conditions on the model and performance measure, we establish strong consistency and a (martingale) central limit theorem for resulting performance estimates. We instantiate our framework with worked examples given stochastic or deterministic label oracle access. Both examples leverage Dirichlet-tree models for practical online evaluation, with the deterministic case achieving asymptotic optimality. Experiments on seven datasets demonstrate an average mean-squared error superior to state-of-the-art samplers on fixed label budgets.


Scoring and Assessment in Medical VR Training Simulators with Dynamic Time Series Classification

arXiv.org Artificial Intelligence

This research proposes and evaluates scoring and assessment methods for Virtual Reality (VR) training simulators. VR simulators capture detailed n-dimensional human motion data which is useful for performance analysis. Custom made medical haptic VR training simulators were developed and used to record data from 271 trainees of multiple clinical experience levels. DTW Multivariate Prototyping (DTW-MP) is proposed. VR data was classified as Novice, Intermediate or Expert. Accuracy of algorithms applied for time-series classification were: dynamic time warping 1-nearest neighbor (DTW-1NN) 60%, nearest centroid SoftDTW classification 77.5%, Deep Learning: ResNet 85%, FCN 75%, CNN 72.5% and MCDCNN 28.5%. Expert VR data recordings can be used for guidance of novices. Assessment feedback can help trainees to improve skills and consistency. Motion analysis can identify different techniques used by individuals. Mistakes can be detected dynamically in real-time, raising alarms to prevent injuries.


Bandit-PAM: Almost Linear Time $k$-Medoids Clustering via Multi-Armed Bandits

arXiv.org Artificial Intelligence

Clustering is a ubiquitous task in data science. Compared to the commonly used $k$-means clustering algorithm, $k$-medoids clustering algorithms require the cluster centers to be actual data points and support arbitrary distance metrics, allowing for greater interpretability and the clustering of structured objects. Current state-of-the-art $k$-medoids clustering algorithms, such as Partitioning Around Medoids (PAM), are iterative and are quadratic in the dataset size $n$ for each iteration, being prohibitively expensive for large datasets. We propose Bandit-PAM, a randomized algorithm inspired by techniques from multi-armed bandits, that significantly improves the computational efficiency of PAM. We theoretically prove that Bandit-PAM reduces the complexity of each PAM iteration from $O(n^2)$ to $O(n \log n)$ and returns the same results with high probability, under assumptions on the data that often hold in practice. We empirically validate our results on several large-scale real-world datasets, including a coding exercise submissions dataset from Code.org, the 10x Genomics 68k PBMC single-cell RNA sequencing dataset, and the MNIST handwritten digits dataset. We observe that Bandit-PAM returns the same results as PAM while performing up to 200x fewer distance computations. The improvements demonstrated by Bandit-PAM enable $k$-medoids clustering on a wide range of applications, including identifying cell types in large-scale single-cell data and providing scalable feedback for students learning computer science online. We also release Python and C++ implementations of our algorithm.


Fuzzy c-Means Clustering for Persistence Diagrams

arXiv.org Machine Learning

Persistence diagrams, a key tool in the field of Topological Data Analysis, concisely represent the topology of a point cloud. Most current methods to integrate persistence diagrams into machine learning either require prior knowledge of a ground-truth topology or map them into a feature vector, offering a trade-off between information loss and invoking the curse of dimensionality. In this paper we give an algorithm for Fuzzy c-Means (FCM) clustering directly on the space of persistence diagrams, enabling unsupervised learning that automatically captures the topological structure of data, with no prior knowledge or additional processing of persistence diagrams. We prove the same convergence guarantees as traditional FCM clustering: that any convergent subsequence of our algorithm tends to a local minimum or saddle point of the cost function. We end by presenting experiments that demonstrate our algorithm can successfully cluster transformed datasets from materials science where comparable Wasserstein barycentre clustering algorithms fail, whilst also running at least an order of magnitude faster.


Interpretable Visualizations with Differentiating Embedding Networks

arXiv.org Machine Learning

We present a visualization algorithm based on a novel unsupervised Siamese neural network training regime and loss function, called Differentiating Embedding Networks (DEN). The Siamese neural network finds differentiating or similar features between specific pairs of samples in a dataset, and uses these features to embed the dataset in a lower dimensional space where it can be visualized. Unlike existing visualization algorithms such as UMAP or $t$-SNE, DEN is parametric, meaning it can be interpreted by techniques such as SHAP. To interpret DEN, we create an end-to-end parametric clustering algorithm on top of the visualization, and then leverage SHAP scores to determine which features in the sample space are important for understanding the structures shown in the visualization based on the clusters found. We compare DEN visualizations with existing techniques on a variety of datasets, including image and scRNA-seq data. We then show that our clustering algorithm performs similarly to the state of the art despite not having prior knowledge of the number of clusters, and sets a new state of the art on FashionMNIST. Finally, we demonstrate finding differentiating features of a dataset. Code available at https://github.com/isaacrob/DEN


Similarity-based Classification: Connecting Similarity Learning to Binary Classification

arXiv.org Machine Learning

In real-world classification problems, pairwise supervision (i.e., a pair of patterns with a binary label indicating whether they belong to the same class or not) can often be obtained at a lower cost than ordinary class labels. Similarity learning is a general framework to utilize such pairwise supervision to elicit useful representations by inferring the relationship between two data points, which encompasses various important preprocessing tasks such as metric learning, kernel learning, graph embedding, and contrastive representation learning. Although elicited representations are expected to perform well in downstream tasks such as classification, little theoretical insight has been given in the literature so far. In this paper, we reveal that a specific formulation of similarity learning is strongly related to the objective of binary classification, which spurs us to learn a binary classifier without ordinary class labels---by fitting the product of real-valued prediction functions of pairwise patterns to their similarity. Our formulation of similarity learning does not only generalize many existing ones, but also admits an excess risk bound showing an explicit connection to classification. Finally, we empirically demonstrate the practical usefulness of the proposed method on benchmark datasets.


A novel sentence embedding based topic detection method for micro-blog

arXiv.org Machine Learning

Topic detection is a challenging task, especially without knowing the exact number of topics. In this paper, we present a novel approach based on neural network to detect topics in the micro-blogging dataset. We use an unsupervised neural sentence embedding model to map the blogs to an embedding space. Our model is a weighted power mean word embedding model, and the weights are calculated by attention mechanism. Experimental result shows our embedding method performs better than baselines in sentence clustering. In addition, we propose an improved clustering algorithm referred as relationship-aware DBSCAN (RADBSCAN). It can discover topics from a micro-blogging dataset, and the topic number depends on dataset character itself. Moreover, in order to solve the problem of parameters sensitive, we take blog forwarding relationship as a bridge of two independent clusters. Finally, we validate our approach on a dataset from sina micro-blog. The result shows that we can detect all the topics successfully and extract keywords in each topic.