Clustering
Interpretable Machine Learning for Discovery: Statistical Challenges & Opportunities
Allen, Genevera I., Gan, Luqin, Zheng, Lili
Machine learning systems have gained widespread use in science, technology, and society. Given the increasing number of high-stakes machine learning applications and the growing complexity of machine learning models, many have advocated for interpretability and explainability to promote understanding and trust in machine learning results (Rasheed et al., 2022, Toreini et al., 2020, Broderick et al., 2023). In response, there has been a recent explosion of research on Interpretable Machine Learning (IML), mostly focusing on new techniques to interpret black-box systems; see Molnar (2022), Lipton (2018), Guidotti et al. (2018), Doshi-Velez & Kim (2017), Du et al. (2019), Murdoch et al. (2019), Carvalho et al. (2019) for recent reviews of the IML and explainable artificial intelligence literature. While most of these interpretability techniques were not necessarily designed for this purpose, they are increasingly being used to mine large and complex data sets to generate new insights (Roscher et al., 2020). These so-called data-driven discoveries are especially important to advance data-rich fields in science, technology, and medicine. While prior reviews focus mainly on IML techniques, we primarily review how IML methods promote data-driven discoveries, challenges associated with this task, and related new research opportunities at the intersection of machine learning and statistics. In the sciences and beyond, IML techniques are routinely employed to make new discoveries from large and complex data sets; to motivate our review on this topic, we highlight several examples. First, feature importance and feature selection in supervised learning are popular forms of interpretation that have led to major discoveries like discovering new genomic biomarkers of diseases (Guyon et al., 2002), discovering physical laws governing dynamical systems (Brunton et al., 2016), and discovering lesions and other abnormalities in radiology (Borjali et al., 2020, Reyes et al., 2020). 
While most of the IML literature focuses on supervised learning (Molnar, 2022, Lipton, 2018, Guidotti et al., 2018, Doshi-Velez & Kim, 2017), there have been many major scientific discoveries made via unsupervised techniques and we argue that these approaches
Abnormal Trading Detection in the NFT Market
Song, Mingxiao, Liu, Yunsong, Shah, Agam, Chava, Sudheer
The Non-Fungible-Token (NFT) market has experienced explosive growth in recent years. According to DappRadar, the total transaction volume on OpenSea, the largest NFT marketplace, reached 34.7 billion dollars in February 2023. However, the NFT market is mostly unregulated and there are significant concerns about money laundering, fraud and wash trading. The lack of industry-wide regulations, and the fact that amateur traders and retail investors comprise a significant fraction of the NFT market, make this market particularly vulnerable to fraudulent activities. Therefore, it is essential to investigate and highlight the relevant risks involved in NFT trading. In this paper, we attempted to uncover common fraudulent behaviors such as wash trading that could mislead other traders. Using market data, we designed quantitative features from the network, monetary, and temporal perspectives and fed them into the K-means algorithm, an unsupervised clustering method, to sort traders into groups. Lastly, we discussed the significance of the clustering results and how regulations can reduce undesired behaviors. Our work can potentially help regulators narrow down their search space for bad actors in the market as well as provide insights for amateur traders to protect themselves from unforeseen frauds.
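The clustering step described in this abstract can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the three feature columns and all values are hypothetical, and the deterministic farthest-first initialisation is an assumption made here so that the example is reproducible.

```python
import numpy as np

def farthest_first_init(X, k):
    """Deterministic start: take X[0], then repeatedly add the point
    farthest from all centroids chosen so far."""
    chosen = [0]
    for _ in range(1, k):
        d = ((X[:, None, :] - X[chosen][None, :, :]) ** 2).sum(axis=2).min(axis=1)
        chosen.append(int(d.argmax()))
    return X[chosen].copy()

def kmeans(X, k, n_iter=100):
    """Lloyd's algorithm; returns (labels, centroids)."""
    centroids = farthest_first_init(X, k)
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical per-trader features (illustrative values only, not from the paper):
# columns = [number of counterparties, mean trade value, mean hours between trades]
features = np.array([
    [2.0, 50.0, 0.5],
    [3.0, 55.0, 0.4],
    [2.0, 48.0, 0.6],
    [40.0, 1.2, 30.0],
    [35.0, 0.9, 28.0],
    [42.0, 1.5, 33.0],
])
# Standardise so that no single feature dominates the distance.
scaled = (features - features.mean(axis=0)) / features.std(axis=0)
labels, centroids = kmeans(scaled, k=2)
```

Standardising before clustering matters here because K-means relies on Euclidean distance, so a feature measured on a large scale would otherwise dominate the grouping.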
Data-driven identification and analysis of the glass transition in polymer melts
Banerjee, Atreyee, Hsu, Hsiao-Ping, Kremer, Kurt, Kukharenko, Oleksandra
Understanding the nature of glass transition, as well as precise estimation of the glass transition temperature for polymeric materials, remain open questions in both experimental and theoretical polymer sciences. We propose a data-driven approach, which utilizes the high-resolution details accessible through the molecular dynamics simulation and considers the structural information of individual chains. It clearly identifies the glass transition temperature of polymer melts of weakly semiflexible chains. By combining principal component analysis and clustering, we identify the glass transition temperature in the asymptotic limit even from relatively short-time trajectories, which just reach into the Rouse-like monomer displacement regime. We demonstrate that fluctuations captured by the principal component analysis reflect the change in a chain's behaviour: from conformational rearrangement above to small rearrangements below the glass transition temperature. Our approach is straightforward to apply, and should be applicable to other polymeric glass-forming liquids.
Explainable Graph Spectral Clustering of Text Documents
Starosta, Bartłomiej, Kłopotek, Mieczysław A., Wierzchoń, Sławomir T.
Spectral clustering methods are known for their ability to represent clusters of diverse shapes, densities, etc. However, the results of such algorithms, when applied e.g. to text documents, are hard to explain to the user, especially because the embedding in the spectral space has no obvious relation to document contents. Therefore there is an urgent need to develop methods for explaining the outcome of the clustering. This paper presents a contribution towards this goal. We propose a way of explaining the results of combinatorial-Laplacian-based graph spectral clustering. It is based on showing the (approximate) equivalence of the combinatorial Laplacian embedding, the $K$-embedding (proposed in this paper) and the term vector space embedding. Hence a bridge is constructed between the textual contents and the clustering results. We provide the theoretical background for this approach. We performed an experimental study showing that the $K$-embedding approximates the Laplacian embedding well under favourable block matrix conditions and that the approximation is good enough under other conditions.
Graphical Dirichlet Process for Clustering Non-Exchangeable Grouped Data
Chakrabarti, Arhit, Ni, Yang, Morris, Ellen Ruth A., Salinas, Michael L., Chapkin, Robert S., Mallick, Bani K.
We consider the problem of clustering grouped data with possibly non-exchangeable groups whose dependencies can be characterized by a known directed acyclic graph. To allow the sharing of clusters among the non-exchangeable groups, we propose a Bayesian nonparametric approach, termed graphical Dirichlet process, that jointly models the dependent group-specific random measures by assuming each random measure to be distributed as a Dirichlet process whose concentration parameter and base probability measure depend on those of its parent groups. The resulting joint stochastic process respects the Markov property of the directed acyclic graph that links the groups. We characterize the graphical Dirichlet process using a novel hypergraph representation as well as the stick-breaking representation, the restaurant-type representation, and the representation as a limit of a finite mixture model. We develop an efficient posterior inference algorithm and illustrate our model with simulations and a real grouped single-cell dataset.
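As background for the stick-breaking representation mentioned above, the following sketch draws truncated stick-breaking weights for a standard (single-group) Dirichlet process with concentration parameter alpha. It does not implement the graphical Dirichlet process of the paper; the truncation level, the value of alpha, and the standard normal base measure are assumptions chosen for illustration.

```python
import numpy as np

def stick_breaking_weights(alpha, n_atoms, rng):
    """Truncated stick-breaking weights for a standard DP with concentration alpha.

    beta_k ~ Beta(1, alpha) and pi_k = beta_k * prod_{j<k} (1 - beta_j).
    """
    betas = rng.beta(1.0, alpha, size=n_atoms)
    # Length of stick remaining before each break: prod_{j<k} (1 - beta_j).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining

rng = np.random.default_rng(0)
weights = stick_breaking_weights(alpha=2.0, n_atoms=50, rng=rng)
# Atom locations drawn from an assumed base measure N(0, 1).
atoms = rng.normal(loc=0.0, scale=1.0, size=50)
```

The pairs (weights[k], atoms[k]) together approximate one draw of a discrete random measure; smaller values of alpha concentrate the mass on fewer atoms, which corresponds to fewer clusters.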
CBCL-PR: A Cognitively Inspired Model for Class-Incremental Learning in Robotics
For most real-world applications, robots need to adapt and learn continually with limited data in their environments. In this paper, we consider the problem of Few-Shot class Incremental Learning (FSIL), in which an AI agent is required to learn incrementally from a few data samples without forgetting the data it has previously learned. To solve this problem, we present a novel framework inspired by theories of concept learning in the hippocampus and the neocortex. Our framework represents object classes in the form of sets of clusters and stores them in memory. The framework replays data generated by the clusters of the old classes, to avoid forgetting when learning new classes. Our approach is evaluated on two object classification datasets resulting in state-of-the-art (SOTA) performance for class-incremental learning and FSIL. We also evaluate our framework for FSIL on a robot demonstrating that the robot can continually learn to classify a large set of household objects with limited human assistance.
A Trajectory K-Anonymity Model Based on Point Density and Partition
Yu, Wanshu, Shi, Haonan, Xu, Hongyun
As people's daily life becomes increasingly inseparable from various mobile electronic devices, relevant service application platforms and network operators can easily collect a large amount of individual information. When these data are released for scientific research or commercial purposes, users' privacy will be in danger, especially in the publication of spatiotemporal trajectory datasets. Therefore, to avoid the leakage of users' privacy, it is necessary to anonymize the data before they are released. However, simply removing the unique identifiers of individuals is not enough to protect trajectory privacy, because some attackers may infer the identity of users by linking the data with other databases. Much work has been devoted to merging multiple trajectories to avoid re-identification, but these solutions always require sacrificing data quality to achieve the anonymity requirement. In order to provide sufficient privacy protection for users' trajectory datasets, this paper studies trajectory privacy against re-identification attacks and proposes a trajectory K-anonymity model based on Point Density and Partition (KPDP). Our approach improves the existing trajectory generalization anonymization techniques regarding trajectory set partition preprocessing and trajectory clustering algorithms. It successfully resists re-identification attacks and reduces the data utility loss of the k-anonymized dataset. A series of experiments on a real-world dataset show that the proposed model has significant advantages in terms of higher data utility and shorter algorithm execution time than other existing techniques.
Utilisation of open intent recognition models for customer support intent detection
Mohammad, Rasheed, Favell, Oliver, Shah, Shariq, Cooper, Emmett, Vakaj, Edlira
Businesses have sought out new solutions to provide support and improve customer satisfaction as more products and services have become interconnected digitally. There is an inherent need for businesses to provide or outsource fast, efficient and knowledgeable support to remain competitive. Support solutions are also advancing with technologies, including use of social media, Artificial Intelligence (AI), Machine Learning (ML) and remote device connectivity to better support customers. Customer support operators are trained to utilise these technologies to provide better customer outreach and support for clients in remote areas. Interconnectivity of products and support systems provides businesses with potential international clients to expand their product market and business scale. This paper reports the possible AI applications in customer support, done in collaboration with the Knowledge Transfer Partnership (KTP) program between Birmingham City University and a company that handles customer service systems for businesses outsourcing customer support across a wide variety of business sectors. This study explored several approaches to accurately predict customers' intent using both labelled and unlabelled textual data. While some approaches showed promise in specific datasets, the search for a single, universally applicable approach continues. The development of separate pipelines for intent detection and discovery has led to improved accuracy rates in detecting known intents, while further work is required to improve the accuracy of intent discovery for unknown intents.
CARL-G: Clustering-Accelerated Representation Learning on Graphs
Shiao, William, Saini, Uday Singh, Liu, Yozen, Zhao, Tong, Shah, Neil, Papalexakis, Evangelos E.
Self-supervised learning on graphs has made large strides in achieving great performance in various downstream tasks. However, many state-of-the-art methods suffer from a number of impediments, which prevent them from realizing their full potential. For instance, contrastive methods typically require negative sampling, which is often computationally costly. While non-contrastive methods avoid this expensive step, most existing methods either rely on overly complex architectures or dataset-specific augmentations. In this paper, we ask: Can we borrow from classical unsupervised machine learning literature in order to overcome those obstacles? We are guided by our key insight that the goal of distance-based clustering closely resembles that of contrastive learning: both attempt to pull representations of similar items together and push those of dissimilar items apart. As a result, we propose CARL-G, a novel clustering-based framework for graph representation learning that uses a loss inspired by Cluster Validation Indices (CVIs), i.e., internal measures of cluster quality (no ground truth required). CARL-G is adaptable to different clustering methods and CVIs, and we show that with the right choice of clustering method and CVI, CARL-G outperforms node classification baselines on 4/5 datasets with up to a 79x training speedup compared to the best-performing baseline. CARL-G also performs on par with or better than baselines in node clustering and similarity search tasks, training up to 1,500x faster than the best-performing baseline. Finally, we also provide theoretical foundations for the use of CVI-inspired losses in graph representation learning.
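To make the notion of a Cluster Validation Index concrete, the sketch below computes the mean silhouette coefficient, one widely used internal CVI. It is an illustration only and is not the CARL-G loss; the toy points and labelings are hypothetical.

```python
import numpy as np

def silhouette_score(X, labels):
    """Mean silhouette coefficient; requires at least two clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # convention: a singleton cluster scores 0
        # a: mean distance to the other members of the point's own cluster.
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores.mean()

# Two tight, well-separated groups of points (toy data).
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette_score(points, [0, 0, 1, 1])  # labels match the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
```

A labeling that matches the geometry scores close to 1, while one that cuts across the groups scores below 0. No ground-truth labels are involved, which is what makes an index of this kind usable as a self-supervised training signal.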
DeepVAT: A Self-Supervised Technique for Cluster Assessment in Image Datasets
Mazumder, Alokendu, Baruah, Tirthajit, Singh, Akash Kumar, Murthy, Pagadla Krishna, Pattanaik, Vishwajeet, Rathore, Punit
Estimating the number of clusters and cluster structures in unlabeled, complex, and high-dimensional datasets (like images) is challenging for traditional clustering algorithms. In recent years, a matrix reordering-based algorithm called Visual Assessment of Tendency (VAT) and its variants have attracted many researchers from various domains to estimate the number of clusters and inherent cluster structure present in the data. However, these algorithms face significant challenges when dealing with image data as they fail to effectively capture the crucial features inherent in images. To overcome these limitations, we propose a deep-learning-based framework that enables the assessment of cluster structure in complex image datasets. Our approach utilizes a self-supervised deep neural network to generate representative embeddings for the data. These embeddings are then reduced to two dimensions using t-distributed Stochastic Neighbour Embedding (t-SNE) and fed into VAT-based algorithms to estimate the underlying cluster structure. Importantly, our framework does not rely on any prior knowledge of the number of clusters. Our proposed approach demonstrates superior performance compared to state-of-the-art VAT family algorithms and two other deep clustering algorithms on four benchmark image datasets, namely MNIST, FMNIST, CIFAR-10, and INTEL.
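The matrix reordering at the core of VAT can be sketched as below. This is a minimal version of the basic VAT ordering applied to a precomputed dissimilarity matrix, assuming Euclidean distances on toy 2-D points; it leaves out the self-supervised embedding and t-SNE stages of the proposed framework.

```python
import numpy as np

def vat_order(D):
    """Return the VAT ordering of objects for a symmetric dissimilarity matrix D."""
    D = np.asarray(D, dtype=float)
    n = len(D)
    # Start from one endpoint of the largest dissimilarity.
    first = int(np.unravel_index(np.argmax(D), D.shape)[0])
    order = [first]
    remaining = [j for j in range(n) if j != first]
    while remaining:
        # Add the unvisited object closest to any visited object
        # (the same greedy step as Prim's minimum spanning tree algorithm).
        sub = D[np.ix_(order, remaining)]
        nxt = remaining[int(sub.min(axis=0).argmin())]
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Toy data: two groups of three points, deliberately interleaved in the input.
points = np.array([[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]], dtype=float)
D = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2))
order = vat_order(D)
reordered = D[np.ix_(order, order)]  # dark diagonal blocks indicate clusters
```

When the reordered matrix is displayed as an image, each cluster appears as a dark block along the diagonal, and the number of blocks suggests the number of clusters.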