Clustering
Tensor Dirichlet Process Multinomial Mixture Model for Passenger Trajectory Clustering
Li, Ziyue, Yan, Hao, Zhang, Chen, Wang, Andi, Ketter, Wolfgang, Sun, Lijun, Tsung, Fugee
Passenger clustering based on travel records is essential for transportation operators. However, existing methods cannot easily cluster the passengers due to the hierarchical structure of the passenger trip information, namely: each passenger has multiple trips, and each trip contains multi-dimensional multi-mode information. Furthermore, existing approaches rely on an accurate specification of the clustering number to start, which is difficult when millions of commuters are using the transport systems on a daily basis. In this paper, we propose a novel Tensor Dirichlet Process Multinomial Mixture model (Tensor-DPMM), which is designed to preserve the multi-mode and hierarchical structure of the multi-dimensional trip information via tensor, and cluster them in a unified one-step manner. The model also has the ability to determine the number of clusters automatically by using the Dirichlet Process to decide the probabilities for a passenger to be either assigned in an existing cluster or to create a new cluster: This allows our model to grow the clusters as needed in a dynamic manner. Finally, existing methods do not consider spatial semantic graphs such as geographical proximity and functional similarity between the locations, which may cause inaccurate clustering. To this end, we further propose a variant of our model, namely the Tensor-DPMM with Graph. For the algorithm, we propose a tensor Collapsed Gibbs Sampling method, with an innovative step of "disband and relocating", which disbands clusters with too small amount of members and relocates them to the remaining clustering. This avoids uncontrollable growing amounts of clusters. A case study based on Hong Kong metro passenger data is conducted to demonstrate the automatic process of learning the number of clusters, and the learned clusters are better in within-cluster compactness and cross-cluster separateness.
Analyzing scRNA-seq data by CCP-assisted UMAP and t-SNE
Single-cell RNA sequencing (scRNA-seq) is a relatively new technology that profiles the transcriptome of individual cells within a tissue or organ, aiming to gain understanding of gene expression, gene regulation, cell-cell interaction, spatial transcriptomics, signal transduction pathways, and more [1]. The typical workflow of scRNA-seq involves cell isolation, RNA extraction, library preparation, sequencing, and data analysis. One of the key challenges in scRNA-seq analysis is the enormous amount of data generated, which can be complex, nonuniform, noisy, unlabeled, and of excessively high dimensions. A typical data analysis pipeline involves data preprocessing, gene expression quantification, normalization and batch correction, dimensionality reduction, cell type identification, differential gene expression analysis, and pathway and functional analysis [2-7]. Principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP) are some of the most popular approaches for dimensionality reduction, clustering, and visualization of scRNA-seq data. PCA is a classical technique used for dimensionality reduction and visualization. It identifies the most important patterns and correlations in a high-dimensional dataset and expresses them as a linear combination of new and orthogonal components. Among them, the first few components are often regarded as principal components, which can be used to visualize the scRNA-seq data in a lower-dimensional space or to identify important gene expression patterns.
A New Paradigm for Generative Adversarial Networks based on Randomized Decision Rules
Kim, Sehwan, Song, Qifan, Liang, Faming
The Generative Adversarial Network (GAN) was recently introduced in the literature as a novel machine learning method for training generative models. It has many applications in statistics such as nonparametric clustering and nonparametric conditional independence tests. However, training the GAN is notoriously difficult due to the issue of mode collapse, which refers to the lack of diversity among generated data. In this paper, we identify the reasons why the GAN suffers from this issue, and to address it, we propose a new formulation for the GAN based on randomized decision rules. In the new formulation, the discriminator converges to a fixed point while the generator converges to a distribution at the Nash equilibrium. We propose to train the GAN by an empirical Bayes-like method by treating the discriminator as a hyper-parameter of the posterior distribution of the generator. Specifically, we simulate generators from its posterior distribution conditioned on the discriminator using a stochastic gradient Markov chain Monte Carlo (MCMC) algorithm, and update the discriminator using stochastic gradient descent along with simulations of the generators. We establish convergence of the proposed method to the Nash equilibrium. Apart from image generation, we apply the proposed method to nonparametric clustering and nonparametric conditional independence tests. A portion of the numerical results is presented in the supplementary material.
Semi-automated extraction of research topics and trends from NCI funding in radiological sciences from 2000-2020
Nguyen, Mark, Beidler, Peter, Tsai, Joseph, Anderson, August, Chen, Daniel, Kinahan, Paul, Kang, John
Investigators, funders, and the public desire knowledge on topics and trends in publicly funded research but current efforts in manual categorization are limited in scale and understanding. We developed a semi-automated approach to extract and name research topics, and applied this to \$1.9B of NCI funding over 21 years in the radiological sciences to determine micro- and macro-scale research topics and funding trends. Our method relies on sequential clustering of existing biomedical-based word embeddings, naming using subject matter experts, and visualization to discover trends at a macroscopic scale above individual topics. We present results using 15 and 60 cluster topics, where we found that 2D projection of grant embeddings reveals two dominant axes: physics-biology and therapeutic-diagnostic. For our dataset, we found that funding for therapeutics- and physics-based research have outpaced diagnostics- and biology-based research, respectively. We hope these results may (1) give insight to funders on the appropriateness of their funding allocation, (2) assist investigators in contextualizing their work and explore neighboring research domains, and (3) allow the public to review where their tax dollars are being allocated.
An Interactive Interface for Novel Class Discovery in Tabular Data
Troisemaine, Colin, Flocon-Cholet, Joachim, Gosselin, Stéphane, Reiffers-Masson, Alexandre, Vaton, Sandrine, Lemaire, Vincent
Novel Class Discovery (NCD) is the problem of trying to discover novel classes in an unlabeled set, given a labeled set of different but related classes. The majority of NCD methods proposed so far only deal with image data, despite tabular data being among the most widely used type of data in practical applications. To interpret the results of clustering or NCD algorithms, data scientists need to understand the domain- and application-specific attributes of tabular data. This task is difficult and can often only be performed by a domain expert. Therefore, this interface allows a domain expert to easily run state-of-the-art algorithms for NCD in tabular data. With minimal knowledge in data science, interpretable results can be generated.
Approximation Algorithms for Fair Range Clustering
Hotegni, Sèdjro S., Mahabadi, Sepideh, Vakilian, Ali
This paper studies the fair range clustering problem in which the data points are from different demographic groups and the goal is to pick $k$ centers with the minimum clustering cost such that each group is at least minimally represented in the centers set and no group dominates the centers set. More precisely, given a set of $n$ points in a metric space $(P,d)$ where each point belongs to one of the $\ell$ different demographics (i.e., $P = P_1 \uplus P_2 \uplus \cdots \uplus P_\ell$) and a set of $\ell$ intervals $[\alpha_1, \beta_1], \cdots, [\alpha_\ell, \beta_\ell]$ on desired number of centers from each group, the goal is to pick a set of $k$ centers $C$ with minimum $\ell_p$-clustering cost (i.e., $(\sum_{v\in P} d(v,C)^p)^{1/p}$) such that for each group $i\in \ell$, $|C\cap P_i| \in [\alpha_i, \beta_i]$. In particular, the fair range $\ell_p$-clustering captures fair range $k$-center, $k$-median and $k$-means as its special cases. In this work, we provide efficient constant factor approximation algorithms for fair range $\ell_p$-clustering for all values of $p\in [1,\infty)$.
Graph Pooling for Graph Neural Networks: Progress, Challenges, and Opportunities
Liu, Chuang, Zhan, Yibing, Wu, Jia, Li, Chang, Du, Bo, Hu, Wenbin, Liu, Tongliang, Tao, Dacheng
Graph neural networks have emerged as a leading architecture for many graph-level tasks, such as graph classification and graph generation. As an essential component of the architecture, graph pooling is indispensable for obtaining a holistic graph-level representation of the whole graph. Although a great variety of methods have been proposed in this promising and fast-developing research field, to the best of our knowledge, little effort has been made to systematically summarize these works. To set the stage for the development of future works, in this paper, we attempt to fill this gap by providing a broad review of recent methods for graph pooling. Specifically, 1) we first propose a taxonomy of existing graph pooling methods with a mathematical summary for each category; 2) then, we provide an overview of the libraries related to graph pooling, including the commonly used datasets, model architectures for downstream tasks, and open-source implementations; 3) next, we further outline the applications that incorporate the idea of graph pooling in a variety of domains; 4) finally, we discuss certain critical challenges facing current studies and share our insights on future potential directions for research on the improvement of graph pooling.
From structure mining to unsupervised exploration of atomic octahedral networks
Xian, R. Patrick, Morelock, Ryan J., Hadar, Ido, Musgrave, Charles B., Sutton, Christopher
Networks of atom-centered coordination octahedra commonly occur in inorganic and hybrid solid-state materials. Characterizing their spatial arrangements and characteristics is crucial for relating structures to properties for many materials families. The traditional method using case-by-case inspection becomes prohibitive for discovering trends and similarities in large datasets. Here, we operationalize chemical intuition to automate the geometric parsing, quantification, and classification of coordination octahedral networks. We find axis-resolved tilting trends in ABO$_{3}$ perovskite polymorphs, which assist in detecting oxidation state changes. Moreover, we develop a scale-invariant encoding scheme to represent these networks, which, combined with human-assisted unsupervised machine learning, allows us to taxonomize the inorganic framework polytypes in hybrid iodoplumbates (A$_x$Pb$_y$I$_z$). Consequently, we uncover a violation of Pauling's third rule and the design principles underpinning their topological diversity. Our results offer a glimpse into the vast design space of atomic octahedral networks and inform high-throughput, targeted screening of specific structure types.
Understanding human mobility patterns in Chicago: an analysis of taxi data using clustering techniques
Chauhan, Harish, Gupta, Nikunj, Haskell-Craig, Zoe
Understanding human mobility patterns is important in applications as diverse as urban planning, public health, and political organizing. One rich source of data on human mobility is taxi ride data. Using the city of Chicago as a case study, we examine data from taxi rides in 2016 with the goal of understanding how neighborhoods are interconnected. This analysis will provide a sense of which neighborhoods individuals are using taxis to travel between, suggesting regions to focus new public transit development efforts. Additionally, this analysis will map traffic circulation patterns and provide an understanding of where in the city people are traveling from and where they are heading to - perhaps informing traffic or road pollution mitigation efforts. For the first application, representing the data as an undirected graph will suffice. Transit lines run in both directions so simply a knowledge of which neighborhoods have high rates of taxi travel between them provides an argument for placing public transit along those routes. However, in order to understand the flow of people throughout a city, we must make a distinction between the neighborhood from which people are departing and the areas to which they are arriving - this requires methods that can deal with directed graphs. All developed codes can be found at https://github.com/Nikunj-Gupta/Spectral-Clustering-Directed-Graphs.
Contrastive Hierarchical Clustering
Znaleźniak, Michał, Rola, Przemysław, Kaszuba, Patryk, Tabor, Jacek, Śmieja, Marek
Deep clustering has been dominated by flat models, which split a dataset into a predefined number of groups. Although recent methods achieve an extremely high similarity with the ground truth on popular benchmarks, the information contained in the flat partition is limited. In this paper, we introduce CoHiClust, a Contrastive Hierarchical Clustering model based on deep neural networks, which can be applied to typical image data. By employing a self-supervised learning approach, CoHiClust distills the base network into a binary tree without access to any labeled data. The hierarchical clustering structure can be used to analyze the relationship between clusters, as well as to measure the similarity between data points. Experiments demonstrate that CoHiClust generates a reasonable structure of clusters, which is consistent with our intuition and image semantics. Moreover, it obtains superior clustering accuracy on most of the image datasets compared to the state-of-the-art flat clustering models. Our implementation is available at https://github.