Clustering
Act Natural! Extending Naturalistic Projection to Multimodal Behavior Scenarios
Khan, Hamzah I., Fridovich-Keil, David
--Autonomous agents operating in public spaces must consider how their behaviors might affect the humans around them, even when not directly interacting with them. T o this end, it is often beneficial to be predictable and appear naturalistic. Existing methods for this purpose use human actor intent modeling or imitation learning techniques, but these approaches rarely capture all possible motivations for human behavior and/or require significant amounts of data. Our work extends a technique for modeling unimodal naturalistic behaviors with an explicit convex set representation, to account for multimodal behavior by using multiple convex sets. This more flexible representation provides a higher degree of fidelity in data-driven modeling of naturalistic behavior that arises in real-world scenarios in which human behavior is, in some sense, discrete, e.g. Equipped with this new set representation, we develop an optimization-based filter to project arbitrary trajectories into the set so that they appear naturalistic to humans in the scene, while also satisfying vehicle dynamics, actuator limits, etc. We demonstrate our methods on real-world human driving data from the inD (intersection) and rounD (roundabout) datasets. Safe and comfortable interaction between humans and autonomous agents requires a measure of predictable and naturalistic behavior from autonomous systems. Autonomous agents in the real world can easily find themselves in unsafe situations when they violate these informal norms of humanlike behavior. As one example, autonomous vehicles are well-documented to behave more cautiously than human drivers expect, which can lead to human drivers reacting unsafely to unexpected or abnormal driving and causing collisions [1]. Thus, autonomous agents must be able to plan and execute naturalistic, human-like behavior. However, naturalistic behavior tends to be challenging to model mathematically because human preferences and decision-making are opaque. Nevertheless, there exists a need for techniques which are able to model the wide variety of naturalistic behavior based on observations of human actions. This work was supported by the National Science Foundation under Grants 2211548 and 2336840. Each nonconvex set is represented by the union of a set of convex hulls which are formed by clustering the trajectory states at each time. Then, we project arbitrary trajectories into this set to make the behaviors more naturalistic.
Toward Data-centric Directed Graph Learning: An Entropy-driven Approach
Li, Xunkai, Wu, Zhengyu, Yu, Kaichi, Qin, Hongchao, Zeng, Guang, Li, Rong-Hua, Wang, Guoren
The directed graph (digraph), as a generalization of undirected graphs, exhibits superior representation capability in modeling complex topology systems and has garnered considerable attention in recent years. Despite the notable efforts made by existing DiGraph Neural Networks (DiGNNs) to leverage directed edges, they still fail to comprehensively delve into the abundant data knowledge concealed in the digraphs. This data-level limitation results in model-level sub-optimal predictive performance and underscores the necessity of further exploring the potential correlations between the directed edges (topology) and node profiles (feature and labels) from a data-centric perspective, thereby empowering model-centric neural networks with stronger encoding capabilities. In this paper, we propose \textbf{E}ntropy-driven \textbf{D}igraph knowl\textbf{E}dge distillatio\textbf{N} (EDEN), which can serve as a data-centric digraph learning paradigm or a model-agnostic hot-and-plug data-centric Knowledge Distillation (KD) module. The core idea is to achieve data-centric ML, guided by our proposed hierarchical encoding theory for structured data. Specifically, EDEN first utilizes directed structural measurements from a topology perspective to construct a coarse-grained Hierarchical Knowledge Tree (HKT). Subsequently, EDEN quantifies the mutual information of node profiles to refine knowledge flow in the HKT, enabling data-centric KD supervision within model training. As a general framework, EDEN can also naturally extend to undirected scenarios and demonstrate satisfactory performance. In our experiments, EDEN has been widely evaluated on 14 (di)graph datasets (homophily and heterophily) and across 4 downstream tasks. The results demonstrate that EDEN attains SOTA performance and exhibits strong improvement for prevalent (Di)GNNs.
Clustering Internet Memes Through Template Matching and Multi-Dimensional Similarity
Meme clustering is critical for toxicity detection, virality modeling, and typing, but it has received little attention in previous research. Clustering similar Internet memes is challenging due to their multimodality, cultural context, and adaptability. Existing approaches rely on databases, overlook semantics, and struggle to handle diverse dimensions of similarity. This paper introduces a novel method that uses template-based matching with multi-dimensional similarity features, thus eliminating the need for predefined databases and supporting adaptive matching. Memes are clustered using local and global features across similarity categories such as form, visual content, text, and identity. Our combined approach outperforms existing clustering methods, producing more consistent and coherent clusters, while similarity-based feature sets enable adaptability and align with human intuition. We make all supporting code publicly available to support subsequent research.
OmicsCL: Unsupervised Contrastive Learning for Cancer Subtype Discovery and Survival Stratification
Unsupervised learning of disease subtypes from multi-omics data presents a significant opportunity for advancing personalized medicine. We introduce OmicsCL, a modular contrastive learning framework that jointly embeds heterogeneous omics modalities-such as gene expression, DNA methylation, and miRNA expression-into a unified latent space. Our method incorporates a survival-aware contrastive loss that encourages the model to learn representations aligned with survival-related patterns, without relying on labeled outcomes. Evaluated on the TCGA BRCA dataset, OmicsCL uncovers clinically meaningful clusters and achieves strong unsupervised concordance with patient survival. The framework demonstrates robustness across hyperparameter configurations and can be tuned to prioritize either subtype coherence or survival stratification. Ablation studies confirm that integrating survival-aware loss significantly enhances the predictive power of learned embeddings. These results highlight the promise of contrastive objectives for biological insight discovery in high-dimensional, heterogeneous omics data.
TNStream: Applying Tightest Neighbors to Micro-Clusters to Define Multi-Density Clusters in Streaming Data
Zeng, Qifen, Bao, Haomin, Hu, Yuanzhuo, Zhang, Zirui, Zheng, Yuheng, Wen, Luosheng
In data stream clustering, systematic theory of stream clustering algorithms remains relatively scarce. Recently, density-based methods have gained attention. However, existing algorithms struggle to simultaneously handle arbitrarily shaped, multi-density, high-dimensional data while maintaining strong outlier resistance. Clustering quality significantly deteriorates when data density varies complexly. This paper proposes a clustering algorithm based on the novel concept of Tightest Neighbors and introduces a data stream clustering theory based on the Skeleton Set. Based on these theories, this paper develops a new method, TNStream, a fully online algorithm. The algorithm adaptively determines the clustering radius based on local similarity, summarizing the evolution of multi-density data streams in micro-clusters. It then applies a Tightest Neighbors-based clustering algorithm to form final clusters. To improve efficiency in high-dimensional cases, Locality-Sensitive Hashing (LSH) is employed to structure micro-clusters, addressing the challenge of storing k-nearest neighbors. TNStream is evaluated on various synthetic and real-world datasets using different clustering metrics. Experimental results demonstrate its effectiveness in improving clustering quality for multi-density data and validate the proposed data stream clustering theory.
A Hybrid Mixture of $t$-Factor Analyzers for Clustering High-dimensional Data
This paper develops a novel hybrid approach for estimating the mixture model of $t$-factor analyzers (MtFA) that employs multivariate $t$-distribution and factor model to cluster and characterize grouped data. The traditional estimation method for MtFA faces computational challenges, particularly in high-dimensional settings, where the eigendecomposition of large covariance matrices and the iterative nature of Expectation-Maximization (EM) algorithms lead to scalability issues. We propose a computational scheme that integrates a profile likelihood method into the EM framework to efficiently obtain the model parameter estimates. The effectiveness of our approach is demonstrated through simulations showcasing its superior computational efficiency compared to the existing method, while preserving clustering accuracy and resilience against outliers. Our method is applied to cluster the Gamma-ray bursts, reinforcing several claims in the literature that Gamma-ray bursts have heterogeneous subpopulations and providing characterizations of the estimated groups.
InvAASTCluster: On Applying Invariant-Based Program Clustering to Introductory Programming Assignments
Orvalho, Pedro, Janota, Mikolรกลก, Manquinho, Vasco
Due to the vast number of students enrolled in programming courses, there has been an increasing number of automated program repair techniques focused on introductory programming assignments (IPAs). Typically, such techniques use program clustering to take advantage of previous correct student implementations to repair a new incorrect submission. These repair techniques use clustering methods since analyzing all available correct submissions to repair a program is not feasible. However, conventional clustering methods rely on program representations based on features such as abstract syntax trees (ASTs), syntax, control flow, and data flow. This paper proposes InvAASTCluster, a novel approach for program clustering that uses dynamically generated program invariants to cluster semantically equivalent IPAs. InvAASTCluster's program representation uses a combination of the program's semantics, through its invariants, and its structure through its anonymized abstract syntax tree (AASTs). Invariants denote conditions that must remain true during program execution, while AASTs are ASTs devoid of variable and function names, retaining only their types. Our experiments show that the proposed program representation outperforms syntax-based representations when clustering a set of correct IPAs. Furthermore, we integrate InvAASTCluster into a state-of-the-art clustering-based program repair tool. Our results show that InvAASTCluster advances the current state-of-the-art when used by clustering-based repair tools by repairing around 13% more students' programs, in a shorter amount of time.
Disjunctive and Conjunctive Normal Form Explanations of Clusters Using Auxiliary Information
Downey, Robert F., Ravi, S. S.
We consider generating post-hoc explanations of clusters generated from various datasets using auxiliary information which was not used by clustering algorithms. Following terminology used in previous work, we refer to the auxiliary information as tags. Our focus is on two forms of explanations, namely disjunctive form (where the explanation for a cluster consists of a set of tags) and a two-clause conjunctive normal form (CNF) explanation (where the explanation consists of two sets of tags, combined through the AND operator). We use integer linear programming (ILP) as well as heuristic methods to generate these explanations. We experiment with a variety of datasets and discuss the insights obtained from our explanations. We also present experimental results regarding the scalability of our explanation methods.
Manifold Clustering with Schatten p-norm Maximization
Manifold clustering, with its exceptional ability to capture complex data structures, holds a pivotal position in cluster analysis. However, existing methods often focus only on finding the optimal combination between K-means and manifold learning, and overlooking the consistency between the data structure and labels. To address this issue, we deeply explore the relationship between K-means and manifold learning, and on this basis, fuse them to develop a new clustering framework. Specifically, the algorithm uses labels to guide the manifold structure and perform clustering on it, which ensures the consistency between the data structure and labels. Furthermore, in order to naturally maintain the class balance in the clustering process, we maximize the Schatten p-norm of labels, and provide a theoretical proof to support this. Additionally, our clustering framework is designed to be flexible and compatible with many types of distance functions, which facilitates efficient processing of nonlinear separable data. The experimental results of several databases confirm the superiority of our proposed model.
Radius-Guided Post-Clustering for Shape-Aware, Scalable Refinement of k-Means Results
Traditional k-means clustering underperforms on non-convex shapes and requires the number of clusters k to be specified in advance. We propose a simple geometric enhancement: after standard k-means, each cluster center is assigned a radius (the distance to its farthest assigned point), and clusters whose radii overlap are merged. This post-processing step loosens the requirement for exact k: as long as k is overestimated (but not excessively), the method can often reconstruct non-convex shapes through meaningful merges. We also show that this approach supports recursive partitioning: clustering can be performed independently on tiled regions of the feature space, then globally merged, making the method scalable and suitable for distributed systems. Implemented as a lightweight post-processing step atop scikit-learn's k-means, the algorithm performs well on benchmark datasets, achieving high accuracy with minimal additional computation.