Clustering
CluStRE: Streaming Graph Clustering with Multi-Stage Refinement
Chhabra, Adil, Peretz, Shai Dorian, Schulz, Christian
We present CluStRE, a novel streaming graph clustering algorithm that balances computational efficiency with high-quality clustering using multi-stage refinement. Unlike traditional in-memory clustering approaches, CluStRE processes graphs in a streaming setting, significantly reducing memory overhead while leveraging re-streaming and evolutionary heuristics to improve solution quality. Our method dynamically constructs a quotient graph, enabling modularity-based optimization while efficiently handling large-scale graphs. We introduce multiple configurations of CluStRE to provide trade-offs between speed, memory consumption, and clustering quality. Experimental evaluations demonstrate that CluStRE improves solution quality by 89.8%, operates 2.6 times faster, and uses less than two-thirds of the memory required by the state-of-the-art streaming clustering algorithm on average. Moreover, our strongest mode enhances solution quality by up to 150% on average. With this, CluStRE achieves comparable solution quality to in-memory algorithms, i.e. over 96% of the quality of clustering approaches, including Louvain, effectively bridging the gap between streaming and traditional clustering methods.
Causal Learning for Heterogeneous Subgroups Based on Nonlinear Causal Kernel Clustering
Liu, Lu, Tang, Yang, Zhang, Kexuan, Sun, Qiyu
Due to the challenge posed by multi-source and heterogeneous data collected from diverse environments, causal relationships among features can exhibit variations influenced by different time spans, regions, or strategies. This diversity makes a single causal model inadequate for accurately representing complex causal relationships in all observational data, a crucial consideration in causal learning. To address this challenge, the nonlinear Causal Kernel Clustering method is introduced for heterogeneous subgroup causal learning, highlighting variations in causal relationships across diverse subgroups. The main component for clustering heterogeneous subgroups lies in the construction of the $u$-centered sample mapping function with the property of unbiased estimation, which assesses the differences in potential nonlinear causal relationships in various samples and supported by causal identifiability theory. Experimental results indicate that the method performs well in identifying heterogeneous subgroups and enhancing causal learning, leading to a reduction in prediction error.
Exploring Visual Embedding Spaces Induced by Vision Transformers for Online Auto Parts Marketplaces
This study examines the capabilities of the Vision Transformer (ViT) model in generating visual embeddings for images of auto parts sourced from online marketplaces, such as Craigslist and OfferUp. By focusing exclusively on single-modality data, the analysis evaluates ViT's potential for detecting patterns indicative of illicit activities. The workflow involves extracting high-dimensional embeddings from images, applying dimensionality reduction techniques like Uniform Manifold Approximation and Projection (UMAP) to visualize the embedding space, and using K-Means clustering to categorize similar items. Representative posts nearest to each cluster centroid provide insights into the composition and characteristics of the clusters. While the results highlight the strengths of ViT in isolating visual patterns, challenges such as overlapping clusters and outliers underscore the limitations of single-modal approaches in this domain. This work contributes to understanding the role of Vision Transformers in analyzing online marketplaces and offers a foundation for future advancements in detecting fraudulent or illegal activities.
Feature Explosion: a generic optimization strategy for outlier detection algorithms
Outlier detection tasks aim at discovering potential issues or opportunities and are widely used in cybersecurity, financial security, industrial inspection, etc. To date, thousands of outlier detection algorithms have been proposed. Clearly, in real-world scenarios, such a large number of algorithms is unnecessary. In other words, a large number of outlier detection algorithms are redundant. We believe the root cause of this redundancy lies in the current highly customized (i.e., non-generic) optimization strategies. Specifically, when researchers seek to improve the performance of existing outlier detection algorithms, they have to design separate optimized versions tailored to the principles of each algorithm, leading to an ever-growing number of outlier detection algorithms. To address this issue, in this paper, we introduce the explosion from physics into the outlier detection task and propose a generic optimization strategy based on feature explosion, called OSD (Optimization Strategy for outlier Detection algorithms). In the future, when improving the performance of existing outlier detection algorithms, it will be sufficient to invoke the OSD plugin without the need to design customized optimized versions for them. We compared the performances of 14 outlier detection algorithms on 24 datasets before and after invoking the OSD plugin. The experimental results show that the performances of all outlier detection algorithms are improved on almost all datasets. In terms of average accuracy, OSD make these outlier detection algorithms improve by 15% (AUC), 63.7% (AP).
Export Reviews, Discussions, Author Feedback and Meta-Reviews
The paper starts by situating the problem and motivating their approach - essentially, enabling embeddings for weighted graphs by extending the original implicit embedding performed in Lovasz "Shannon capacity of a graph" paper, which operated on binary graphs. The paper then presents the connection between kernel machines and the Lovasz number in the unweighted case, and extends previous work for vertex-weighted, edge-weighted, and LS-labelled graphs. Section 3 provides practical details for computation, and section 4 motivates the use of their approach for the clustering problem - setting the number of clusters by using their \vartheta 1 bound, and initialising the clusters by starting by vertices with large alpha_i values. Finally, they show results on max-cut, clustering, overlapping clustering, and summarization tasks. The paper ties together very different work to propose a coherent approach to graph embedding. The contributions are clearly laid out, and the references to previous work is well established and used.
Computing and Learning on Combinatorial Data
The twenty-first century is a data-driven era where human activities and behavior, physical phenomena, scientific discoveries, technology advancements, and almost everything that happens in the world resulting in massive generation, collection, and utilization of data. Connectivity in data is a crucial property. A straightforward example is the World Wide Web, where every webpage is connected to other web pages through hyperlinks, providing a form of directed connectivity. Combinatorial data refers to combinations of data items based on certain connectivity rules. Other forms of combinatorial data include social networks, meshes, community clusters, set systems, and molecules. This Ph.D. dissertation focuses on learning and computing with combinatorial data. We study and examine topological and connectivity features within and across connected data to improve the performance of learning and achieve high algorithmic efficiency.
Online Robot Motion Planning Methodology Guided by Group Social Proxemics Feature
Mu, Xuan, Liu, Xiaorui, Guo, Shuai, Chi, Wenzheng, Wang, Wei, Ge, Shuzhi Sam
Nowadays robot is supposed to demonstrate human-like perception, reasoning and behavior pattern in social or service application. However, most of the existing motion planning methods are incompatible with above requirement. A potential reason is that the existing navigation algorithms usually intend to treat people as another kind of obstacle, and hardly take the social principle or awareness into consideration. In this paper, we attempt to model the proxemics of group and blend it into the scenario perception and navigation of robot. For this purpose, a group clustering method considering both social relevance and spatial confidence is introduced. It can enable robot to identify individuals and divide them into groups. Next, we propose defining the individual proxemics within magnetic dipole model, and further established the group proxemics and scenario map through vector-field superposition. On the basis of the group clustering and proxemics modeling, we present the method to obtain the optimal observation positions (OOPs) of group. Once the OOPs grid and scenario map are established, a heuristic path is employed to generate path that guide robot cruising among the groups for interactive purpose. A series of experiments are conducted to validate the proposed methodology on the practical robot, the results have demonstrated that our methodology has achieved promising performance on group recognition accuracy and path-generation efficiency. This concludes that the group awareness evolved as an important module to make robot socially behave in the practical scenario.
Describing Nonstationary Data Streams in Frequency Domain
Concept drift is among the primary challenges faced by the data stream processing methods. The drift detection strategies, designed to counteract the negative consequences of such changes, often rely on analyzing the problem metafeatures. This work presents the Frequency Filtering Metadescriptor -- a tool for characterizing the data stream that searches for the informative frequency components visible in the sample's feature vector. The frequencies are filtered according to their variance across all available data batches. The presented solution is capable of generating a metadescription of the data stream, separating chunks into groups describing specific concepts on its basis, and visualizing the frequencies in the original spatial domain. The experimental analysis compared the proposed solution with two state-of-the-art strategies and with the PCA baseline in the post-hoc concept identification task. The research is followed by the identification of concepts in the real-world data streams. The generalization in the frequency domain adapted in the proposed solution allows to capture the complex feature dependencies as a reduced number of frequency components, while maintaining the semantic meaning of data.
Modular Training of Neural Networks aids Interpretability
Golechha, Satvik, Chaudhary, Maheep, Velja, Joan, Abate, Alessandro, Schoots, Nandi
An approach to improve neural network interpretability is via clusterability, i.e., splitting a model into disjoint clusters that can be studied independently. We define a measure for clusterability and show that pre-trained models form highly enmeshed clusters via spectral graph clustering. We thus train models to be more modular using a "clusterability loss" function that encourages the formation of non-interacting clusters. Using automated interpretability techniques, we show that our method can help train models that are more modular and learn different, disjoint, and smaller circuits. We investigate CNNs trained on MNIST and CIFAR, small transformers trained on modular addition, and language models. Our approach provides a promising direction for training neural networks that learn simpler functions and are easier to interpret.
Constant-Factor Distortion Mechanisms for $k$-Committee Election
Pulyassary, Haripriya, Swamy, Chaitanya
In the $k$-committee election problem, we wish to aggregate the preferences of $n$ agents over a set of alternatives and select a committee of $k$ alternatives that minimizes the cost incurred by the agents. While we typically assume that agent preferences are captured by a cardinal utility function, in many contexts we only have access to ordinal information, namely the agents' rankings over the outcomes. As preference rankings are not as expressive as cardinal utilities, a loss of efficiency is inevitable, and is quantified by the notion of \emph{distortion}. We study the problem of electing a $k$-committee that minimizes the sum of the $\ell$-largest costs incurred by the agents, when agents and candidates are embedded in a metric space. This problem is called the $\ell$-centrum problem and captures both the utilitarian and egalitarian objectives. When $k \geq 2$, it is not possible to compute a bounded-distortion committee using purely ordinal information. We develop the first algorithms (that we call mechanisms) for the $\ell$-centrum problem (when $k \geq 2$), which achieve $O(1)$-distortion while eliciting only a very limited amount of cardinal information via value queries. We obtain two types of query-complexity guarantees: $O(\log k \log n)$ queries \emph{per agent}, and $O(k^2 \log^2 n)$ queries \emph{in total} (while achieving $O(1)$-distortion in both cases). En route, we give a simple adaptive-sampling algorithm for the $\ell$-centrum $k$-clustering problem.