Clustering
LSEnet: Lorentz Structural Entropy Neural Network for Deep Graph Clustering
Sun, Li, Huang, Zhenhao, Peng, Hao, Wang, Yujie, Liu, Chunyang, Yu, Philip S.
Graph clustering is a fundamental problem in machine learning. Deep learning methods achieve the state-of-the-art results in recent years, but they still cannot work without predefined cluster numbers. Such limitation motivates us to pose a more challenging problem of graph clustering with unknown cluster number. We propose to address this problem from a fresh perspective of graph information theory (i.e., structural information). In the literature, structural information has not yet been introduced to deep clustering, and its classic definition falls short of discrete formulation and modeling node features. In this work, we first formulate a differentiable structural information (DSI) in the continuous realm, accompanied by several theoretical results. By minimizing DSI, we construct the optimal partitioning tree where densely connected nodes in the graph tend to have the same assignment, revealing the cluster structure. DSI is also theoretically presented as a new graph clustering objective, not requiring the predefined cluster number. Furthermore, we design a neural LSEnet in the Lorentz model of hyperbolic space, where we integrate node features to structural information via manifold-valued graph convolution. Extensive empirical results on real graphs show the superiority of our approach.
Bottleneck-Minimal Indexing for Generative Document Retrieval
Du, Xin, Xiu, Lixin, Tanaka-Ishii, Kumiko
We apply an information-theoretic perspective to reconsider generative document retrieval (GDR), in which a document $x \in X$ is indexed by $t \in T$, and a neural autoregressive model is trained to map queries $Q$ to $T$. GDR can be considered to involve information transmission from documents $X$ to queries $Q$, with the requirement to transmit more bits via the indexes $T$. By applying Shannon's rate-distortion theory, the optimality of indexing can be analyzed in terms of the mutual information, and the design of the indexes $T$ can then be regarded as a {\em bottleneck} in GDR. After reformulating GDR from this perspective, we empirically quantify the bottleneck underlying GDR. Finally, using the NQ320K and MARCO datasets, we evaluate our proposed bottleneck-minimal indexing method in comparison with various previous indexing methods, and we show that it outperforms those methods.
Cascade-based Randomization for Inferring Causal Effects under Diffusion Interference
Fatemi, Zahra, Pouget-Abadie, Jean, Zheleva, Elena
The presence of interference, where the outcome of an individual may depend on the treatment assignment and behavior of neighboring nodes, can lead to biased causal effect estimation. Current approaches to network experiment design focus on limiting interference through cluster-based randomization, in which clusters are identified using graph clustering, and cluster randomization dictates the node assignment to treatment and control. However, cluster-based randomization approaches perform poorly when interference propagates in cascades, whereby the response of individuals to treatment propagates to their multi-hop neighbors. When we have knowledge of the cascade seed nodes, we can leverage this interference structure to mitigate the resulting causal effect estimation bias. With this goal, we propose a cascade-based network experiment design that initiates treatment assignment from the cascade seed node and propagates the assignment to their multi-hop neighbors to limit interference during cascade growth and thereby reduce the overall causal effect estimation error. Our extensive experiments on real-world and synthetic datasets demonstrate that our proposed framework outperforms the existing state-of-the-art approaches in estimating causal effects in network data.
A Constraint-Enforcing Reward for Adversarial Attacks on Text Classifiers
Roth, Tom, Unanue, Inigo Jauregi, Abuadbba, Alsharif, Piccardi, Massimo
Text classifiers are vulnerable to adversarial examples -- correctly-classified examples that are deliberately transformed to be misclassified while satisfying acceptability constraints. The conventional approach to finding adversarial examples is to define and solve a combinatorial optimisation problem over a space of allowable transformations. While effective, this approach is slow and limited by the choice of transformations. An alternate approach is to directly generate adversarial examples by fine-tuning a pre-trained language model, as is commonly done for other text-to-text tasks. This approach promises to be much quicker and more expressive, but is relatively unexplored. For this reason, in this work we train an encoder-decoder paraphrase model to generate a diverse range of adversarial examples. For training, we adopt a reinforcement learning algorithm and propose a constraint-enforcing reward that promotes the generation of valid adversarial examples. Experimental results over two text classification datasets show that our model has achieved a higher success rate than the original paraphrase model, and overall has proved more effective than other competitive attacks. Finally, we show how key design choices impact the generated examples and discuss the strengths and weaknesses of the proposed approach.
Multi-order Graph Clustering with Adaptive Node-level Weight Learning
Liu, Ye, Lin, Xuelei, Chen, Yejia, Cheng, Reynold
Current graph clustering methods emphasize individual node and edge con nections, while ignoring higher-order organization at the level of motif. Re cently, higher-order graph clustering approaches have been designed by motif based hypergraphs. However, these approaches often suffer from hypergraph fragmentation issue seriously, which degrades the clustering performance greatly. Moreover, real-world graphs usually contain diverse motifs, with nodes participating in multiple motifs. A key challenge is how to achieve precise clustering results by integrating information from multiple motifs at the node level. In this paper, we propose a multi-order graph clustering model (MOGC) to integrate multiple higher-order structures and edge connections at node level. MOGC employs an adaptive weight learning mechanism to au tomatically adjust the contributions of different motifs for each node. This not only tackles hypergraph fragmentation issue but enhances clustering accuracy. MOGC is efficiently solved by an alternating minimization algo rithm. Experiments on seven real-world datasets illustrate the effectiveness of MOGC.
Using Unsupervised Learning to Explore Robot-Pedestrian Interactions in Urban Environments
Zug, Sebastian, Jäger, Georg, Seyffer, Norman, Plank, Martin, Licht, Gero, Siebert, Felix Wilhelm
This study identifies a gap in data-driven approaches to robot-centric pedestrian interactions and proposes a corresponding pipeline. The pipeline utilizes unsupervised learning techniques to identify patterns in interaction data of urban environments, specifically focusing on conflict scenarios. Analyzed features include the robot's and pedestrian's speed and contextual parameters such as proximity to intersections. They are extracted and reduced in dimensionality using Principal Component Analysis (PCA). Finally, K-means clustering is employed to uncover underlying patterns in the interaction data. A use case application of the pipeline is presented, utilizing real-world robot mission data from a mid-sized German city. The results indicate the need for enriching interaction representations with contextual information to enable fine-grained analysis and reasoning. Nevertheless, they also highlight the need for expanding the data set and incorporating additional contextual factors to enhance the robots situational awareness and interaction quality.
Effective Clustering on Large Attributed Bipartite Graphs
Yang, Renchi, Wu, Yidu, Lin, Xiaoyang, Wang, Qichen, Chan, Tsz Nam, Shi, Jieming
Attributed bipartite graphs (ABGs) are an expressive data model for describing the interactions between two sets of heterogeneous nodes that are associated with rich attributes, such as customer-product purchase networks and author-paper authorship graphs. Partitioning the target node set in such graphs into k disjoint clusters (referred to as k-ABGC) finds widespread use in various domains, including social network analysis, recommendation systems, information retrieval, and bioinformatics. However, the majority of existing solutions towards k-ABGC either overlook attribute information or fail to capture bipartite graph structures accurately, engendering severely compromised result quality. The severity of these issues is accentuated in real ABGs, which often encompass millions of nodes and a sheer volume of attribute data, rendering effective k-ABGC over such graphs highly challenging. In this paper, we propose TPO, an effective and efficient approach to k-ABGC that achieves superb clustering performance on multiple real datasets. TPO obtains high clustering quality through two major contributions: (i) a novel formulation and transformation of the k-ABGC problem based on multi-scale attribute affinity specialized for capturing attribute affinities between nodes with the consideration of their multi-hop connections in ABGs, and (ii) a highly efficient solver that includes a suite of carefully-crafted optimizations for sidestepping explicit affinity matrix construction and facilitating faster convergence. Extensive experiments, comparing TPO against 19 baselines over 5 real ABGs, showcase the superior clustering quality of TPO measured against ground-truth labels. Moreover, compared to the state of the arts, TPO is often more than 40x faster over both small and large ABGs.
Kernel spectral joint embeddings for high-dimensional noisy datasets using duo-landmark integral operators
Integrative analysis of multiple heterogeneous datasets has become standard practice in many research fields, especially in single-cell genomics and medical informatics. Existing approaches oftentimes suffer from limited power in capturing nonlinear structures, insufficient account of noisiness and effects of high-dimensionality, lack of adaptivity to signals and sample sizes imbalance, and their results are sometimes difficult to interpret. To address these limitations, we propose a novel kernel spectral method that achieves joint embeddings of two independently observed high-dimensional noisy datasets. The proposed method automatically captures and leverages possibly shared low-dimensional structures across datasets to enhance embedding quality. The obtained low-dimensional embeddings can be utilized for many downstream tasks such as simultaneous clustering, data visualization, and denoising. The proposed method is justified by rigorous theoretical analysis. Specifically, we show the consistency of our method in recovering the low-dimensional noiseless signals, and characterize the effects of the signal-to-noise ratios on the rates of convergence. Under a joint manifolds model framework, we establish the convergence of ultimate embeddings to the eigenfunctions of some newly introduced integral operators. These operators, referred to as duo-landmark integral operators, are defined by the convolutional kernel maps of some reproducing kernel Hilbert spaces (RKHSs). These RKHSs capture the either partially or entirely shared underlying low-dimensional nonlinear signal structures of the two datasets. Our numerical experiments and analyses of two single-cell omics datasets demonstrate the empirical advantages of the proposed method over existing methods in both embeddings and several downstream tasks.
Multilingual Substitution-based Word Sense Induction
Kokosinskii, Denis, Arefyev, Nikolay
Word Sense Induction (WSI) is the task of discovering senses of an ambiguous word by grouping usages of this word into clusters corresponding to these senses. Many approaches were proposed to solve WSI in English and a few other languages, but these approaches are not easily adaptable to new languages. We present multilingual substitution-based WSI methods that support any of 100 languages covered by the underlying multilingual language model with minimal to no adaptation required. Despite the multilingual capabilities, our methods perform on par with the existing monolingual approaches on popular English WSI datasets. At the same time, they will be most useful for lower-resourced languages which miss lexical resources available for English, thus, have higher demand for unsupervised methods like WSI.
Towards Modular LLMs by Building and Reusing a Library of LoRAs
Ostapenko, Oleksiy, Su, Zhan, Ponti, Edoardo Maria, Charlin, Laurent, Roux, Nicolas Le, Pereira, Matheus, Caccia, Lucas, Sordoni, Alessandro
The growing number of parameter-efficient adaptations of a base large language model (LLM) calls for studying whether we can reuse such trained adapters to improve performance for new tasks. We study how to best build a library of adapters given multi-task data and devise techniques for both zero-shot and supervised task generalization through routing in such library. We benchmark existing approaches to build this library and introduce model-based clustering, MBC, a method that groups tasks based on the similarity of their adapter parameters, indirectly optimizing for transfer across the multi-task dataset. To re-use the library, we present a novel zero-shot routing mechanism, Arrow, which enables dynamic selection of the most relevant adapters for new inputs without the need for retraining. We experiment with several LLMs, such as Phi-2 and Mistral, on a wide array of held-out tasks, verifying that MBC-based adapters and Arrow routing lead to superior generalization to new tasks. We make steps towards creating modular, adaptable LLMs that can match or outperform traditional joint training.