Clustering
ConvergeWriter: Data-Driven Bottom-Up Article Construction
Ji, Binquan, Wang, Jiaqi, Li, Ruiting, Han, Xingchen, Qi, Yiyang, Wang, Shichao, Lu, Yifei, Han, Yuantao, Ren, Feiliang
Large Language Models (LLMs) have shown remarkable prowess in text generation, yet producing long-form, factual documents grounded in extensive external knowledge bases remains a significant challenge. Existing "top-down" methods, which first generate a hypothesis or outline and then retrieve evidence, often suffer from a disconnect between the model's plan and the available knowledge, leading to content fragmentation and factual inaccuracies. To address these limitations, we propose a novel "bottom-up," data-driven framework that inverts the conventional generation pipeline. Our approach is predicated on a "Retrieval-First for Knowledge, Clustering for Structure" strategy, which first establishes the "knowledge boundaries" of the source corpus before any generative planning occurs. Specifically, we perform exhaustive iterative retrieval from the knowledge base and then employ an unsupervised clustering algorithm to organize the retrieved documents into distinct "knowledge clusters." These clusters form an objective, data-driven foundation that directly guides the subsequent generation of a hierarchical outline and the final document content. This bottom-up process ensures that the generated text is strictly constrained by and fully traceable to the source material, proactively adapting to the finite scope of the knowledge base and fundamentally mitigating the risk of hallucination. Experimental results on both 14B and 32B parameter models demonstrate that our method achieves performance comparable to or exceeding state-of-the-art baselines, and is expected to demonstrate unique advantages in knowledge-constrained scenarios that demand high fidelity and structural coherence. Our work presents an effective paradigm for generating reliable, structured, long-form documents, paving the way for more robust LLM applications in high-stakes, knowledge-intensive domains.
MMM: Clustering Multivariate Longitudinal Mixed-type Data
Amato, Francesco, Jacques, Julien
Multivariate longitudinal data of mixed type are increasingly collected in many scientific domains. However, algorithms to cluster this kind of data remain scarce, owing to the difficulty of simultaneously modeling the within- and between-time dependence structures of multivariate mixed-type data. We introduce the Mixture of Mixed-Matrices (MMM) model: by reorganizing the data in a three-way structure and assuming that the non-continuous variables are observations of underlying latent continuous variables, the model relies on a mixture of matrix-variate normal distributions to perform clustering in the latent dimension. The MMM model is thus able to handle continuous, ordinal, binary, nominal and count data and to concurrently model the heterogeneity, the association among the responses and the temporal dependence structure in a parsimonious way and without assuming conditional independence. The inference is carried out through an MCMC-EM algorithm, which is detailed. An evaluation on synthetic data demonstrates the model's inference abilities. A real-world application on financial data is presented.
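For reference, the standard matrix-variate normal density behind such a mixture can be written as follows. The notation is an assumption made for illustration and is not taken from the paper: X is an r x T latent matrix (r responses observed at T time points), M_k is the mean matrix of cluster k, Sigma_k is the r x r covariance among responses, Psi_k is the T x T covariance across time, and pi_k are the mixing proportions.

```latex
% Matrix-variate normal density for an r x T matrix X (assumed notation)
\phi(X \mid M, \Sigma, \Psi)
  = (2\pi)^{-rT/2}\, |\Sigma|^{-T/2}\, |\Psi|^{-r/2}
    \exp\!\left( -\tfrac{1}{2}\,
      \operatorname{tr}\!\left[ \Sigma^{-1} (X - M)\, \Psi^{-1} (X - M)^{\top} \right]
    \right)

% Equivalent vectorised form, showing the separable (Kronecker) covariance
\operatorname{vec}(X) \sim \mathcal{N}\!\left( \operatorname{vec}(M),\; \Psi \otimes \Sigma \right)

% Finite mixture over K clusters
f(X) = \sum_{k=1}^{K} \pi_k\, \phi(X \mid M_k, \Sigma_k, \Psi_k),
\qquad \pi_k \ge 0,\quad \sum_{k=1}^{K} \pi_k = 1
```

The Kronecker structure is one way to read the abstract's "parsimonious": each cluster needs r(r+1)/2 + T(T+1)/2 covariance parameters (up to a scale constant shared between Sigma and Psi) rather than the rT(rT+1)/2 of an unstructured covariance on vec(X).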
SpaPool: Soft Partition Assignment Pooling for Graph Neural Networks
Govan, Rodrigue, Scherrer, Romane, Fournier-Viger, Philippe, Selmaoui-Folcher, Nazha
This paper introduces SpaPool, a novel pooling method for graph neural networks that combines the strengths of dense and sparse techniques. SpaPool groups vertices into an adaptive number of clusters and aims to maintain the structural integrity of the graph while reducing its size efficiently. Experimental results on several datasets demonstrate that SpaPool achieves competitive performance compared to existing pooling techniques and excels particularly on small-scale graphs. This makes SpaPool a promising method for applications requiring efficient and effective graph processing.
ProteuS: A Generative Approach for Simulating Concept Drift in Financial Markets
Suárez-Cetrulo, Andrés L., Cervantes, Alejandro, Quintana, David
Financial markets are complex, non-stationary systems where the underlying data distributions can shift over time, a phenomenon known as regime change, or as concept drift in the machine learning literature. These shifts, often triggered by major economic events, pose a significant challenge for traditional statistical and machine learning models. A fundamental problem in developing and validating adaptive algorithms is the lack of a ground truth in real-world financial data, making it difficult to evaluate a model's ability to detect and recover from these drifts. This paper addresses this challenge by introducing a novel framework, named ProteuS, for generating semi-synthetic financial time series with pre-defined structural breaks. Our methodology involves fitting ARMA-GARCH models to real-world ETF data to capture distinct market regimes, and then simulating realistic, gradual, and abrupt transitions between them. The resulting datasets, which include a comprehensive set of technical indicators, provide a controlled environment with a known ground truth of regime changes. An analysis of the generated data confirms the complexity of the task, revealing significant overlap between the different market states. We aim to provide the research community with a tool for the rigorous evaluation of concept drift detection and adaptation mechanisms, paving the way for more robust financial forecasting models.
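The idea of simulating a series with a known break between two regimes can be sketched as below. This is a minimal illustration, not the ProteuS implementation: it uses an AR(1)-GARCH(1,1) process as a simple member of the ARMA-GARCH family, the `Regime` class and all parameter values are hypothetical rather than fitted to ETF data, and modelling a gradual transition as a linear blend of parameters is an assumption.

```python
import random
from dataclasses import dataclass


@dataclass
class Regime:
    """Hypothetical AR(1)-GARCH(1,1) parameters for one market regime."""
    mu: float     # constant in the mean equation
    phi: float    # AR(1) coefficient
    omega: float  # constant in the variance equation
    alpha: float  # weight on the previous squared shock
    beta: float   # weight on the previous conditional variance


def _mix(x, y, w):
    """Linear interpolation: w=0 returns x, w=1 returns y."""
    return (1.0 - w) * x + w * y


def blend(a, b, w):
    """Blend two regimes parameter by parameter (assumed transition model)."""
    return Regime(_mix(a.mu, b.mu, w), _mix(a.phi, b.phi, w),
                  _mix(a.omega, b.omega, w), _mix(a.alpha, b.alpha, w),
                  _mix(a.beta, b.beta, w))


def simulate(a, b, n, break_at, width=0, seed=0):
    """Simulate n returns that move from regime a to regime b.

    width=0 gives an abrupt break at index break_at; width>0 ramps the
    blend weight linearly over `width` steps. Returns (returns, weights),
    where weights is the ground-truth share of regime b at each step.
    """
    rng = random.Random(seed)
    returns, weights = [], []
    prev_r, prev_eps = a.mu, 0.0
    var = a.omega / (1.0 - a.alpha - a.beta)  # unconditional variance of a
    for t in range(n):
        if t < break_at:
            w = 0.0
        elif width == 0 or t >= break_at + width:
            w = 1.0
        else:
            w = (t - break_at) / width
        p = blend(a, b, w)
        # GARCH(1,1) variance recursion, then AR(1) mean equation
        var = p.omega + p.alpha * prev_eps ** 2 + p.beta * var
        eps = (var ** 0.5) * rng.gauss(0.0, 1.0)
        r = p.mu + p.phi * prev_r + eps
        returns.append(r)
        weights.append(w)
        prev_r, prev_eps = r, eps
    return returns, weights


# Illustrative parameter values only
calm = Regime(mu=0.0, phi=0.05, omega=1e-6, alpha=0.05, beta=0.90)
stress = Regime(mu=0.0, phi=0.10, omega=5e-6, alpha=0.10, beta=0.85)
abrupt_returns, abrupt_truth = simulate(calm, stress, n=200, break_at=100)
gradual_returns, gradual_truth = simulate(calm, stress, n=200, break_at=100, width=10)
```

The `weights` list plays the role of the known ground truth: a drift detector run on `returns` can be scored against the index where the weight leaves 0.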
Data-Driven Analysis of Text-Conditioned AI-Generated Music: A Case Study with Suno and Udio
Casini, Luca, Vila, Laura Cros, Dalmazzo, David, Kaila, Anna-Kaisa, Sturm, Bob L. T.
Online AI platforms for creating music from text prompts (AI music), such as Suno and Udio, are now being used by hundreds of thousands of users. Some AI music is appearing in advertising, and even charting, in multiple countries. How are these platforms being used? What subjects are inspiring their users? This article answers these questions for Suno and Udio using a large collection of songs generated by users of these platforms from May to October 2024. Using a combination of state-of-the-art text embedding models, dimensionality reduction and clustering methods, we analyze the prompts, tags and lyrics, and automatically annotate and display the processed data in interactive plots. Our results reveal prominent themes in lyrics, language preference, prompting strategies, as well as peculiar attempts at steering models through the use of metatags. To promote the musicological study of the developing cultural practice of AI-generated music we share our code and resources.
Disentanglement of Biological and Technical Factors via Latent Space Rotation in Clinical Imaging Improves Disease Pattern Discovery
Pan, Jeanny, Seeböck, Philipp, Fürböck, Christoph, Pochepnia, Svitlana, Straub, Jennifer, Beer, Lucian, Prosch, Helmut, Langs, Georg
Identifying new disease-related patterns in medical imaging data with the help of machine learning enlarges the vocabulary of recognizable findings. This supports diagnostic and prognostic assessment. However, image appearance varies not only due to biological differences, but also due to imaging technology linked to vendors and to scanning or reconstruction parameters. The resulting domain shifts impede data representation learning strategies and the discovery of biologically meaningful cluster appearances. To address these challenges, we introduce an approach to actively learn the domain shift via post-hoc rotation of the data latent space, enabling disentanglement of biological and technical factors. Results on real-world heterogeneous clinical data showcase that the learned disentangled representation leads to stable clusters representing tissue-types across different acquisition settings. Cluster consistency is improved by +19.01% (ARI), +16.85% (NMI), and +12.39% (Dice) compared to the entangled representation, outperforming four state-of-the-art harmonization methods. When using the clusters to quantify tissue composition on idiopathic pulmonary fibrosis patients, the learned profiles enhance Cox survival prediction. This indicates that the proposed label-free framework facilitates biomarker discovery in multi-center routine imaging data. Code is available on GitHub: https://github.com/cirmuw/latent-space-rotation-disentanglement.
Hierarchical Identity Learning for Unsupervised Visible-Infrared Person Re-Identification
Shi, Haonan, Wang, Yubin, Cheng, De, He, Lingfeng, Wang, Nannan, Gao, Xinbo
Unsupervised visible-infrared person re-identification (USVI-ReID) aims to learn modality-invariant image features from unlabeled cross-modal person datasets by reducing the modality gap while minimizing reliance on costly manual annotations. Existing methods typically address USVI-ReID using cluster-based contrastive learning, which represents a person by a single cluster center. However, they primarily focus on the commonality of images within each cluster while neglecting the finer-grained differences among them. To address the limitation, we propose a Hierarchical Identity Learning (HIL) framework. Since each cluster may contain several smaller sub-clusters that reflect fine-grained variations among images, we generate multiple memories for each existing coarse-grained cluster via a secondary clustering. Additionally, we propose Multi-Center Contrastive Learning (MCCL) to refine representations for enhancing intra-modal clustering and minimizing cross-modal discrepancies. To further improve cross-modal matching quality, we design a Bidirectional Reverse Selection Transmission (BRST) mechanism, which establishes reliable cross-modal correspondences by performing bidirectional matching of pseudo-labels. Extensive experiments conducted on the SYSU-MM01 and RegDB datasets demonstrate that the proposed method outperforms existing approaches. Visible-infrared person re-identification (VI-ReID) [1], [2], [3], [4], [5], [6] is an important research direction in the field of computer vision, aiming to match the images of the same person between the visible and infrared modalities.
Toward Quantum Utility in Finance: A Robust Data-Driven Algorithm for Asset Clustering
Sharma, Shivam, Venkatesh, Supreeth Mysore, Kachroo, Pushkin
Clustering financial assets based on return correlations is a fundamental task in portfolio optimization and statistical arbitrage. However, classical clustering methods often fall short when dealing with signed correlation structures, typically requiring lossy transformations and heuristic assumptions such as a fixed number of clusters. In this work, we apply the Graph-based Coalition Structure Generation algorithm (GCS-Q) to directly cluster signed, weighted graphs without relying on such transformations. GCS-Q formulates each partitioning step as a QUBO problem, enabling it to leverage quantum annealing for efficient exploration of exponentially large solution spaces. We validate our approach on both synthetic and real-world financial data, benchmarking against state-of-the-art classical algorithms such as SPONGE and k-Medoids. Our experiments demonstrate that GCS-Q consistently achieves higher clustering quality, as measured by Adjusted Rand Index and structural balance penalties, while dynamically determining the number of clusters. These results highlight the practical utility of near-term quantum computing for graph-based unsupervised learning in financial applications.
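A toy version of recursive bipartitioning on a signed graph may help make the QUBO step concrete. This is a sketch under stated assumptions, not the GCS-Q algorithm as published: the objective (minimise the total weight crossing the cut, with binary labels x_i), the stopping rule (stop when no cut has negative weight), and the exhaustive search that stands in for a quantum annealer are all illustrative choices, and the asset names and correlations are made up.

```python
from itertools import product


def cut_weight(nodes, w, x):
    """Quadratic binary objective: total weight of edges whose endpoints differ.

    For labels x_i in {0, 1}, the term x_i + x_j - 2*x_i*x_j equals 1 exactly
    when i and j land on opposite sides, so the sum is a QUBO in x.
    """
    total = 0.0
    for a in range(len(nodes)):
        for b in range(a + 1, len(nodes)):
            wij = w.get((nodes[a], nodes[b]), w.get((nodes[b], nodes[a]), 0.0))
            total += wij * (x[a] + x[b] - 2 * x[a] * x[b])
    return total


def best_split(nodes, w):
    """Exhaustively minimise the cut; feasible only for tiny graphs."""
    best_x, best_val = None, 0.0  # 0.0 is the value of "do not split"
    for tail in product((0, 1), repeat=len(nodes) - 1):
        x = (0,) + tail  # pin the first label to remove the mirror solution
        if not any(x):
            continue  # all nodes on one side is not a split
        val = cut_weight(nodes, w, x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x


def cluster(nodes, w):
    """Split recursively while some cut removes net-negative weight."""
    if len(nodes) < 2:
        return [nodes]
    x = best_split(nodes, w)
    if x is None:
        return [nodes]
    left = [n for n, xi in zip(nodes, x) if xi == 0]
    right = [n for n, xi in zip(nodes, x) if xi == 1]
    return cluster(left, w) + cluster(right, w)


# Hypothetical signed correlations between four assets
corr = {("A", "B"): 0.8, ("C", "D"): 0.7,
        ("A", "C"): -0.5, ("A", "D"): -0.4,
        ("B", "C"): -0.6, ("B", "D"): -0.3}
groups = cluster(["A", "B", "C", "D"], corr)
print(groups)  # [['A', 'B'], ['C', 'D']]
```

Because splitting stops on its own when every remaining cut would sever net-positive weight, the number of clusters is an output of the procedure rather than an input, which mirrors the property the abstract highlights.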
What Does Normal Even Mean? Evaluating Benign Traffic in Intrusion Detection Datasets
Wilkinson, Meghan, Thomson, Robert H
Supervised machine learning techniques rely on labeled data to achieve high task performance, but this requires the labels to capture some meaningful differences in the underlying data structure. For training network intrusion detection algorithms, most datasets contain a series of attack classes and a single large benign class which captures all non-attack network traffic. A review of intrusion detection papers and guides that explicitly state their data preprocessing steps identified that the majority took the labeled categories of the dataset at face value when training their algorithms. The present paper evaluates the structure of benign traffic in several common intrusion detection datasets (NSL-KDD, UNSW-NB15, and CIC-IDS 2017) and determines whether there are meaningful sub-categories within this traffic which may improve overall multi-classification performance using common machine learning techniques. We present an overview of some unsupervised clustering techniques (e.g., HDBSCAN, Mean Shift Clustering) and show how they differentially cluster the benign traffic space.
Persistent Homology of Topic Networks for the Prediction of Reader Curiosity
Hopp, Manuel D. S., Labatut, Vincent, Amalvy, Arthur, Dufour, Richard, Stone, Hannah, Jach, Hayley, Murayama, Kou
Reader curiosity, the drive to seek information, is crucial for textual engagement, yet remains relatively underexplored in NLP. Building on Loewenstein's Information Gap Theory, we introduce a framework that models reader curiosity by quantifying semantic information gaps within a text's semantic structure. Our approach leverages BERTopic-inspired topic modeling and persistent homology to analyze the evolving topology (connected components, cycles, voids) of a dynamic semantic network derived from text segments, treating these features as proxies for information gaps. To empirically evaluate this pipeline, we collect reader curiosity ratings from participants (n = 49) as they read S. Collins's "The Hunger Games" novel. We then use the topological features from our pipeline as independent variables to predict these ratings, and experimentally show that they significantly improve curiosity prediction compared to a baseline model (73% vs. 30% explained deviance), validating our approach. This pipeline offers a new computational method for analyzing text structure and its relation to reader engagement.
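The simplest of the topological features mentioned above, connected components, can be tracked with a union-find structure. The sketch below is a generic illustration and not the authors' pipeline: it assumes a hypothetical topic graph whose edges carry a distance-like weight, adds edges in increasing order of weight, and records the weight at which each component merges into another (its "death"). Cycles and voids require a dedicated persistent homology library and are not covered.

```python
def h0_persistence(nodes, weighted_edges):
    """Return death values of connected components under an edge filtration.

    Every node is born as its own component at filtration value 0. Edges are
    added from smallest to largest weight; each edge that joins two distinct
    components kills one of them at that weight.
    """
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving keeps trees shallow
            n = parent[n]
        return n

    deaths = []
    for u, v, weight in sorted(weighted_edges, key=lambda e: e[2]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            deaths.append(weight)
    return deaths


# Hypothetical topics and semantic distances between them
topics = ["t1", "t2", "t3", "t4"]
edges = [("t1", "t2", 0.2), ("t2", "t3", 0.5),
         ("t1", "t3", 0.9), ("t3", "t4", 0.7)]
deaths = h0_persistence(topics, edges)
surviving = len(topics) - len(deaths)  # components that never merge
print(deaths, surviving)  # [0.2, 0.5, 0.7] 1
```

The edge of weight 0.9 joins nodes that are already connected, so it kills no component; in a full persistent homology computation it would instead give birth to a cycle.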