AITopics

2509.12811

Country: North America > Mexico > Mexico City (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Knowledge Management > Knowledge Engineering (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)

Amato, Francesco, Jacques, Julien

MMM: Clustering Multivariate Longitudinal Mixed-type Data

arXiv.org Machine LearningSep-16-2025

Multivariate longitudinal data of mixed-type are increasingly collected in many science domains. However, algorithms to cluster this kind of data remain scarce, due to the challenge to simultaneously model the within- and between-time dependence structures for multivariate data of mixed kind. We introduce the Mixture of Mixed-Matrices (MMM) model: reorganizing the data in a three-way structure and assuming that the non-continuous variables are observations of underlying latent continuous variables, the model relies on a mixture of matrix-variate normal distributions to perform clustering in the latent dimension. The MMM model is thus able to handle continuous, ordinal, binary, nominal and count data and to concurrently model the heterogeneity, the association among the responses and the temporal dependence structure in a parsimonious way and without assuming conditional independence. The inference is carried out through an MCMC-EM algorithm, which is detailed. An evaluation of the model through synthetic data shows its inference abilities. A real-world application on financial data is presented.

algorithm, journal, longitudinal data, (17 more...)

arXiv.org Machine Learning

2509.12166

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.82)

Industry:

Information Technology (1.00)
Health & Medicine (1.00)
Banking & Finance > Trading (1.00)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.88)

Govan, Rodrigue, Scherrer, Romane, Fournier-Viger, Philippe, Selmaoui-Folcher, Nazha

SpaPool: Soft Partition Assignment Pooling for__Graph Neural Networks

arXiv.org Machine LearningSep-16-2025

This paper introduces SpaPool, a novel pooling method that combines the strengths of both dense and sparse techniques for a graph neural network. SpaPool groups vertices into an adaptive number of clusters, leveraging the benefits of both dense and sparse approaches. It aims to maintain the structural integrity of the graph while reducing its size efficiently. Experimental results on several datasets demonstrate that SpaPool achieves competitive performance compared to existing pooling techniques and excels particularly on small-scale graphs. This makes SpaPool a promising method for applications requiring efficient and effective graph processing.

graph, node, spapool, (11 more...)

arXiv.org Machine Learning

2509.11675

Country:

Oceania > New Caledonia > South Province > Noumea (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Suárez-Cetrulo, Andrés L., Cervantes, Alejandro, Quintana, David

ProteuS: A Generative Approach for Simulating Concept Drift in Financial Markets

Financial markets are complex, non-stationary systems where the underlying data distributions can shift over time, a phenomenon known as regime changes, as well as concept drift in the machine learning literature. These shifts, often triggered by major economic events, pose a significant challenge for traditional statistical and machine learning models. A fundamental problem in developing and validating adaptive algorithms is the lack of a ground truth in real-world financial data, making it difficult to evaluate a model's ability to detect and recover from these drifts. This paper addresses this challenge by introducing a novel framework, named ProteuS, for generating semi-synthetic financial time series with pre-defined structural breaks. Our methodology involves fitting ARMA-GARCH models to real-world ETF data to capture distinct market regimes, and then simulating realistic, gradual, and abrupt transitions between them. The resulting datasets, which include a comprehensive set of technical indicators, provide a controlled environment with a known ground truth of regime changes. An analysis of the generated data confirms the complexity of the task, revealing significant overlap between the different market states. We aim to provide the research community with a tool for the rigorous evaluation of concept drift detection and adaptation mechanisms, paving the way for more robust financial forecasting models.

algorithm, artificial intelligence, machine learning, (17 more...)

2509.11844

Country: Europe > Ireland (0.28)

Genre:

Research Report (0.64)
Workflow (0.46)

Industry:

Banking & Finance > Trading (1.00)
Banking & Finance > Economy (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Casini, Luca, Vila, Laura Cros, Dalmazzo, David, Kaila, Anna-Kaisa, Sturm, Bob L. T.

Data-Driven Analysis of Text-Conditioned AI-Generated Music: A Case Study with Suno and Udio

Online AI platforms for creating music from text prompts (AI music), such as Suno and Udio, are now being used by hundreds of thousands of users. Some AI music is appearing in advertising, and even charting, in multiple countries. How are these platforms being used? What subjects are inspiring their users? This article answers these questions for Suno and Udio using a large collection of songs generated by users of these platforms from May to October 2024. Using a combination of state-of-the-art text embedding models, dimensionality reduction and clustering methods, we analyze the prompts, tags and lyrics, and automatically annotate and display the processed data in interactive plots. Our results reveal prominent themes in lyrics, language preference, prompting strategies, as well as peculiar attempts at steering models through the use of metatags. To promote the musicological study of the developing cultural practice of AI-generated music we share our code and resources.

large language model, lyric, machine learning, (22 more...)

2509.11824

Country: Europe (0.93)

Genre: Research Report > New Finding (0.48)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
(2 more...)

Disentanglement of Biological and Technical Factors via Latent Space Rotation in Clinical Imaging Improves Disease Pattern Discovery

Pan, Jeanny, Seeböck, Philipp, Fürböck, Christoph, Pochepnia, Svitlana, Straub, Jennifer, Beer, Lucian, Prosch, Helmut, Langs, Georg

Identifying new disease-related patterns in medical imaging data with the help of machine learning enlarges the vocabulary of recognizable findings. This supports diagnostic and prognostic assessment. However, image appearance varies not only due to biological differences, but also due to imaging technology linked to vendors, scanning- or re- construction parameters. The resulting domain shifts impedes data representation learning strategies and the discovery of biologically meaningful cluster appearances. To address these challenges, we introduce an approach to actively learn the domain shift via post-hoc rotation of the data latent space, enabling disentanglement of biological and technical factors. Results on real-world heterogeneous clinical data showcase that the learned disentangled representation leads to stable clusters representing tissue-types across different acquisition settings. Cluster consistency is improved by +19.01% (ARI), +16.85% (NMI), and +12.39% (Dice) compared to the entangled representation, outperforming four state-of-the-art harmonization methods. When using the clusters to quantify tissue composition on idiopathic pulmonary fibrosis patients, the learned profiles enhance Cox survival prediction. This indicates that the proposed label-free framework facilitates biomarker discovery in multi-center routine imaging data. Code is available on GitHub https://github.com/cirmuw/latent-space-rotation-disentanglement.

artificial intelligence, machine learning, representation, (14 more...)

2509.11436

Genre: Research Report > Experimental Study (0.90)

Industry:

Health & Medicine > Diagnostic Medicine > Imaging (0.89)
Health & Medicine > Therapeutic Area > Pulmonary/Respiratory Diseases (0.67)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Hierarchical Identity Learning for Unsupervised Visible-Infrared Person Re-Identification

Shi, Haonan, Wang, Yubin, Cheng, De, He, Lingfeng, Wang, Nannan, Gao, Xinbo

Abstract--Unsupervised visible-infrared person re-identification (USVI-ReID) aims to learn modality-invariant image features from unlabeled cross-modal person datasets by reducing the modality gap while minimizing reliance on costly manual annotations. Existing methods typically address USVI-ReID using cluster-based contrastive learning, which represents a person by a single cluster center . However, they primarily focus on the commonality of images within each cluster while neglecting the finer-grained differences among them. T o address the limitation, we propose a Hierarchical Identity Learning (HIL) framework. Since each cluster may contain several smaller sub-clusters that reflect fine-grained variations among images, we generate multiple memories for each existing coarse-grained cluster via a secondary clustering. Additionally, we propose Multi-Center Contrastive Learning (MCCL) to refine representations for enhancing intra-modal clustering and minimizing cross-modal discrepancies. T o further improve cross-modal matching quality, we design a Bidirectional Reverse Selection Transmission (BRST) mechanism, which establishes reliable cross-modal correspondences by performing bidirectional matching of pseudo-labels. Extensive experiments conducted on the SYSU-MM01 and RegDB datasets demonstrate that the proposed method outperforms existing approaches. ISIBLE-infrared person re-identification (VI-ReID) [1], [2], [3], [4], [5], [6] is an important research direction in the field of computer vision, aiming to match the images of the same person between the visible and infrared modalities.

artificial intelligence, machine learning, person re-identification, (16 more...)

2509.11587

Country: Asia > China > Shaanxi Province (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Sharma, Shivam, Venkatesh, Supreeth Mysore, Kachroo, Pushkin

Toward Quantum Utility in Finance: A Robust Data-Driven Algorithm for Asset Clustering

Clustering financial assets based on return correlations is a fundamental task in portfolio optimization and statistical arbitrage. However, classical clustering methods often fall short when dealing with signed correlation structures, typically requiring lossy transformations and heuristic assumptions such as a fixed number of clusters. In this work, we apply the Graph-based Coalition Structure Generation algorithm (GCS-Q) to directly cluster signed, weighted graphs without relying on such transformations. GCS-Q formulates each partitioning step as a QUBO problem, enabling it to leverage quantum annealing for efficient exploration of exponentially large solution spaces. We validate our approach on both synthetic and real-world financial data, benchmarking against state-of-the-art classical algorithms such as SPONGE and k-Medoids. Our experiments demonstrate that GCS-Q consistently achieves higher clustering quality, as measured by Adjusted Rand Index and structural balance penalties, while dynamically determining the number of clusters. These results highlight the practical utility of near-term quantum computing for graph-based unsupervised learning in financial applications.

algorithm, artificial intelligence, machine learning, (14 more...)

2509.07766

Country:

Europe (1.00)
North America > United States > Nevada (0.15)

Genre: Research Report (0.50)

Industry: Banking & Finance > Trading (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Wilkinson, Meghan, Thomson, Robert H

What Does Normal Even Mean? Evaluating Benign Traffic in Intrusion Detection Datasets

arXiv.org Artificial IntelligenceSep-12-2025

Supervised machine learning techniques rely on labeled data to achieve high task performance, but this requires the labels to capture some meaningful differences in the underlying data structure. For training network intrusion detection algorithms, most datasets contain a series of attack classes and a single large benign class which captures all non-attack network traffic. A review of intrusion detection papers and guides that explicitly state their data preprocessing steps identified that the majority took the labeled categories of the dataset at face value when training their algorithms. The present paper evaluates the structure of benign traffic in several common intrusion detection datasets (NSL-KDD, UNSW-NB15, and CIC-IDS 2017) and determines whether there are meaningful sub-categories within this traffic which may improve overall multi-classification performance using common machine learning techniques. We present an overview of some unsupervised clustering techniques (e.g., HDBSCAN, Mean Shift Clustering) and show how they differentially cluster the benign traffic space.

artificial intelligence, dataset, machine learning, (14 more...)

2509.09564

Country: North America > United States (0.28)

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

arXiv.org Artificial IntelligenceSep-12-2025

Persistent Homology of Topic Networks for the Prediction of Reader Curiosity

Hopp, Manuel D. S., Labatut, Vincent, Amalvy, Arthur, Dufour, Richard, Stone, Hannah, Jach, Hayley, Murayama, Kou

Reader curiosity, the drive to seek information, is crucial for textual engagement, yet remains relatively underexplored in NLP. Building on Loewenstein's Information Gap Theory, we introduce a framework that models reader curiosity by quantifying semantic information gaps within a text's semantic structure. Our approach leverages BERTopic-inspired topic modeling and persistent homology to analyze the evolving topology (connected components, cycles, voids) of a dynamic semantic network derived from text segments, treating these features as proxies for information gaps. To empirically evaluate this pipeline, we collect reader curiosity ratings from participants (n = 49) as they read S. Collins's ''The Hunger Games'' novel. We then use the topological features from our pipeline as independent variables to predict these ratings, and experimentally show that they significantly improve curiosity prediction compared to a baseline model (73% vs. 30% explained deviance), validating our approach. This pipeline offers a new computational method for analyzing text structure and its relation to reader engagement.

curiosity, data mining, machine learning, (19 more...)

doi: 10.18653/v1/2025.acl-long.1364

2506.11095

Country:

Asia (0.67)
Europe > United Kingdom (0.46)
Europe > Germany (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.67)

Industry: Education (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)