AITopics

2509.02653

Country: North America > United States > Texas (0.67)

Genre: Research Report > New Finding (0.87)

Industry:

Energy > Power Industry (1.00)
Government > Regional Government > North America Government > United States Government (0.94)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Irawan, Patrick Amadeus, Diandaru, Ryandito, Syuhada, Belati Jagad Bintang, Suchrady, Randy Zakya, Aji, Alham Fikri, Winata, Genta Indra, Koto, Fajri, Cahyawijaya, Samuel

Entropy2Vec: Crosslingual Language Modeling Entropy as End-to-End Learnable Language Representations

arXiv.org Artificial IntelligenceSep-8-2025

We introduce Entropy2Vec, a novel framework for deriving cross-lingual language representations by leveraging the entropy of monolingual language models. Unlike traditional typological inventories that suffer from feature sparsity and static snapshots, Entropy2Vec uses the inherent uncertainty in language models to capture typological relationships between languages. By training a language model on a single language, we hypothesize that the entropy of its predictions reflects its structural similarity to other languages: Low entropy indicates high similarity, while high entropy suggests greater divergence. This approach yields dense, non-sparse language embeddings that are adaptable to different timeframes and free from missing values. Empirical evaluations demonstrate that Entropy2Vec embeddings align with established typological categories and achieved competitive performance in downstream multilingual NLP tasks, such as those addressed by the LinguAlchemy framework.

computational linguistic, machine learning, natural language, (19 more...)

2509.0506

Country:

Europe (1.00)
North America (0.68)
Asia > Middle East > UAE (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.64)

arXiv.org Artificial IntelligenceSep-5-2025

Topic Identification in LLM Input-Output Pairs through the Lens of Information Bottleneck

Halperin, Igor

Large Language Models (LLMs) are prone to critical failure modes, including \textit{intrinsic faithfulness hallucinations} (also known as confabulations), where a response deviates semantically from the provided context. Frameworks designed to detect this, such as Semantic Divergence Metrics (SDM), rely on identifying latent topics shared between prompts and responses, typically by applying geometric clustering to their sentence embeddings. This creates a disconnect, as the topics are optimized for spatial proximity, not for the downstream information-theoretic analysis. In this paper, we bridge this gap by developing a principled topic identification method grounded in the Deterministic Information Bottleneck (DIB) for geometric clustering. Our key contribution is to transform the DIB method into a practical algorithm for high-dimensional data by substituting its intractable KL divergence term with a computationally efficient upper bound. The resulting method, which we dub UDIB, can be interpreted as an entropy-regularized and robustified version of K-means that inherently favors a parsimonious number of informative clusters. By applying UDIB to the joint clustering of LLM prompt and response embeddings, we generate a shared topic representation that is not merely spatially coherent but is fundamentally structured to be maximally informative about the prompt-response relationship. This provides a superior foundation for the SDM framework and offers a novel, more sensitive tool for detecting confabulations.

large language model, machine learning, natural language, (18 more...)

2509.03533

Genre: Research Report (1.00)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.70)

Bassini, Chiara Fusar, Xu, Alice Lixuan, Canales, Jorge Sánchez, Hirth, Lion, Kaack, Lynn H.

Revealing the empirical flexibility of gas units through deep clustering

arXiv.org Artificial IntelligenceSep-5-2025

The flexibility of a power generation unit determines how quickly and often it can ramp up or down. In energy models, it depends on assumptions on the technical characteristics of the unit, such as its installed capacity or turbine technology. In this paper, we learn the empirical flexibility of gas units from their electricity generation, revealing how real-world limitations can lead to substantial differences between units with similar technical characteristics. Using a novel deep clustering approach, we transform 5 years (2019-2023) of unit-level hourly generation data for 49 German units from 100 MWp of installed capacity into low-dimensional embeddings. Our unsupervised approach identifies two clusters of peaker units (high flexibility) and two clusters of non-peaker units (low flexibility). The estimated ramp rates of non-peakers, which constitute half of the sample, display a low empirical flexibility, comparable to coal units. Non-peakers, predominantly owned by industry and municipal utilities, show limited response to low residual load and negative prices, generating on average 1.3 GWh during those hours. As the transition to renewables increases market variability, regulatory changes will be needed to unlock this flexibility potential.

artificial intelligence, data mining, machine learning, (18 more...)

2504.16943

Country: Europe > Germany (0.29)

Genre: Research Report > New Finding (0.94)

Industry:

Energy > Power Industry (1.00)
Government > Regional Government > Europe Government (0.46)
Materials > Metals & Mining > Coal (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Coda, Elizabeth, Arias-Castro, Ery, Mishne, Gal

Cluster and then Embed: A Modular Approach for Visualization

arXiv.org Machine LearningSep-4-2025

Dimensionality reduction methods such as t-SNE and UMAP are popular methods for visualizing data with a potential (latent) clustered structure. They are known to group data points at the same time as they embed them, resulting in visualizations with well-separated clusters that preserve local information well. However, t-SNE and UMAP also tend to distort the global geometry of the underlying data. We propose a more transparent, modular approach consisting of first clustering the data, then embedding each cluster, and finally aligning the clusters to obtain a global embedding. We demonstrate this approach on several synthetic and real-world datasets and show that it is competitive with existing methods, while being much more transparent.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

2509.03373

Country:

Europe > Netherlands > South Holland > Leiden (0.05)
North America > United States > California > San Diego County > San Diego (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Workflow (0.68)

Technology:

Information Technology > Data Science > Data Mining (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Goudarzi, Hojjat Torabi, Pouyan, Maziyar Baran

Enhanced Single-Cell RNA-seq Embedding through Gene Expression and Data-Driven Gene-Gene Interaction Integration

Single-cell RNA sequencing (scRNA-seq) provides unprecedented insights into cellular heterogeneity, enabling detailed analysis of complex biological systems at single-cell resolution. However, the high dimensionality and technical noise inherent in scRNA-seq data pose significant analytical challenges. While current embedding methods focus primarily on gene expression levels, they often overlook crucial gene-gene interactions that govern cellular identity and function. To address this limitation, we present a novel embedding approach that integrates both gene expression profiles and data-driven gene-gene interactions. Our method first constructs a Cell-Leaf Graph (CLG) using random forest models to capture regulatory relationships between genes, while simultaneously building a K-Nearest Neighbor Graph (KNNG) to represent expression similarities between cells. These graphs are then combined into an Enriched Cell-Leaf Graph (ECLG), which serves as input for a graph neural network to compute cell embeddings. By incorporating both expression levels and gene-gene interactions, our approach provides a more comprehensive representation of cellular states. Extensive evaluation across multiple datasets demonstrates that our method enhances the detection of rare cell populations and improves downstream analyses such as visualization, clustering, and trajectory inference. This integrated approach represents a significant advance in single-cell data analysis, offering a more complete framework for understanding cellular diversity and dynamics.

artificial intelligence, interaction, machine learning, (17 more...)

doi: 10.1016/j.compbiomed.2025.109880

2509.02639

Country: North America > United States (0.68)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Contrastive clustering based on regular equivalence for influential node identification in complex networks

Hu, Yanmei, Wu, Yihang, Sun, Bing, Yue, Xue, Cai, Biao, Li, Xiangtao, Chen, Yang

Identifying influential nodes in complex networks is a fundamental task in network analysis with wide-ranging applications across domains. While deep learning has advanced node influence detection, existing supervised approaches remain constrained by their reliance on labeled data, limiting their applicability in real-world scenarios where labels are scarce or unavailable. While contrastive learning demonstrates significant potential for performance enhancement, existing approaches predominantly rely on multiple-embedding generation to construct positive/negative sample pairs. To overcome these limitations, we propose ReCC (\textit{r}egular \textit{e}quivalence-based \textit{c}ontrastive \textit{c}lustering), a novel deep unsupervised framework for influential node identification. We first reformalize influential node identification as a label-free deep clustering problem, then develop a contrastive learning mechanism that leverages regular equivalence-based similarity, which captures structural similarities between nodes beyond local neighborhoods, to generate positive and negative samples. This mechanism is integrated into a graph convolutional network to learn node embeddings that are used to differentiate influential from non-influential nodes. ReCC is pre-trained using network reconstruction loss and fine-tuned with a combined contrastive and clustering loss, with both phases being independent of labeled data. Additionally, ReCC enhances node representations by combining structural metrics with regular equivalence-based similarities. Extensive experiments demonstrate that ReCC outperforms state-of-the-art approaches across several benchmarks.

artificial intelligence, machine learning, node, (14 more...)

2509.02609

Country: Asia > China (0.28)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Petnehazi, Gabor, Aradi, Bernadett

HERCULES: Hierarchical Embedding-based Recursive Clustering Using LLMs for Efficient Summarization

The explosive growth of complex datasets across various modalities necessitates advanced analytical tools that not only group data effectively but also provide human-understandable insights into the discovered structures. We introduce HERCULES (Hierarchical Embedding-based Recursive Clustering Using LLMs for Efficient Summarization), a novel algorithm and Python package designed for hierarchical k-means clustering of diverse data types, including text, images, and numeric data (processed one modality per run). HERCULES constructs a cluster hierarchy by recursively applying k-means clustering, starting from individual data points at level 0. A key innovation is its deep integration of Large Language Models (LLMs) to generate semantically rich titles and descriptions for clusters at each level of the hierarchy, significantly enhancing interpretability. The algorithm supports two main representation modes: `direct' mode, which clusters based on original data embeddings or scaled numeric features, and `description' mode, which clusters based on embeddings derived from LLM-generated summaries. Users can provide a `topic\_seed' to guide LLM-generated summaries towards specific themes. An interactive visualization tool facilitates thorough analysis and understanding of the clustering results. We demonstrate HERCULES's capabilities and discuss its potential for extracting meaningful, hierarchical knowledge from complex datasets.

hercules, large language model, machine learning, (20 more...)

2506.19992

Genre: Research Report > Experimental Study (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

De Vita, Gianmarco, Humbatova, Nargiz, Tonella, Paolo

TopoMap: A Feature-based Semantic Discriminator of the Topographical Regions in the Test Input Space

Testing Deep Learning (DL)-based systems is an open challenge. Although it is relatively easy to find inputs that cause a DL model to misbehave, the grouping of inputs by features that make the DL model under test fail is largely unexplored. Existing approaches for DL testing introduce perturbations that may focus on specific failure-inducing features, while neglecting others that belong to different regions of the feature space. In this paper, we create an explicit topographical map of the input feature space. Our approach, named TopoMap, is both black-box and model-agnostic as it relies solely on features that characterise the input space. To discriminate the inputs according to the specific features they share, we first apply dimensionality reduction to obtain input embeddings, which are then subjected to clustering. Each DL model might require specific embedding computations and clustering algorithms to achieve a meaningful separation of inputs into discriminative groups. We propose a novel way to evaluate alternative configurations of embedding and clustering techniques. We used a deep neural network (DNN) as an approximation of a human evaluator who could tell whether a pair of clusters can be discriminated based on the features of the included elements. We use such a DNN to automatically select the optimal topographical map of the inputs among all those that are produced by different embedding/clustering configurations. The evaluation results show that the maps generated by TopoMap consist of distinguishable and meaningful regions. In addition, we evaluate the effectiveness of TopoMap using mutation analysis. In particular, we assess whether the clusters in our topographical map allow for an effective selection of mutation-killing inputs. Experimental results show that our approach outperforms random selection by 35% on average on killable mutants; by 61% on non-killable ones.

artificial intelligence, configuration, machine learning, (17 more...)

2509.03242

Country: North America > United States (0.46)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Transportation > Ground (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

A Data-Driven RetinaNet Model for Small Object Detection in Aerial Images

Tang, Zhicheng, Tang, Jinwen, Shang, Yi

In the realm of aerial imaging, the ability to detect small objects is pivotal for a myriad of applications, encompassing environmental surveillance, urban design, and crisis management. Leveraging RetinaNet, this work unveils DDR-Net: a data-driven, deep-learning model devised to enhance the detection of diminutive objects. DDR-Net introduces novel, data-driven techniques to autonomously ascertain optimal feature maps and anchor estimations, cultivating a tailored and proficient training process while maintaining precision. Additionally, this paper presents an innovative sampling technique to bolster model efficacy under limited data training constraints. The model's enhanced detection capabilities support critical applications including wildlife and habitat monitoring, traffic flow optimization, and public safety improvements through accurate identification of small objects like vehicles and pedestrians. DDR-Net significantly reduces the cost and time required for data collection and training, offering efficient performance even with limited data. Empirical assessments over assorted aerial avian imagery datasets demonstrate that DDR-Net markedly surpasses RetinaNet and alternative contemporary models. These innovations advance current aerial image analysis technologies and promise wide-ranging impacts across multiple sectors including agriculture, security, and archaeology.

artificial intelligence, deep learning, machine learning, (16 more...)

2509.02928

Country: North America > United States > Missouri (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.71)