AITopics

2508.17083

Country:

Europe (0.67)
Asia > India (0.45)
North America > United States (0.27)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Therapeutic Area > Oncology > Lung Cancer (0.45)
Health & Medicine > Therapeutic Area > Oncology > Brain Cancer (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
(5 more...)

Abuhani, Diaa Addeen, Seccaroni, Marco, Mazzarello, Martina, Zualkernan, Imran, Duarte, Fabio, Ratti, Carlo

Unsupervised Urban Tree Biodiversity Mapping from Street-Level Imagery Using Spatially-Aware Visual Clustering

arXiv.org Artificial IntelligenceAug-26-2025

Urban tree biodiversity is critical for climate resilience, ecological stability, and livability in cities, yet most municipalities lack detailed knowledge of their canopies. Field-based inventories provide reliable estimates of Shannon and Simpson diversity but are costly and time-consuming, while supervised AI methods require labeled data that often fail to generalize across regions. We introduce an unsupervised clustering framework that integrates visual embeddings from street-level imagery with spatial planting patterns to estimate biodiversity without labels. Applied to eight North American cities, the method recovers genus-level diversity patterns with high fidelity, achieving low Wasserstein distances to ground truth for Shannon and Simpson indices and preserving spatial autocorrelation. This scalable, fine-grained approach enables biodiversity mapping in cities lacking detailed inventories and offers a pathway for continuous, low-cost monitoring to support equitable access to greenery and adaptive management of urban ecosystems.

artificial intelligence, data mining, machine learning, (19 more...)

2508.13814

Country:

North America > United States > California > San Francisco County > San Francisco (0.15)
North America > United States > California > Los Angeles County > Los Angeles (0.15)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)
Information Technology > Data Science > Data Mining (0.89)

Ghasemzadeh, Maryam, van Beek, Anton

NOSTRA: A noise-resilient and sparse data framework for trust region based multi objective Bayesian optimization

Multi-objective Bayesian optimization (MOBO) struggles with sparse (non-space-filling), scarce (limited observations) datasets affected by experimental uncertainty, where identical inputs can yield varying outputs. These challenges are common in physical and simulation experiments (e.g., randomized medical trials and, molecular dynamics simulations) and are therefore incompatible with conventional MOBO methods. As a result, experimental resources are inefficiently allocated, leading to subop-timal designs. To address this challenge, we introduce NOSTRA (Noisy and Sparse Data Trust Region-based Optimization Algorithm), a novel sampling framework that integrates prior knowledge of experimental uncertainty to construct more accurate surrogate models while employing trust regions to focus sampling on promising areas of the design space. By strategically leveraging prior information and refining search regions, NOSTRA accelerates convergence to the Pareto frontier, enhances data efficiency, and improves solution quality. Through two test functions with varying levels of experimental uncertainty, we demonstrate that NOSTRA outperforms existing methods in handling noisy, sparse, and scarce data. Specifically, we illustrate that, NOSTRA effectively prioritizes regions where samples enhance the accuracy of the identified Pareto frontier, offering a resource-efficient algorithm that is practical in scenarios with limited experimental budgets while ensuring efficient performance.

artificial intelligence, machine learning, trust region, (17 more...)

2508.16476

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)

Kavuri, Vivek Hruday, Shroff, Gargi, Mishra, Rahul

NEAT: Concept driven Neuron Attribution in LLMs

Locating neurons that are responsible for final predictions is important for opening the black-box large language models and understanding the inside mechanisms. Previous studies have tried to find mechanisms that operate at the neuron level but these methods fail to represent a concept and there is also scope for further optimization of compute required. In this paper, with the help of concept vectors, we propose a method for locating significant neurons that are responsible for representing certain concepts and term those neurons as concept neurons. If the number of neurons is n and the number of examples is m, we reduce the number of forward passes required from O(n*m) to just O(n) compared to the previous works and hence optimizing the time and computation required over previous works. We also compare our method with several baselines and previous methods and our results demonstrate better performance than most of the methods and are more optimal when compared to the state-of-the-art method. We, as part of our ablation studies, also try to optimize the search for the concept neurons by involving clustering methods. Finally, we apply our methods to find, turn off the neurons that we find, and analyze its implications in parts of hate speech and bias in LLMs, and we also evaluate our bias part in terms of Indian context. Our methodology, analysis and explanations facilitate understating of neuron-level responsibility for more broader and human-like concepts and also lay a path for future research in this direction of finding concept neurons and intervening them.

large language model, machine learning, natural language, (15 more...)

2508.15875

Country: North America > United States (0.69)

Genre:

Research Report > New Finding (0.54)
Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Rahman, Mohammad Wali Ur, Nevarez, Ric, Mim, Lamia Tasnim, Hariri, Salim

SDEC: Semantic Deep Embedded Clustering

The high dimensional and semantically complex nature of textual Big data presents significant challenges for text clustering, which frequently lead to suboptimal groupings when using conventional techniques like k-means or hierarchical clustering. This work presents Semantic Deep Embedded Clustering (SDEC), an unsupervised text clustering framework that combines an improved autoencoder with transformer-based embeddings to overcome these challenges. This novel method preserves semantic relationships during data reconstruction by combining Mean Squared Error (MSE) and Cosine Similarity Loss (CSL) within an autoencoder. Furthermore, a semantic refinement stage that takes advantage of the contextual richness of transformer embeddings is used by SDEC to further improve a clustering layer with soft cluster assignments and distributional loss. The capabilities of SDEC are demonstrated by extensive testing on five benchmark datasets: AG News, Yahoo! Answers, DBPedia, Reuters 2, and Reuters 5. The framework not only outperformed existing methods with a clustering accuracy of 85.7% on AG News and set a new benchmark of 53.63% on Yahoo! Answers, but also showed robust performance across other diverse text corpora. These findings highlight the significant improvements in accuracy and semantic comprehension of text data provided by SDEC's advances in unsupervised text clustering.

artificial intelligence, machine learning, natural language, (22 more...)

2508.15823

Country:

Asia (0.67)
North America > United States > Arizona (0.28)
North America > United States > California (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology (0.67)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

DIDS: Domain Impact-aware Data Sampling for Large Language Model Training

Shi, Weijie, Zhang, Jipeng, Wu, Yaguang, Fang, Jingzhi, Zhang, Ruiyuan, Xu, Jiajie, Zhu, Jia, Chen, Hao, Zhao, Yao, Han, Sirui, Zhou, Xiaofang

Large language models (LLMs) are commonly trained on multi-domain datasets, where domain sampling strategies significantly impact model performance due to varying domain importance across downstream tasks. Existing approaches for optimizing domain-level sampling strategies struggle with maintaining intra-domain consistency and accurately measuring domain impact. In this paper, we present Domain Impact-aware Data Sampling (DIDS). To ensure intra-domain consistency, a gradient clustering algorithm is proposed to group training data based on their learning effects, where a proxy language model and dimensionality reduction are employed to reduce computational overhead. To accurately measure domain impact, we develop a Fisher Information Matrix (FIM) guided metric that quantifies how domain-specific parameter updates affect the model's output distributions on downstream tasks, with theoretical guarantees. Furthermore, to determine optimal sampling ratios, DIDS combines both the FIM-guided domain impact assessment and loss learning trajectories that indicate domain-specific potential, while accounting for diminishing marginal returns. Extensive experiments demonstrate that DIDS achieves 3.4% higher average performance while maintaining comparable training efficiency. The code is available at https://github.com/shiweijiezero/DIDS.

large language model, machine learning, natural language, (18 more...)