AITopics

2506.08564

Country:

Europe (1.00)
Asia (1.00)
Africa (0.68)
North America > United States (0.46)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

arXiv.org Artificial IntelligenceJun-11-2025

Integration of Old and New Knowledge for Generalized Intent Discovery: A Consistency-driven Prototype-Prompting Framework

Wei, Xiao, Wang, Xiaobao, Zhuang, Ning, Wang, Chenyang, Wang, Longbiao, dang, Jianwu

Intent detection aims to identify user intents from natural language inputs, where supervised methods rely heavily on labeled in-domain (IND) data and struggle with out-of-domain (OOD) intents, limiting their practical applicability. Generalized Intent Discovery (GID) addresses this by leveraging unlabeled OOD data to discover new intents without additional annotation. However, existing methods focus solely on clustering unsupervised data while neglecting domain adaptation. Therefore, we propose a consistency-driven prototype-prompting framework for GID from the perspective of integrating old and new knowledge, which includes a prototype-prompting framework for transferring old knowledge from external sources, and a hierarchical consistency constraint for learning new knowledge from target domains. We conducted extensive experiments and the results show that our method significantly outperforms all baseline methods, achieving state-of-the-art results, which strongly demonstrates the effectiveness and generalization of our methods. Our source code is publicly available at https://github.com/smileix/cpp.

large language model, machine learning, natural language, (21 more...)

2506.0849

Country: Asia > China (0.29)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Abedi, Ali, Cladera, Fernando, Farajijalal, Mohsen, Ehsani, Reza

Adaptive Per-Tree Canopy Volume Estimation Using Mobile LiDAR in Structured and Unstructured Orchards

arXiv.org Artificial IntelligenceJun-11-2025

--We present a real-time system for per-tree canopy volume estimation using mobile LiDAR data collected during routine robotic navigation. Unlike prior approaches that rely on static scans or assume uniform orchard structures, our method adapts to varying field geometries via an integrated pipeline of LiDAR-inertial odometry, adaptive segmentation, and geometric reconstruction. We evaluate the system across two commercial orchards, one pistachio orchard with regular spacing and one almond orchard with dense, overlapping crowns. A hybrid clustering strategy combining DBSCAN and spectral clustering enables robust per-tree segmentation, achieving 93% success in pistachio and 80% in almond, with strong agreement to drone-derived canopy volume estimates. Accurate estimation of tree canopy volume is fundamental to orchard management, with applications in yield prediction, biomass assessment, and optimized resource allocation [1].

artificial intelligence, machine learning, orchard, (13 more...)

2506.08061

Country: North America > United States > California > Merced County > Merced (0.15)

Genre: Research Report (1.00)

Industry: Food & Agriculture > Agriculture (0.75)

Technology:

Information Technology > Artificial Intelligence > Robots (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)

arXiv.org Machine LearningJun-10-2025

Strongly Consistent Community Detection in Popularity Adjusted Block Models

Yuan, Quan, Liu, Binghui, Li, Danning, Xue, Lingzhou

The Popularity Adjusted Block Model (PABM) provides a flexible framework for community detection in network data by allowing heterogeneous node popularity across communities. However, this flexibility increases model complexity and raises key unresolved challenges, particularly in effectively adapting spectral clustering techniques and efficiently achieving strong consistency in label recovery. To address these challenges, we first propose the Thresholded Cosine Spectral Clustering (TCSC) algorithm and establish its weak consistency under the PABM. We then introduce the one-step Refined TCSC algorithm and prove that it achieves strong consistency under the PABM, correctly recovering all community labels with high probability. We further show that the two-step Refined TCSC accelerates clustering error convergence, especially with small sample sizes. Additionally, we propose a data-driven approach for selecting the number of communities, which outperforms existing methods under the PABM. The effectiveness and robustness of our methods are validated through extensive simulations and real-world applications.

artificial intelligence, data mining, machine learning, (16 more...)

arXiv.org Machine Learning

2506.07224

Country:

North America > United States > New York (0.04)
North America > United States > Pennsylvania (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Hungary > Hajdú-Bihar County > Debrecen (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Karakuş, Oktay, Arkadaş, Hasan

Through the Gaps: Uncovering Tactical Line-Breaking Passes with Clustering

arXiv.org Machine LearningJun-10-2025

Line-breaking passes (LBPs) are crucial tactical actions in football, allowing teams to penetrate defensive lines and access high-value spaces. In this study, we present an unsupervised, clustering-based framework for detecting and analysing LBPs using synchronised event and tracking data from elite matches. Our approach models opponent team shape through vertical spatial segmentation and identifies passes that disrupt defensive lines within open play. Beyond detection, we introduce several tactical metrics, including the space build-up ratio (SBR) and two chain-based variants, LBPCh$^1$ and LBPCh$^2$, which quantify the effectiveness of LBPs in generating immediate or sustained attacking threats. We evaluate these metrics across teams and players in the 2022 FIFA World Cup, revealing stylistic differences in vertical progression and structural disruption. The proposed methodology is explainable, scalable, and directly applicable to modern performance analysis and scouting workflows.

artificial intelligence, data mining, machine learning, (16 more...)

arXiv.org Machine Learning

2506.06666

Country:

South America > Argentina (0.05)
Europe > Spain (0.05)
Europe > France (0.05)
(9 more...)

Genre: Research Report > New Finding (0.34)

Industry: Leisure & Entertainment > Sports > Soccer (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)

Zhang, Dekai, Williams, Matthew, Toni, Francesca

Clustered Federated Learning via Embedding Distributions

Federated learning (FL) is a widely used framework for machine learning in distributed data environments where clients hold data that cannot be easily centralised, such as for data protection reasons. FL, however, is known to be vulnerable to non-IID data. Clustered FL addresses this issue by finding more homogeneous clusters of clients. We propose a novel one-shot clustering method, EMD-CFL, using the Earth Mover's distance (EMD) between data distributions in embedding space. We theoretically motivate the use of EMDs using results from the domain adaptation literature and demonstrate empirically superior clustering performance in extensive comparisons against 16 baselines and on a range of challenging datasets.

artificial intelligence, deep learning, machine learning, (15 more...)

2506.07769

Genre: Research Report > New Finding (0.46)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Sana, Joydeb Kumar, Masud, Mohammad M., Rahman, M Sohel, Rahman, M Saifur

Patient Similarity Computation for Clinical Decision Support: An Efficient Use of Data Transformation, Combining Static and Time Series Data

Patient similarity computation (PSC) is a fundamental problem in healthcare informatics. The aim of the patient similarity computation is to measure the similarity among patients according to their historical clinical records, which helps to improve clinical decision support. This paper presents a novel distributed patient similarity computation (DPSC) technique based on data transformation (DT) methods, utilizing an effective combination of time series and static data. Time series data are sensor-collected patients' information, including metrics like heart rate, blood pressure, Oxygen saturation, respiration, etc. The static data are mainly patient background and demographic data, including age, weight, height, gender, etc. Static data has been used for clustering the patients. Before feeding the static data to the machine learning model adaptive Weight-of-Evidence (aWOE) and Z-score data transformation (DT) methods have been performed, which improve the prediction performances. In aWOE-based patient similarity models, sensitive patient information has been processed using aWOE which preserves the data privacy of the trained models. We used the Dynamic Time Warping (DTW) approach, which is robust and very popular, for time series similarity. However, DTW is not suitable for big data due to the significant computational run-time. To overcome this problem, distributed DTW computation is used in this study. For Coronary Artery Disease, our DT based approach boosts prediction performance by as much as 11.4%, 10.20%, and 12.6% in terms of AUC, accuracy, and F-measure, respectively. In the case of Congestive Heart Failure (CHF), our proposed method achieves performance enhancement up to 15.9%, 10.5%, and 21.9% for the same measures, respectively. The proposed method reduces the computation time by as high as 40%.

bioinformatics, data mining, machine learning, (18 more...)

2506.07092

Country:

Asia (0.93)
North America > United States > New York (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Biomedical Informatics > Clinical Informatics (1.00)
(3 more...)

Feature-Based Instance Neighbor Discovery: Advanced Stable Test-Time Adaptation in Dynamic World

Jiang, Qinting, Ye, Chuyang, Wei, Dongyan, Wang, Bingli, Xue, Yuan, Jiang, Jingyan, Wang, Zhi

Despite progress, deep neural networks still suffer performance declines under distribution shifts between training and test domains, leading to a substantial decrease in Quality of Experience (QoE) for applications. Existing test-time adaptation (TTA) methods are challenged by dynamic, multiple test distributions within batches. We observe that feature distributions across different domains inherently cluster into distinct groups with varying means and variances. This divergence reveals a critical limitation of previous global normalization strategies in TTA, which inevitably distort the original data characteristics. Based on this insight, we propose Feature-based Instance Neighbor Discovery (FIND), which comprises three key components: Layer-wise Feature Disentanglement (LFD), Feature Aware Batch Normalization (FABN) and Selective FABN (S-FABN). LFD stably captures features with similar distributions at each layer by constructing graph structures. While FABN optimally combines source statistics with test-time distribution specific statistics for robust feature representation. Finally, S-FABN determines which layers require feature partitioning and which can remain unified, thereby enhancing inference efficiency. Extensive experiments demonstrate that FIND significantly outperforms existing methods, achieving a 30\% accuracy improvement in dynamic scenarios while maintaining computational efficiency.

artificial intelligence, data mining, machine learning, (18 more...)

2506.06782

Country: Asia > China (0.29)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)

Is BERTopic Better than PLSA for Extracting Key Topics in Aviation Safety Reports?

Nanyonga, Aziida, Keith, Joiner, Ugur, Turhan, Graham, Wild

Is BERTopic Better than PLSA for Extracting Key Topics in Aviation Safety Reports? Abstract -- This study compares the effectiveness of BERTopic and Probabilistic Latent Semantic Analysis (PLSA) in extracting meaningful topics from aviation safety reports aiming to enhance the understanding of patterns in aviation incident data. Using a dataset of o ver 36,000 National Transportation Safety Board (NTSB) reports from 2000 - 2020, BERTopic employed transformer - based embeddings and hierarchical clustering, while PLSA utilized probabilistic modelling through the Expectation - Maximization (EM) algori thm. Results showed that BERTopic outperformed PLSA in topic coherence, achieving a C_v score of 0.41 compared to PLSA's 0.37, while also demonstrating superior interpretability as validated by aviation safety experts. These findings underscore the advantages of modern transformer - based approaches in analyzing complex aviatio n datasets, paving the way for enhanced insights and informed decision - making in aviation safety. Future work will explore hybrid models, multilingual datasets, and advanced clustering techniques to further improve topic modelling in this domain . The analysis of aviation safety reports is critical for identifying recurring issues and implementing measures to improve flight safety [1] .

artificial intelligence, machine learning, natural language, (18 more...)

2506.06328

Country: North America > United States (0.56)

Genre: Research Report > New Finding (1.00)

Industry:

Transportation > Air (1.00)
Government > Regional Government > North America Government > United States Government (0.56)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.70)

Proios, Dimitrios, Bornet, Alban, Yazdani, Anthony, Rodrigues, Jose F Jr, Teodoro, Douglas

ICU-TSB: A Benchmark for Temporal Patient Representation Learning for Unsupervised Stratification into Patient Cohorts

arXiv.org Artificial IntelligenceJun-9-2025

Patient stratification--identifying clinically meaningful subgroups--is essential for advancing personalized medicine through improved diagnostics and treatment strategies. Electronic health records (EHRs), particularly those from intensive care units (ICUs), contain rich temporal clinical data that can be leveraged for this purpose. In this work, we introduce ICU-TSB (Temporal Stratification Benchmark), the first comprehensive benchmark for evaluating patient stratification based on temporal patient representation learning using three publicly available ICU EHR datasets. A key contribution of our benchmark is a novel hierarchical evaluation framework utilizing disease taxonomies to measure the alignment of discovered clusters with clinically validated disease groupings. In our experiments with ICU-TSB, we compared statistical methods and several recurrent neural networks, including LSTM and GRU, for their ability to generate effective patient representations for subsequent clustering of patient trajectories. Our results demonstrate that temporal representation learning can rediscover clinically meaningful patient cohorts; nevertheless, it remains a challenging task, with v-measuring varying from up to 0.46 at the top level of the taxonomy to up to 0.40 at the lowest level. To further enhance the practical utility of our findings, we also evaluate multiple strategies for assigning interpretable labels to the identified clusters.

artificial intelligence, deep learning, machine learning, (19 more...)

2506.06192

Country:

Europe (0.28)
South America > Brazil (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area (0.94)
Health & Medicine > Health Care Providers & Services (0.89)
Health & Medicine > Health Care Technology > Medical Record (0.71)
Health & Medicine > Diagnostic Medicine (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)