AITopics

Phase-change materials (PCMs) such as Ge-Sb-Te alloys are widely used in non-volatile memory applications due to their rapid and reversible switching between amorphous and crystalline states. However, their functional properties are strongly governed by nanoscale variations in composition and structure, which are challenging to resolve using conventional techniques. Here, we apply unsupervised machine learning to 4-dimensional scanning transmission electron microscopy (4D-STEM) data to identify compositional and structural heterogeneity in Ge-Sb-Te. After preprocessing and dimensionality reduction with principal component analysis (PCA), cluster validation was performed with t-SNE and UMAP, followed by k-means clustering optimized through silhouette scoring. Four distinct clusters were identified and mapped back to the diffraction data. Elemental intensity histograms revealed that chemical signatures change across clusters: oxygen and germanium enrichment in Cluster 1, tellurium in Cluster 2, antimony in Cluster 3, and germanium again in Cluster 4. Furthermore, averaged diffraction patterns from these clusters confirmed structural variations. Together, these findings demonstrate that clustering analysis can provide a powerful framework for correlating local chemical and structural features in PCMs, offering deeper insights into their intrinsic heterogeneity.
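
The "k-means clustering optimized through silhouette scoring" step can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy 2-D points stand in for PCA-reduced 4D-STEM features, and the farthest-point initialization, the helper names (`kmeans`, `silhouette`, `choose_k`), and the candidate range of k are all assumptions made for illustration.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def init_centers(points, k):
    """Deterministic farthest-point initialization (an assumption, not from the paper)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=20):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centers = init_centers(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue
        a = sum(dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist(p, points[j]) for j in idx) / len(idx)
            for lab, idx in clusters.items() if lab != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def choose_k(points, candidates):
    """Score each candidate k and return the one with the highest silhouette."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores

# Toy stand-in for PCA-reduced features: four tight, well-separated groups.
offsets = [(0, 0), (0.5, 0.5), (0.5, -0.5), (-0.5, 0.5), (-0.5, -0.5)]
blob_centers = [(0, 0), (10, 0), (0, 10), (10, 10)]
points = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]

best_k, scores = choose_k(points, range(2, 7))
print(best_k)
```

On real data the resulting labels would then be reshaped onto the scan grid to map the clusters back onto the diffraction data, as the abstract describes.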

artificial intelligence, diffraction pattern, machine learning, (16 more...)

2509.00943

Country: Asia > Middle East > Israel (0.04)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Bastola, Deepak, Choi, Woohyeok

Hybrid Topic-Semantic Labeling and Graph Embeddings for Unsupervised Legal Document Clustering

arXiv.org Machine Learning Sep-3-2025

Legal documents pose unique challenges for text classification due to their domain-specific language and often limited labeled data. This paper proposes a hybrid approach for classifying legal texts by combining unsupervised topic and graph embeddings with a supervised model. We employ Top2Vec to learn semantic document embeddings and automatically discover latent topics, and Node2Vec to capture structural relationships via a bipartite graph of legal documents. The embeddings are combined and clustered using KMeans, yielding coherent groupings of documents. Our computations on a legal document dataset demonstrate that the combined Top2Vec+Node2Vec approach improves clustering quality over text-only or graph-only embeddings. We conduct a sensitivity analysis of hyperparameters, such as the number of clusters and the dimensionality of the embeddings, and demonstrate that our method achieves competitive performance against baseline Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF) models. Key findings indicate that while the pipeline presents an innovative approach to unsupervised legal document analysis by combining semantic topic modeling with graph embedding techniques, its efficacy is contingent upon the quality of initial topic generation and the representational power of the chosen embedding models for specialized legal language. Strategic recommendations include the exploration of domain-specific embeddings, more comprehensive hyperparameter tuning for Node2Vec, dynamic determination of cluster numbers, and robust human-in-the-loop validation processes to enhance legal relevance and trustworthiness. The pipeline demonstrates potential for exploratory legal data analysis and as a precursor to supervised learning tasks but requires further refinement and domain-specific adaptation for practical legal applications.
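
The abstract says the Top2Vec and Node2Vec embeddings "are combined" before KMeans but does not say how. One plausible reading, sketched here purely as an assumption, is to L2-normalize each embedding and concatenate the two per document so that neither block dominates the Euclidean distances KMeans relies on. The function names and toy vectors are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def combine_embeddings(text_vecs, graph_vecs):
    """Concatenate normalized text and graph embeddings document by document.

    Concatenation is an assumed combination rule, not the paper's stated one.
    """
    if len(text_vecs) != len(graph_vecs):
        raise ValueError("need one text and one graph embedding per document")
    return [l2_normalize(t) + l2_normalize(g) for t, g in zip(text_vecs, graph_vecs)]

# Hypothetical toy embeddings for two documents.
text_vecs = [[3.0, 4.0], [1.0, 0.0]]
graph_vecs = [[0.0, 2.0, 0.0], [1.0, 1.0, 1.0]]
combined = combine_embeddings(text_vecs, graph_vecs)
print(len(combined), len(combined[0]))
```

The combined vectors could then be passed to any KMeans implementation; a weighted concatenation would be a natural variant if one embedding should count for more.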

artificial intelligence, machine learning, natural language, (16 more...)

2509.0099

Country:

North America > United States > California (0.04)
Europe > Italy > Tuscany > Florence (0.04)
Asia > Middle East > Jordan (0.04)
Asia > India (0.04)

Genre:

Research Report (1.00)
Overview (0.88)

Industry: Law (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

Dereure, Erwan, Mfoumou, Emmanuel Akame, Holcman, David

Assessing One-Dimensional Cluster Stability by Extreme-Point Trimming

arXiv.org Machine Learning Sep-3-2025

The automated identification of clusters or isolated points is a fundamental step in many classification and spatial analysis pipelines [1, 2, 3] to identify structures in unlabeled data. Clustering typically begins by assigning labels to data points, indicating their membership to one or more groups. However, the strategies used to define these groups can vary significantly across clustering methods, depending on the underlying assumptions about data structure, density, or similarity. Clustering and classification algorithms can be broadly categorized into partitioning-based, hierarchical, and density-based methods. Partitioning methods, such as K-means [4, 5], Spectral Clustering [6], and Support Vector Machines (SVMs) [7], divide the data into distinct groups by optimizing specific criteria. K-means partitions data into a fixed number of spherical clusters by minimizing within-cluster variance. Spectral Clustering extends partitioning by leveraging the eigenstructure of similarity graphs to identify clusters with complex, non-convex shapes through an embedding step followed by a partitioning algorithm. Similarly, SVMs perform classification by implicitly mapping data into higher-dimensional feature spaces using the kernel trick, effectively partitioning data through linear separation in that transformed space.
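
For reference, the "minimizing within-cluster variance" criterion mentioned for K-means is usually written as the following objective. The symbols $K$, $C_k$, $x_i$ and $\mu_k$ are notation introduced here, not taken from the paper:

$$
\min_{C_1,\dots,C_K} \; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i .
$$

Each cluster $C_k$ is summarized by its centroid $\mu_k$, which is why the method favors compact, roughly spherical groups.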

artificial intelligence, machine learning, statistics, (15 more...)

2509.00258

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
North America > United States > California (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Predicting Multi-Type Talented Students in Secondary School Using Semi-Supervised Machine Learning

Zheng, Xinzhe, Yang, Zhen-Qun, Cao, Jiannong, Cheng, Jiabei

Talent identification plays a critical role in promoting student development. However, traditional approaches often rely on manual processes or focus narrowly on academic achievement, and typically delay intervention until the higher education stage. This overlooks diverse non-academic talents and misses opportunities for early intervention. To address this gap, this study introduces TalentPredictor, a novel semi-supervised multi-modal neural network that combines Transformer, LSTM, and ANN architectures. This model is designed to predict seven different talent types--academic, sport, art, leadership, service, technology, and others--in secondary school students within an offline educational setting. Drawing on existing offline educational data from 1,041 local secondary students, TalentPredictor overcomes the limitations of traditional talent identification methods. By clustering various award records into talent categories and extracting features from students' diverse learning behaviors, it achieves high prediction accuracy (0.908 classification accuracy, 0.908 ROCAUC). This demonstrates the potential of machine learning to identify diverse talents early in student development. Talent is a critical component in human society. It is indispensable to the development of societies and the competitiveness of countries. Last but not least, talent is always in high demand. Thus, nurturing talent is a top priority everywhere, and talent identification is its foundation, since nurturing talent requires a target individual. Traditional talent identification aims to give students tests that exceed their current level. For example, grade eight students may be given college admissions tests, with the result of the tough test used as a talent score.

artificial intelligence, machine learning, student, (16 more...)

2509.00863

Country: Asia > China (0.14)

Genre: Research Report (1.00)

Industry:

Education > Educational Setting > K-12 Education > Secondary School (1.00)
Education > Educational Setting > Higher Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Brady, Nathan, Tennyson, David, Vandermeulen, Thomas

Machine Learning the 6d Supergravity Landscape

In this paper, we apply both supervised and unsupervised machine learning algorithms to the study of the string landscape and swampland in 6 dimensions. Our data are the (almost) anomaly-free 6-dimensional $\mathcal{N} = (1,0)$ supergravity models, characterised by the Gram matrix of anomaly coefficients. Our work demonstrates the ability of machine learning algorithms to efficiently learn highly complex features of the landscape and swampland. Employing an autoencoder for unsupervised learning, we provide an auto-classification of these models by compressing the Gram matrix data to 2 dimensions. Through compression, similar models cluster together, and we identify prominent features of these clusters. The autoencoder also identifies outlier models which are difficult to reconstruct. One of these outliers proves to be incredibly difficult to combine with other models such that the $\text{tr}R^{4}$ anomaly vanishes, making its presence in the landscape extremely rare. Further, we utilise supervised learning to build two classifiers predicting (1) model consistency under probe string insertion (precision: 0.78, predicting consistency for 214,837 models with reasonable certainty) and (2) inconsistency under anomaly inflow (precision: 0.91, predicting inconsistency for 1,909,359 models). Notably, projecting these predictions onto the autoencoder's 2-dimensional latent layer shows consistent models clustering together, further indicating that the autoencoder has learnt interesting and complex features of the set of models and potentially offers a novel approach to mapping the landscape and swampland of 6-dimensional supergravity theories.

artificial intelligence, autoencoder, machine learning, (18 more...)

2505.16131

Country: North America > United States > Texas (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Zuziak, Maciej Krzysztof, Pellungrini, Roberto, Rinzivillo, Salvatore

One-Shot Clustering for Federated Learning Under Clustering-Agnostic Assumption

Federated Learning (FL) is a widespread and well-adopted paradigm of decentralised learning that allows training one model from multiple sources without the need to transfer data between participating clients directly. Since its inception in 2015, it has been divided into numerous subfields that deal with application-specific issues, such as data heterogeneity or resource allocation. One such sub-field, Clustered Federated Learning (CFL), deals with the problem of clustering the population of clients into separate cohorts to deliver personalised models. Although a few remarkable works have been published in this domain, the problem remains largely unexplored, as its basic assumptions and settings differ slightly from those of standard FL. In this work, we present One-Shot Clustered Federated Learning (OCFL), a clustering-agnostic algorithm that can automatically detect the earliest suitable moment for clustering. Our algorithm is based on computing the cosine distance between the gradients of the clients and a temperature measure that detects when the federated model starts to converge. We empirically evaluate our methodology by testing various one-shot clustering algorithms for over forty different tasks on five benchmark datasets. Our experiments showcase the good performance of our approach when used to perform CFL in an automated manner without the need to adjust hyperparameters. We also revisit the practical feasibility of CFL algorithms based on the gradients of the clients, providing firm evidence of the high efficiency of density-based clustering methods when used to differentiate between the loss surfaces of neural networks trained on different distributions. Moreover, by inspecting the feasibility of local explanations generated with the help of GradCAM, we can provide more insights into the relationship between personalisation and the explainability of local predictions.
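
The cosine distance between client gradients that the algorithm builds on can be sketched as follows. This is only an illustration: the flattened toy gradients and the function names are hypothetical, and the temperature measure that decides when to cluster is not specified in the abstract, so it is left out.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)

def pairwise_cosine_distance(gradients):
    """Build the symmetric client-by-client distance matrix."""
    n = len(gradients)
    return [[cosine_distance(gradients[i], gradients[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical flattened gradients from four clients: clients 0 and 1 point in
# nearly the same direction, clients 2 and 3 point the opposite way.
client_gradients = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.1],
    [-1.0, -2.0, -3.0],
    [-2.1, -4.0, -6.0],
]
D = pairwise_cosine_distance(client_gradients)
for row in D:
    print([round(d, 3) for d in row])
```

A matrix like `D` is the kind of input a density-based or hierarchical clustering routine could take to split clients into cohorts; values near 0 indicate aligned updates and values near 2 indicate opposing ones.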

algorithm, artificial intelligence, machine learning, (16 more...)

2509.01587

Country:

Europe (0.28)
North America (0.28)

Genre: Research Report > Experimental Study (0.46)

Industry: Information Technology > Security & Privacy (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Prediction, Generation of WWTPs microbiome community structures and Clustering of WWTPs various feature attributes using DE-BP model, SiTime-GAN model and DPNG-EPMC ensemble clustering algorithm with modulation of microbial ecosystem health

Dai, Mingzhi, Cai, Weiwei, Feng, Xiang, Yu, Huiqun, Guo, Weibin, Guo, Miao

Microbiomes not only underpin Earth's biogeochemical cycles but also play crucial roles in both engineered and natural ecosystems, such as soil, wastewater treatment, and the human gut. However, microbiome engineering faces significant obstacles that must be overcome to deliver the desired improvements in microbiome control. Here, we use the backpropagation neural network (BPNN), optimized through differential evolution (DE-BP), to predict the microbial composition of activated sludge (AS) systems collected from wastewater treatment plants (WWTPs) located worldwide. Furthermore, we introduce a novel clustering algorithm termed Directional Position Nonlinear Emotional Preference Migration Behavior Clustering (DPNG-EPMC). This method is applied to conduct a clustering analysis of WWTPs across various feature attributes. Finally, we employ the Similar Time Generative Adversarial Networks (SiTime-GAN) to synthesize novel microbial compositions and feature attributes data. As a result, we demonstrate that the DE-BP model can provide superior predictions of the microbial composition. Additionally, we show that the DPNG-EPMC can be applied to the analysis of WWTPs under various feature attributes. Finally, we demonstrate that the SiTime-GAN model can generate valuable incremental synthetic data. Our results, obtained through predicting the microbial community and conducting analysis of WWTPs under various feature attributes, develop an understanding of the factors influencing AS communities.

de-bp model, evolutionary algorithm, machine learning, (18 more...)

2509.01526

Country: Asia > China (0.28)

Genre: Research Report (0.70)

Industry:

Water & Waste Management > Water Management > Water Supplies & Services (1.00)
Water & Waste Management > Water Management > Lifecycle > Treatment (1.00)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)

Lim, Matte, Yeh, Catherine, Wattenberg, Martin, Viégas, Fernanda, Michalatos, Panagiotis

Chronotome: Real-Time Topic Modeling for Streaming Embedding Spaces

Harvard University. Figure 1: To visualize how topics evolve in real time, we create a rotatable embedding space where time is encoded along the Z-axis. We provide three preset views to help users explore topic clusters from different perspectives: (A) Front View (overall clusters), (B) Iso View (clusters over time), and (C) Side View (clusters over time). Here, each point represents an image from a dataset of Picasso's paintings, batched into 5-year intervals. Many real-world datasets - from an artist's body of work to a person's social media history - exhibit meaningful semantic changes over time that are difficult to capture with existing dimensionality reduction methods. To address this gap, we introduce a visualization technique that combines force-based projection and streaming clustering methods to build a spatial-temporal map of embeddings. We demonstrate the utility of our approach through use cases on text and image data, showing how it offers a new lens for understanding the aesthetics and semantics of temporal datasets.

artificial intelligence, machine learning, natural language, (17 more...)

2509.01051

Country: North America > United States (0.28)

Genre: Research Report (0.40)

Industry:

Leisure & Entertainment (0.69)
Media > Film (0.47)
Information Technology (0.46)
Health & Medicine > Therapeutic Area > Immunology (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Unsupervised Dataset Cleaning Framework for Encrypted Traffic Classification

Qiu, Kun, Wang, Ying, Li, Baoqian, Zhu, Wenjun

Traffic classification, a technique for assigning network flows to predefined categories, has been widely deployed in enterprise and carrier networks. With the massive adoption of mobile devices, encryption is increasingly used in mobile applications to address privacy concerns. Consequently, traditional methods such as Deep Packet Inspection (DPI) fail to distinguish encrypted traffic. To tackle this challenge, Artificial Intelligence (AI), in particular Machine Learning (ML), has emerged as a promising solution for encrypted traffic classification. A crucial prerequisite for any ML-based approach is traffic data cleaning, which removes flows that are not useful for training (e.g., irrelevant protocols, background activity, control-plane messages, and long-lived sessions). Existing cleaning solutions depend on manual inspection of every captured packet, making the process both costly and time-consuming. In this poster, we present an unsupervised framework that automatically cleans encrypted mobile traffic. Evaluation on real-world datasets shows that our framework incurs only a 2%~2.5% reduction in classification accuracy compared with manual cleaning. These results demonstrate that our method offers an efficient and effective preprocessing step for ML-based encrypted traffic classification.

artificial intelligence, machine learning, traffic, (13 more...)

2509.00701

Country: Asia > China (0.16)

Genre: Research Report > New Finding (0.36)

Industry: Information Technology > Security & Privacy (0.35)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Advanced spectral clustering for heterogeneous data in credit risk monitoring systems

Han, Lu, Li, Mengyan, Qiang, Jiping, Su, Zhi

Heterogeneous data, which encompass both numerical financial variables and textual records, present substantial challenges for credit monitoring. To address this issue, we propose Advanced Spectral Clustering (ASC), a method that integrates financial and textual similarities through an optimized weight parameter and selects eigenvectors using a novel eigenvalue-silhouette optimization approach. Evaluated on a dataset comprising 1,428 small and medium-sized enterprises (SMEs), ASC achieves a Silhouette score that is 18% higher than that of a single-type data baseline method. Furthermore, the resulting clusters offer actionable insights; for instance, 51% of low-risk firms are found to include the term 'social recruitment' in their textual records. The robustness of ASC is confirmed across multiple clustering algorithms, including k-means, k-medians, and k-medoids, with ΔIntra/Inter < 0.13 and ΔSilhouette Coefficient < 0.02. By bridging spectral clustering theory with heterogeneous data applications, ASC enables the identification of meaningful clusters, such as recruitment-focused SMEs exhibiting a 30% lower default risk, thereby supporting more targeted and effective credit interventions.
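
The idea of integrating financial and textual similarities "through an optimized weight parameter" can be sketched as below. The abstract does not give the combination rule, so the convex combination, the weight name `w`, and the toy matrices are assumptions; the paper's weight optimization and eigenvalue-silhouette eigenvector selection are not reproduced.

```python
def combine_similarities(fin_sim, text_sim, w):
    """Blend two n-by-n similarity matrices as w * fin_sim + (1 - w) * text_sim.

    The convex combination is an assumed form, not the paper's stated formula.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    n = len(fin_sim)
    if len(text_sim) != n:
        raise ValueError("similarity matrices must have the same size")
    return [[w * fin_sim[i][j] + (1.0 - w) * text_sim[i][j] for j in range(n)]
            for i in range(n)]

# Hypothetical similarities between two firms from each data type.
fin_sim = [[1.0, 0.2], [0.2, 1.0]]
text_sim = [[1.0, 0.8], [0.8, 1.0]]
blended = combine_similarities(fin_sim, text_sim, w=0.5)
print(blended)
```

In a spectral clustering pipeline, a blended matrix of this kind would feed the graph Laplacian whose leading eigenvectors are then clustered; `w` would be tuned against a quality score such as the silhouette.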

data mining, machine learning, spectral, (20 more...)

2509.00546

Country: Asia > China (0.15)

Genre: Research Report > New Finding (0.68)

Industry:

Banking & Finance > Credit (1.00)
Banking & Finance > Risk Management (0.70)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)