Clustering
SAModified: A Foundation Model-Based Zero-Shot Approach for Refining Noisy Land-Use Land-Cover Maps
Pekhale, Sparsh, Sathish, Rakshith, Basavaraju, Sathisha, Sharma, Divya
Land-use and land cover (LULC) analysis is critical in remote sensing, with wide-ranging applications across diverse fields such as agriculture, utilities, and urban planning. However, automating LULC map generation using machine learning is rendered challenging due to noisy labels. Typically, the ground truths (e.g. ESRI LULC, MapBioMass) have noisy labels that hamper the model's ability to learn to accurately classify the pixels. Further, these erroneous labels can significantly distort the performance metrics of a model, leading to misleading evaluations. Traditionally, the ambiguous labels are rectified using unsupervised algorithms. These algorithms struggle not only with scalability but also with generalization across different geographies. To overcome these challenges, we propose a zero-shot approach using the foundation model, Segment Anything Model (SAM), to automatically delineate different land parcels/regions and leverage them to relabel the unsure pixels by using the local label statistics within each detected region. We achieve a significant reduction in label noise and an improvement in the performance of the downstream segmentation model by $\approx 5\%$ when trained with denoised labels.
Knowledge Graphs: The Future of Data Integration and Insightful Discovery
Mohamed, Saher, Farah, Kirollos, Lotfy, Abdelrahman, Rizk, Kareem, Saeed, Abdelrahman, Mohamed, Shahenda, Khouriba, Ghada, Arafa, Tamer
Knowledge graphs are an efficient method for representing and connecting information across various concepts, useful in reasoning, question answering, and knowledge base completion tasks. They organize data by linking points, enabling researchers to combine diverse information sources into a single database. This interdisciplinary approach helps uncover new research questions and ideas. Knowledge graphs create a web of data points (nodes) and their connections (edges), which enhances navigation, comprehension, and utilization of data for multiple purposes. They capture complex relationships inherent in unstructured data sources, offering a semantic framework for diverse entities and their attributes. Strategies for developing knowledge graphs include using seed data, named entity recognition, and relationship extraction. These graphs enhance chatbot accuracy and include multimedia data for richer information. Creating high-quality knowledge graphs involves both automated methods and human oversight, essential for accurate and comprehensive data representation.
One Node One Model: Featuring the Missing-Half for Graph Clustering
Xie, Xuanting, Li, Bingheng, Pan, Erlin, Guo, Zhaochen, Kang, Zhao, Chen, Wenyu
Most existing graph clustering methods primarily focus on exploiting topological structure, often neglecting the ``missing-half" node feature information, especially how these features can enhance clustering performance. This issue is further compounded by the challenges associated with high-dimensional features. Feature selection in graph clustering is particularly difficult because it requires simultaneously discovering clusters and identifying the relevant features for these clusters. To address this gap, we introduce a novel paradigm called ``one node one model", which builds an exclusive model for each node and defines the node label as a combination of predictions for node groups. Specifically, the proposed ``Feature Personalized Graph Clustering (FPGC)" method identifies cluster-relevant features for each node using a squeeze-and-excitation block, integrating these features into each model to form the final representations. Additionally, the concept of feature cross is developed as a data augmentation technique to learn low-order feature interactions. Extensive experimental results demonstrate that FPGC outperforms state-of-the-art clustering methods. Moreover, the plug-and-play nature of our method provides a versatile solution to enhance GNN-based models from a feature perspective.
Application of machine learning in grain-related clustering of Laue spots in a polycrystalline energy dispersive Laue pattern
Tosson, Amir, Shokr, Mohammad, Humaidi, Mahmoud Al, Mikayelyan, Eduard, Gutt, Christian, Pietsch, Ulrich
We address the identification of grain-corresponding Laue reflections in energy dispersive Laue diffraction (EDLD) experiments by formulating it as a clustering problem solvable through unsupervised machine learning (ML). To achieve reliable and efficient identification of grains in a Laue pattern, we employ a combination of clustering algorithms, namely hierarchical clustering (HC) and K-means. These algorithms allow us to group together similar Laue reflections, revealing the underlying grain structure in the diffraction pattern. Additionally, we utilise the elbow method to determine the optimal number of clusters, ensuring accurate results. To evaluate the performance of our proposed method, we conducted experiments using both simulated and experimental datasets obtained from nickel wires. The simulated datasets were generated to mimic the characteristics of real-world EDLD experiments, while the experimental datasets were obtained from actual measurements.
Cost-Effective Label-free Node Classification with LLMs
Zhang, Taiyan, Yang, Renchi, Yan, Mingyu, Ye, Xiaochun, Fan, Dongrui, Lai, Yurui
Graph neural networks (GNNs) have emerged as go-to models for node classification in graph data due to their powerful abilities in fusing graph structures and attributes. However, such models strongly rely on adequate high-quality labeled data for training, which are expensive to acquire in practice. With the advent of large language models (LLMs), a promising way is to leverage their superb zero-shot capabilities and massive knowledge for node labeling. Despite promising results reported, this methodology either demands considerable queries to LLMs, or suffers from compromised performance caused by noisy labels produced by LLMs. To remedy these issues, this work presents Cella, an active self-training framework that integrates LLMs into GNNs in a cost-effective manner. The design recipe of Cella is to iteratively identify small sets of "critical" samples using GNNs and extract informative pseudo-labels for them with both LLMs and GNNs as additional supervision signals to enhance model training. Particularly, Cella includes three major components: (i) an effective active node selection strategy for initial annotations; (ii) a judicious sample selection scheme to sift out the "critical" nodes based on label disharmonicity and entropy; and (iii) a label refinement module combining LLMs and GNNs with rewired topology. Our extensive experiments over five benchmark text-attributed graph datasets demonstrate that Cella significantly outperforms the state of the arts under the same query budget to LLMs in terms of label-free node classification. In particular, on the DBLP dataset with 14.3k nodes, Cella is able to achieve an 8.08% conspicuous improvement in accuracy over the state-of-the-art at a cost of less than one cent.
LITA: An Efficient LLM-assisted Iterative Topic Augmentation Framework
Chang, Chia-Hsuan, Tsai, Jui-Tse, Tsai, Yi-Hang, Hwang, San-Yih
Topic modeling is widely used for uncovering thematic structures within text corpora, yet traditional models often struggle with specificity and coherence in domain-focused applications. Guided approaches, such as SeededLDA and CorEx, incorporate user-provided seed words to improve relevance but remain labor-intensive and static. Large language models (LLMs) offer potential for dynamic topic refinement and discovery, yet their application often incurs high API costs. To address these challenges, we propose the LLM-assisted Iterative Topic Augmentation framework (LITA), an LLM-assisted approach that integrates user-provided seeds with embedding-based clustering and iterative refinement. LITA identifies a small number of ambiguous documents and employs an LLM to reassign them to existing or new topics, minimizing API costs while enhancing topic quality. Experiments on two datasets across topic quality and clustering performance metrics demonstrate that LITA outperforms five baseline models, including LDA, SeededLDA, CorEx, BERTopic, and PromptTopic. Our work offers an efficient and adaptable framework for advancing topic modeling and text clustering.
An Incremental Clustering Baseline for Event Detection on Twitter
Ray, Marjolaine, Wang, Qi, Mélanie-Becquet, Frédérique, Poibeau, Thierry, Mazoyer, Béatrice
Event detection in text streams is a crucial task for the analysis of online media and social networks. One of the current challenges in this field is establishing a performance standard while maintaining an acceptable level of computational complexity. In our study, we use an incremental clustering algorithm combined with recent advancements in sentence embeddings. Our objective is to compare our findings with previous studies, specifically those by Cao et al. (2024) and Mazoyer et al. (2020). Our results demonstrate significant improvements and could serve as a relevant baseline for future research in this area.
Multiplex Dirichlet stochastic block model for clustering multidimensional compositional networks
Promskaia, Iuliia, O'Hagan, Adrian, Fop, Michael
Network data often represent multiple types of relations, which can also denote exchanged quantities, and are typically encompassed in a weighted multiplex. Such data frequently exhibit clustering structures, however, traditional clustering methods are not well-suited for multiplex networks. Additionally, standard methods treat edge weights in their raw form, potentially biasing clustering towards a node's total weight capacity rather than reflecting cluster-related interaction patterns. To address this, we propose transforming edge weights into a compositional format, enabling the analysis of connection strengths in relative terms and removing the impact of nodes' total weights. We introduce a multiplex Dirichlet stochastic block model designed for multiplex networks with compositional layers. This model accounts for sparse compositional networks and enables joint clustering across different types of interactions. We validate the model through a simulation study and apply it to the international export data from the Food and Agriculture Organization of the United Nations.
Deep Spectral Clustering via Joint Spectral Embedding and Kmeans
Spectral clustering is a popular clustering method. It first maps data into the spectral embedding space and then uses Kmeans to find clusters. However, the two decoupled steps prohibit joint optimization for the optimal solution. In addition, it needs to construct the similarity graph for samples, which suffers from the curse of dimensionality when the data are high-dimensional. To address these two challenges, we introduce \textbf{D}eep \textbf{S}pectral \textbf{C}lustering (\textbf{DSC}), which consists of two main modules: the spectral embedding module and the greedy Kmeans module. The former module learns to efficiently embed raw samples into the spectral embedding space using deep neural networks and power iteration. The latter module improves the cluster structures of Kmeans on the learned spectral embeddings by a greedy optimization strategy, which iteratively reveals the direction of the worst cluster structures and optimizes embeddings in this direction. To jointly optimize spectral embeddings and clustering, we seamlessly integrate the two modules and optimize them in an end-to-end manner. Experimental results on seven real-world datasets demonstrate that DSC achieves state-of-the-art clustering performance.
Beyond Discrete Personas: Personality Modeling Through Journal Intensive Conversations
Pal, Sayantan, Das, Souvik, Srihari, Rohini K.
Large Language Models (LLMs) have significantly improved personalized conversational capabilities. However, existing datasets like Persona Chat, Synthetic Persona Chat, and Blended Skill Talk rely on static, predefined personas. This approach often results in dialogues that fail to capture human personalities' fluid and evolving nature. To overcome these limitations, we introduce a novel dataset with around 400,000 dialogues and a framework for generating personalized conversations using long-form journal entries from Reddit. Our approach clusters journal entries for each author and filters them by selecting the most representative cluster, ensuring that the retained entries best reflect the author's personality. We further refine the data by capturing the Big Five personality traits --openness, conscientiousness, extraversion, agreeableness, and neuroticism --ensuring that dialogues authentically reflect an individual's personality. Using Llama 3 70B, we generate high-quality, personality-rich dialogues grounded in these journal entries. Fine-tuning models on this dataset leads to an 11% improvement in capturing personality traits on average, outperforming existing approaches in generating more coherent and personality-driven dialogues.