AITopics

2412.12552

Country:

South America > Brazil (0.05)
Asia > India > Karnataka > Bengaluru (0.04)

Genre: Research Report (0.50)

Industry: Law > Real Estate Law (0.63)

Technology:

Information Technology > Data Science > Data Mining (0.95)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

arXiv.org Artificial Intelligence Dec-17-2024

Knowledge Graphs: The Future of Data Integration and Insightful Discovery

Mohamed, Saher, Farah, Kirollos, Lotfy, Abdelrahman, Rizk, Kareem, Saeed, Abdelrahman, Mohamed, Shahenda, Khouriba, Ghada, Arafa, Tamer

Knowledge graphs are an efficient method for representing and connecting information across various concepts, useful in reasoning, question answering, and knowledge base completion tasks. They organize data by linking points, enabling researchers to combine diverse information sources into a single database. This interdisciplinary approach helps uncover new research questions and ideas. Knowledge graphs create a web of data points (nodes) and their connections (edges), which enhances navigation, comprehension, and utilization of data for multiple purposes. They capture complex relationships inherent in unstructured data sources, offering a semantic framework for diverse entities and their attributes. Strategies for developing knowledge graphs include using seed data, named entity recognition, and relationship extraction. These graphs enhance chatbot accuracy and include multimedia data for richer information. Creating high-quality knowledge graphs involves both automated methods and human oversight, essential for accurate and comprehensive data representation.
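As a rough illustration of the node-and-edge structure the abstract describes, the sketch below stores facts as (subject, relation, object) triples and follows two edges in sequence. The class name `KnowledgeGraph`, the helper `two_hop`, and the relation labels are hypothetical and chosen only for this example; they are not taken from the paper.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Tiny in-memory graph of (subject, relation, object) triples (illustrative only)."""

    def __init__(self):
        # subject -> list of (relation, object) edges leaving that node
        self._out = defaultdict(list)

    def add(self, subject, relation, obj):
        self._out[subject].append((relation, obj))

    def neighbors(self, subject, relation=None):
        """Objects reachable from `subject`, optionally filtered by relation label."""
        return [o for r, o in self._out.get(subject, []) if relation is None or r == relation]


def two_hop(graph, start, first, second):
    """Follow relation `first` from `start`, then relation `second` from each result."""
    return [end for mid in graph.neighbors(start, first) for end in graph.neighbors(mid, second)]


kg = KnowledgeGraph()
kg.add("Marie Curie", "born_in", "Warsaw")
kg.add("Marie Curie", "field", "Physics")
kg.add("Warsaw", "capital_of", "Poland")

birth_country = two_hop(kg, "Marie Curie", "born_in", "capital_of")
```

Chaining edges this way is what lets a graph answer a question that no single stored fact answers on its own.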

data integration, graph, knowledge graph, (13 more...)

2502.15689

Country:

North America > United States > New York > New York County > New York City (0.04)
Asia > Thailand > Chiang Mai > Chiang Mai (0.04)
Africa > Middle East > Egypt > Giza Governorate > Giza (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology (1.00)
Transportation > Ground > Road (0.93)
Education (0.92)
(2 more...)

Technology:

Information Technology > Communications > Web > Semantic Web (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (1.00)
(5 more...)

arXiv.org Artificial Intelligence Dec-17-2024

One Node One Model: Featuring the Missing-Half for Graph Clustering

Xie, Xuanting, Li, Bingheng, Pan, Erlin, Guo, Zhaochen, Kang, Zhao, Chen, Wenyu

Most existing graph clustering methods primarily focus on exploiting topological structure, often neglecting the "missing-half" node feature information, especially how these features can enhance clustering performance. This issue is further compounded by the challenges associated with high-dimensional features. Feature selection in graph clustering is particularly difficult because it requires simultaneously discovering clusters and identifying the relevant features for these clusters. To address this gap, we introduce a novel paradigm called "one node one model", which builds an exclusive model for each node and defines the node label as a combination of predictions for node groups. Specifically, the proposed "Feature Personalized Graph Clustering (FPGC)" method identifies cluster-relevant features for each node using a squeeze-and-excitation block, integrating these features into each model to form the final representations. Additionally, the concept of feature cross is developed as a data augmentation technique to learn low-order feature interactions. Extensive experimental results demonstrate that FPGC outperforms state-of-the-art clustering methods. Moreover, the plug-and-play nature of our method provides a versatile solution to enhance GNN-based models from a feature perspective.

artificial intelligence, information, machine learning, (15 more...)

2412.09902

Country:

Asia > China (0.28)
North America > United States (0.28)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Tosson, Amir, Shokr, Mohammad, Humaidi, Mahmoud Al, Mikayelyan, Eduard, Gutt, Christian, Pietsch, Ulrich

Application of machine learning in grain-related clustering of Laue spots in a polycrystalline energy dispersive Laue pattern

We address the identification of grain-corresponding Laue reflections in energy dispersive Laue diffraction (EDLD) experiments by formulating it as a clustering problem solvable through unsupervised machine learning (ML). To achieve reliable and efficient identification of grains in a Laue pattern, we employ a combination of clustering algorithms, namely hierarchical clustering (HC) and K-means. These algorithms allow us to group together similar Laue reflections, revealing the underlying grain structure in the diffraction pattern. Additionally, we utilise the elbow method to determine the optimal number of clusters, ensuring accurate results. To evaluate the performance of our proposed method, we conducted experiments using both simulated and experimental datasets obtained from nickel wires. The simulated datasets were generated to mimic the characteristics of real-world EDLD experiments, while the experimental datasets were obtained from actual measurements.
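The K-means and elbow-method steps named in the abstract can be sketched in plain Python. This is a generic textbook version under stated assumptions, not the authors' pipeline: the hierarchical clustering stage is omitted, the farthest-point initialisation and the "largest second difference of inertia" elbow rule are choices made for this example, and the 2-D points are synthetic stand-ins for Laue-spot features.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def init_centers(points, k):
    """Deterministic farthest-point initialisation (an assumption of this sketch)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iters=50):
    """Lloyd's algorithm; returns the final centers and the within-cluster sum of squares."""
    centers = init_centers(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; keep the old center if a group is empty.
        updated = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if updated == centers:
            break
        centers = updated
    inertia = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, inertia


def elbow(points, k_max):
    """Pick the k whose inertia curve bends most sharply (largest second difference)."""
    inertias = {k: kmeans(points, k)[1] for k in range(1, k_max + 1)}
    best = max(range(2, k_max), key=lambda k: inertias[k - 1] - 2 * inertias[k] + inertias[k + 1])
    return best, inertias


# Three synthetic groups of four points each, centred near (0, 0), (10, 0) and (5, 9).
points = [
    (0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5),
    (10.5, 0.0), (9.5, 0.0), (10.0, 0.5), (10.0, -0.5),
    (5.5, 9.0), (4.5, 9.0), (5.0, 9.5), (5.0, 8.5),
]
best_k, inertias = elbow(points, 5)  # best_k is 3 for this data
```

The inertia drops steeply until k reaches the true number of groups and then flattens, which is the bend the elbow rule looks for.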

artificial intelligence, machine learning, reflection, (17 more...)

2412.12224

Country:

North America > United States > Tennessee > Anderson County > Oak Ridge (0.04)
North America > United States > New York (0.04)
Europe > Germany > North Rhine-Westphalia > Arnsberg Region > Siegen (0.04)
Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)

Genre: Research Report > Promising Solution (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Cost-Effective Label-free Node Classification with LLMs

Zhang, Taiyan, Yang, Renchi, Yan, Mingyu, Ye, Xiaochun, Fan, Dongrui, Lai, Yurui

Graph neural networks (GNNs) have emerged as go-to models for node classification in graph data due to their powerful abilities in fusing graph structures and attributes. However, such models strongly rely on adequate high-quality labeled data for training, which are expensive to acquire in practice. With the advent of large language models (LLMs), a promising way is to leverage their superb zero-shot capabilities and massive knowledge for node labeling. Despite the promising results reported, this methodology either demands considerable queries to LLMs, or suffers from compromised performance caused by noisy labels produced by LLMs. To remedy these issues, this work presents Cella, an active self-training framework that integrates LLMs into GNNs in a cost-effective manner. The design recipe of Cella is to iteratively identify small sets of "critical" samples using GNNs and extract informative pseudo-labels for them with both LLMs and GNNs as additional supervision signals to enhance model training. Particularly, Cella includes three major components: (i) an effective active node selection strategy for initial annotations; (ii) a judicious sample selection scheme to sift out the "critical" nodes based on label disharmonicity and entropy; and (iii) a label refinement module combining LLMs and GNNs with rewired topology. Our extensive experiments over five benchmark text-attributed graph datasets demonstrate that Cella significantly outperforms the state of the art under the same query budget to LLMs in terms of label-free node classification. In particular, on the DBLP dataset with 14.3k nodes, Cella is able to achieve a conspicuous 8.08% improvement in accuracy over the state of the art at a cost of less than one cent.
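The entropy part of the sample selection in component (ii) can be illustrated with a minimal sketch: rank nodes by the Shannon entropy of their predicted class distribution and keep the most uncertain ones within a query budget. This covers entropy only; the label disharmonicity term is specific to the paper and is not reproduced here. The function names and the toy predictions are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution; zero entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_uncertain(predictions, budget):
    """Return up to `budget` node ids, ordered from highest to lowest predictive entropy."""
    ranked = sorted(predictions, key=lambda node: entropy(predictions[node]), reverse=True)
    return ranked[:budget]


# Hypothetical softmax outputs of a GNN for four nodes and three classes.
predictions = {
    "n1": [0.98, 0.01, 0.01],
    "n2": [0.40, 0.35, 0.25],
    "n3": [0.60, 0.30, 0.10],
    "n4": [1.00, 0.00, 0.00],
}
critical = most_uncertain(predictions, budget=2)  # ["n2", "n3"]
```

Spending the limited LLM queries on the nodes the GNN is least sure about is one common way to keep the query budget small.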

large language model, machine learning, natural language, (18 more...)

2412.11983

Country:

North America > United States > District of Columbia > Washington (0.05)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Asia > China > Hong Kong (0.04)
(3 more...)

Genre: Research Report > New Finding (0.67)

Industry:

Information Technology (1.00)
Health & Medicine (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Chang, Chia-Hsuan, Tsai, Jui-Tse, Tsai, Yi-Hang, Hwang, San-Yih

LITA: An Efficient LLM-assisted Iterative Topic Augmentation Framework

Topic modeling is widely used for uncovering thematic structures within text corpora, yet traditional models often struggle with specificity and coherence in domain-focused applications. Guided approaches, such as SeededLDA and CorEx, incorporate user-provided seed words to improve relevance but remain labor-intensive and static. Large language models (LLMs) offer potential for dynamic topic refinement and discovery, yet their application often incurs high API costs. To address these challenges, we propose the LLM-assisted Iterative Topic Augmentation framework (LITA), which integrates user-provided seeds with embedding-based clustering and iterative refinement. LITA identifies a small number of ambiguous documents and employs an LLM to reassign them to existing or new topics, minimizing API costs while enhancing topic quality. Experiments on two datasets across topic quality and clustering performance metrics demonstrate that LITA outperforms five baseline models, including LDA, SeededLDA, CorEx, BERTopic, and PromptTopic. Our work offers an efficient and adaptable framework for advancing topic modeling and text clustering.

large language model, machine learning, natural language, (17 more...)

2412.12459

Country:

North America > United States > Connecticut > New Haven County > New Haven (0.04)
Asia > Taiwan > Takao Province > Kaohsiung (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.95)

Ray, Marjolaine, Wang, Qi, Mélanie-Becquet, Frédérique, Poibeau, Thierry, Mazoyer, Béatrice

An Incremental Clustering Baseline for Event Detection on Twitter

Event detection in text streams is a crucial task for the analysis of online media and social networks. One of the current challenges in this field is establishing a performance standard while maintaining an acceptable level of computational complexity. In our study, we use an incremental clustering algorithm combined with recent advancements in sentence embeddings. Our objective is to compare our findings with previous studies, specifically those by Cao et al. (2024) and Mazoyer et al. (2020). Our results demonstrate significant improvements and could serve as a relevant baseline for future research in this area.
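A common form of incremental clustering over embeddings is a single pass with a similarity threshold: each incoming vector joins the most similar existing cluster if the cosine similarity is high enough, and otherwise starts a new cluster. The sketch below shows that generic scheme; it is an assumption about the general technique, not the authors' exact algorithm. The threshold value and the 2-D vectors are made up, and real sentence embeddings would have hundreds of dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


class IncrementalClusterer:
    """Single-pass threshold clustering (illustrative; assumes non-zero input vectors)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.sums = []  # running sum of member vectors, one per cluster

    def add(self, vec):
        """Assign `vec` to a cluster and return that cluster's index."""
        best, best_sim = None, -1.0
        for i, total in enumerate(self.sums):
            # Cosine is scale-invariant, so the running sum stands in for the centroid.
            sim = cosine(vec, total)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            self.sums[best] = [a + b for a, b in zip(self.sums[best], vec)]
            return best
        self.sums.append(list(vec))
        return len(self.sums) - 1


stream = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (1.0, 0.05)]
clusterer = IncrementalClusterer(threshold=0.8)
labels = [clusterer.add(v) for v in stream]  # [0, 0, 1, 1, 0]
```

Each item is compared only against the current cluster summaries, so the cost per item grows with the number of clusters rather than with the length of the stream.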

artificial intelligence, detection, machine learning, (16 more...)

doi: 10.18653/v1/2024.futured-1.2

2412.15257

Country:

Europe > France > Île-de-France > Paris > Paris (0.04)
Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Services (0.67)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)

Promskaia, Iuliia, O'Hagan, Adrian, Fop, Michael

Multiplex Dirichlet stochastic block model for clustering multidimensional compositional networks

arXiv.org Machine Learning Dec-16-2024

Network data often represent multiple types of relations, which can also denote exchanged quantities, and are typically encompassed in a weighted multiplex. Such data frequently exhibit clustering structures; however, traditional clustering methods are not well-suited for multiplex networks. Additionally, standard methods treat edge weights in their raw form, potentially biasing clustering towards a node's total weight capacity rather than reflecting cluster-related interaction patterns. To address this, we propose transforming edge weights into a compositional format, enabling the analysis of connection strengths in relative terms and removing the impact of nodes' total weights. We introduce a multiplex Dirichlet stochastic block model designed for multiplex networks with compositional layers. This model accounts for sparse compositional networks and enables joint clustering across different types of interactions. We validate the model through a simulation study and apply it to the international export data from the Food and Agriculture Organization of the United Nations.
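One plausible reading of the compositional transformation is that each node's outgoing weights in a layer are divided by that node's total, so that they become shares summing to one. The sketch below applies that reading to a single layer. Normalising per sending node is an assumption of this example, and the node names and weights are invented.

```python
def to_compositions(layer):
    """Convert raw outgoing edge weights into per-node shares that sum to 1.

    `layer` maps each node to a dict of {neighbour: raw_weight}. Nodes with no
    outgoing weight keep an empty composition.
    """
    shares = {}
    for node, edges in layer.items():
        total = sum(edges.values())
        shares[node] = {nbr: w / total for nbr, w in edges.items()} if total > 0 else {}
    return shares


# Hypothetical export volumes for one commodity layer.
exports = {
    "A": {"B": 30.0, "C": 10.0},
    "B": {"A": 2.0, "C": 2.0},
    "C": {},
}
composition = to_compositions(exports)
# A sends 75% of its volume to B and 25% to C, even though A's raw total is ten times B's.
```

After the transformation, A and B are compared by how they split their volume rather than by how much they trade in total, which is the effect the abstract describes.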

artificial intelligence, machine learning, stochastic block model, (19 more...)

2412.11971

Country:

North America > United States (0.14)
South America > Brazil (0.04)
South America > Argentina (0.04)
(79 more...)

Genre: Research Report (1.00)

Industry:

Food & Agriculture > Agriculture (0.68)
Government (0.66)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

arXiv.org Artificial Intelligence Dec-15-2024

Deep Spectral Clustering via Joint Spectral Embedding and Kmeans

Guo, Wengang, Ye, Wei

Spectral clustering is a popular clustering method. It first maps data into the spectral embedding space and then uses Kmeans to find clusters. However, the two decoupled steps prohibit joint optimization for the optimal solution. In addition, it needs to construct the similarity graph for samples, which suffers from the curse of dimensionality when the data are high-dimensional. To address these two challenges, we introduce Deep Spectral Clustering (DSC), which consists of two main modules: the spectral embedding module and the greedy Kmeans module. The former module learns to efficiently embed raw samples into the spectral embedding space using deep neural networks and power iteration. The latter module improves the cluster structures of Kmeans on the learned spectral embeddings by a greedy optimization strategy, which iteratively reveals the direction of the worst cluster structures and optimizes embeddings in this direction. To jointly optimize spectral embeddings and clustering, we seamlessly integrate the two modules and optimize them in an end-to-end manner. Experimental results on seven real-world datasets demonstrate that DSC achieves state-of-the-art clustering performance.
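For orientation, the classic decoupled pipeline that the abstract starts from can be sketched for the two-cluster case. The code below is not DSC: it uses plain power iteration on the normalized adjacency matrix to find the second eigenvector, and a sign split stands in for Kmeans. It assumes a connected, undirected graph with no isolated nodes; the function name and the toy graph are illustrative.

```python
import math


def spectral_bipartition(adj, iters=200):
    """Split a connected graph in two using the second eigenvector of D^-1/2 A D^-1/2."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    # Symmetrically normalized adjacency S; its eigenvalues lie in [-1, 1].
    s = [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # The top eigenvector of S (eigenvalue 1) is proportional to sqrt(degree).
    norm = math.sqrt(sum(deg))
    top = [math.sqrt(d) / norm for d in deg]
    v = [float(i + 1) for i in range(n)]  # deterministic, non-symmetric start vector
    for _ in range(iters):
        # Deflate: remove the component along the trivial top eigenvector.
        proj = sum(a * b for a, b in zip(v, top))
        v = [a - proj * t for a, t in zip(v, top)]
        # Multiply by (S + I); the shift makes every eigenvalue non-negative.
        v = [v[i] + sum(s[i][j] * v[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(a * a for a in v))
        v = [a / length for a in v]
    # Nodes on the same side as node 0 get label 0, the rest get label 1.
    return [0 if (x >= 0) == (v[0] >= 0) else 1 for x in v]


# Two triangles (nodes 0-2 and 3-5) joined by the single edge 2-3.
adjacency = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
labels = spectral_bipartition(adjacency)  # [0, 0, 0, 1, 1, 1]
```

The embedding here is computed first and the split is applied afterwards, which is exactly the decoupling the abstract argues prevents joint optimization.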

artificial intelligence, machine learning, spectral, (19 more...)

2412.1108

Country:

North America > United States (0.14)
Asia > Middle East > Jordan (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Pal, Sayantan, Das, Souvik, Srihari, Rohini K.

Beyond Discrete Personas: Personality Modeling Through Journal Intensive Conversations

arXiv.org Artificial Intelligence Dec-15-2024

Large Language Models (LLMs) have significantly improved personalized conversational capabilities. However, existing datasets like Persona Chat, Synthetic Persona Chat, and Blended Skill Talk rely on static, predefined personas. This approach often results in dialogues that fail to capture the fluid and evolving nature of human personalities. To overcome these limitations, we introduce a novel dataset with around 400,000 dialogues and a framework for generating personalized conversations using long-form journal entries from Reddit. Our approach clusters journal entries for each author and filters them by selecting the most representative cluster, ensuring that the retained entries best reflect the author's personality. We further refine the data by capturing the Big Five personality traits (openness, conscientiousness, extraversion, agreeableness, and neuroticism), ensuring that dialogues authentically reflect an individual's personality. Using Llama 3 70B, we generate high-quality, personality-rich dialogues grounded in these journal entries. Fine-tuning models on this dataset leads to an 11% improvement in capturing personality traits on average, outperforming existing approaches in generating more coherent and personality-driven dialogues.

large language model, machine learning, natural language, (20 more...)

2412.1125

Country:

North America > United States > Washington > King County > Seattle (0.14)
Asia > Singapore (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(17 more...)

Genre: Research Report > New Finding (0.67)

Industry:

Health & Medicine (0.47)
Media (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)