topic distribution
- Asia > Middle East > Jordan (0.05)
- North America > Canada (0.04)
- Europe > Portugal > Lisbon > Lisbon (0.04)
Empathy Level Prediction in Multi-Modal Scenario with Supervisory Documentation Assistance
Abstract--Prevalent empathy prediction techniques primarily concentrate on a singular modality, typically textual, thus neglecting multi-modal processing capabilities. They also overlook the utilization of certain privileged information, which may encompass additional empathetic content. In response, we introduce an advanced multi-modal empathy prediction method integrating video, audio, and text information. The method comprises the Multi-Modal Empathy Prediction and Supervisory Documentation Assisted Training. We use pre-trained networks in the empathy prediction network to extract features from various modalities, followed by a cross-modal fusion. This process yields a multi-modal feature representation, which is employed to predict empathy labels. T o enhance the extraction of text features, we incorporate supervisory documents as privileged information during the assisted training phase. Specifically, we apply the Latent Dirichlet Allocation model to identify potential topic distributions to constrain text features. These supervisory documents, created by supervisors, focus on the counseling topics and the counselor's display of empathy. Notably, this privileged information is only available during training and is not accessible during the prediction phase. Experimental results on the multi-modal and dialogue empathy datasets demonstrate that our approach is superior to the existing methods. MP A THY is characterized by an emotional response that arises from the interplay between inherent traits and situational factors. This empathetic response is spontaneously triggered, yet it is also sculpted by deliberate cognitive control.
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Vision (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.88)
Quantifying consistency and accuracy of Latent Dirichlet Allocation
Magsarjav, Saranzaya, Humphries, Melissa, Tuke, Jonathan, Mitchell, Lewis
Topic modelling in Natural Language Processing uncovers hidden topics in large, unlabelled text datasets. It is widely applied in fields such as information retrieval, content summarisation, and trend analysis across various disciplines. However, probabilistic topic models can produce different results when rerun due to their stochastic nature, leading to inconsistencies in latent topics. Factors like corpus shuffling, rare text removal, and document elimination contribute to these variations. This instability affects replicability, reliability, and interpretation, raising concerns about whether topic models capture meaningful topics or just noise. To address these problems, we defined a new stability measure that incorporates accuracy and consistency and uses the generative properties of LDA to generate a new corpus with ground truth. These generated corpora are run through LDA 50 times to determine the variability in the output. We show that LDA can correctly determine the underlying number of topics in the documents. We also find that LDA is more internally consistent, as the multiple reruns return similar topics; however, these topics are not the true topics.
- Oceania > Australia > South Australia > Adelaide (0.04)
- Europe > Middle East > Malta > Port Region > Southern Harbour District > Floriana (0.04)
Improving Romanian LLM Pretraining Data using Diversity and Quality Filtering
Negoita, Vlad, Masala, Mihai, Rebedea, Traian
Large Language Models (LLMs) have recently exploded in popularity, often matching or outperforming human abilities on many tasks. One of the key factors in training LLMs is the availability and curation of high-quality data. Data quality is especially crucial for under-represented languages, where high-quality corpora are scarce. In this work we study the characteristics and coverage of Romanian pretraining corpora and we examine how they differ from English data. By training a lightweight multitask model on carefully LLM-annotated Romanian texts, we are able to analyze and perform multi-level filtering (e.g., educational value, topic, format) to generate high-quality pretraining datasets. Our experiments show noteworthy trends in the topics present in Romanian and English data, while also proving the effectiveness of filtering data through improved LLM pretraining performance across multiple benchmarks.
- Europe > Romania > București - Ilfov Development Region > Municipality of Bucharest > Bucharest (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > Singapore (0.05)
- Asia > Middle East > Jordan (0.04)
- Asia > Middle East > Jordan (0.05)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Europe > Portugal > Lisbon > Lisbon (0.04)
NGTM: Substructure-based Neural Graph Topic Model for Interpretable Graph Generation
Zhuang, Yuanxin, Shen, Dazhong, Sun, Ying
Graph generation plays a pivotal role across numerous domains, including molecular design and knowledge graph construction. Although existing methods achieve considerable success in generating realistic graphs, their interpretability remains limited, often obscuring the rationale behind structural decisions. To address this challenge, we propose the Neural Graph Topic Model (NGTM), a novel generative framework inspired by topic modeling in natural language processing. NGTM represents graphs as mixtures of latent topics, each defining a distribution over semantically meaningful substructures, which facilitates explicit interpretability at both local and global scales. The generation process transparently integrates these topic distributions with a global structural variable, enabling clear semantic tracing of each generated graph. Experiments demonstrate that NGTM achieves competitive generation quality while uniquely enabling fine-grained control and interpretability, allowing users to tune structural features or induce biological properties through topic-level adjustments.
- Asia > China > Jiangsu Province > Nanjing (0.04)
- Asia > China > Guangdong Province > Guangzhou (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > China > Hong Kong (0.04)
Towards Leveraging Large Language Model Summaries for Topic Modeling in Source Code
Carissimi, Michele, Saletta, Martina, Ferretti, Claudio
Understanding source code is a topic of great interest in the software engineering community, since it can help programmers in various tasks such as software maintenance and reuse. Recent advances in large language models (LLMs) have demonstrated remarkable program comprehension capabilities, while transformer-based topic modeling techniques offer effective ways to extract semantic information from text. This paper proposes and explores a novel approach that combines these strengths to automatically identify meaningful topics in a corpus of Python programs. Our method consists in applying topic modeling on the descriptions obtained by asking an LLM to summarize the code. To assess the internal consistency of the extracted topics, we compare them against topics inferred from function names alone, and those derived from existing docstrings. Experimental results suggest that leveraging LLM-generated summaries provides interpretable and semantically rich representation of code structure. The promising results suggest that our approach can be fruitfully applied in various software engineering tasks such as automatic documentation and tagging, code search, software reorganization and knowledge discovery in large repositories.
- Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.05)
- Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.05)
- Europe > Italy > Lombardy > Milan (0.04)
- (3 more...)
Robust Bayesian Max-Margin Clustering
Changyou Chen, Jun Zhu, Xinhua Zhang
We present max-margin Bayesian clustering (BMC), a general and robust framework that incorporates the max-margin criterion into Bayesian clustering models, as well as two concrete models of BMC to demonstrate its flexibility and effectiveness in dealing with different clustering tasks. The Dirichlet process max-margin Gaussian mixture is a nonparametric Bayesian clustering model that relaxes the underlying Gaussian assumption of Dirichlet process Gaussian mixtures by incorporating max-margin posterior constraints, and is able to infer the number of clusters from data. We further extend the ideas to present max-margin clustering topic model, which can learn the latent topic representation of each document while at the same time cluster documents in the max-margin fashion. Extensive experiments are performed on a number of real datasets, and the results indicate superior clustering performance of our methods compared to related baselines.
- Asia > Middle East > Jordan (0.04)
- Oceania > Australia > Australian Capital Territory > Canberra (0.04)
- North America > United States > North Carolina > Durham County > Durham (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
Exploring Text Representations for Online Misinformation
Mis- and disinformation, commonly collectively called fake news, continue to menace society. Perhaps, the impact of this age-old problem is presently most plain in politics and healthcare. However, fake news is affecting an increasing number of domains. It takes many different forms and continues to shapeshift as technology advances. Though it arguably most widely spreads in textual form, e.g., through social media posts and blog articles. Thus, it is imperative to thwart the spread of textual misinformation, which necessitates its initial detection. This thesis contributes to the creation of representations that are useful for detecting misinformation. Firstly, it develops a novel method for extracting textual features from news articles for misinformation detection. These features harness the disparity between the thematic coherence of authentic and false news stories. In other words, the composition of themes discussed in both groups significantly differs as the story progresses. Secondly, it demonstrates the effectiveness of topic features for fake news detection, using classification and clustering. Clustering is particularly useful because it alleviates the need for a labelled dataset, which can be labour-intensive and time-consuming to amass. More generally, it contributes towards a better understanding of misinformation and ways of detecting it using Machine Learning and Natural Language Processing.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- Europe > France (0.04)
- Europe > Russia (0.04)
- (19 more...)
- Research Report > New Finding (1.00)
- Overview (1.00)
- Research Report > Promising Solution (0.65)
- Media > News (1.00)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
- Health & Medicine > Therapeutic Area > Immunology (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)