bertopic
A Reproducible Framework for Neural Topic Modeling in Focus Group Analysis
Arfaoui, Heger, Hergli, Mohammed Iheb, Benzina, Beya, BenMiled, Slimane
Focus group discussions generate rich qualitative data but their analysis traditionally relies on labor-intensive manual coding that limits scalability and reproducibility. We present a systematic framework for applying BERTopic to focus group transcripts using data from ten focus groups exploring HPV vaccine perceptions in Tunisia (1,075 utterances). We conducted comprehensive hyperparameter exploration across 27 configurations, evaluating each through bootstrap stability analysis, performance metrics, and comparison with LDA baseline. Bootstrap analysis revealed that stability metrics (NMI and ARI) exhibited strong disagreement (r = -0.691) and showed divergent relationships with coherence, demonstrating that stability is multifaceted rather than monolithic. Our multi-criteria selection framework yielded a 7-topic model achieving 18\% higher coherence than optimized LDA (0.573 vs. 0.486) with interpretable topics validated through independent human evaluation (ICC = 0.700, weighted Cohen's kappa = 0.678). These findings demonstrate that transformer-based topic modeling can extract interpretable themes from small focus group transcript corpora when systematically configured and validated, while revealing that quality metrics capture distinct, sometimes conflicting constructs requiring multi-criteria evaluation. We provide complete documentation and code to support reproducibility.
- Africa > Middle East > Tunisia > Tunis Governorate > Tunis (0.05)
- Asia > Middle East > Jordan (0.04)
- Europe > Switzerland (0.04)
- Europe > Poland (0.04)
- Research Report > New Finding (1.00)
- Questionnaire & Opinion Survey (1.00)
- Health & Medicine > Therapeutic Area > Vaccines (1.00)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
- Health & Medicine > Therapeutic Area > Immunology (1.00)
Evaluating BERTopic on Open-Ended Data: A Case Study with Belgian Dutch Daily Narratives
Kandala, Ratna, Vanhasbroeck, Niels, Hoemann, Katie
While traditional probabilistic models such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) have been foundational, their underlying bag - of - words assumption limits their ability to capture complex semantics. A recent paradigm shift towards models like BERTopic (Grootendorst, 2022), a state - of - the - art (SOTA) model which leverages contextualized embeddings from pre - trained transformers, has shown significant promise in generating more semantically coherent topics. These models can capture nuanced relationships, including domain - speci fic named entities and morphologically rich constructs, critical for linguistically complex data. However, despite this progress, two significant gaps persist in literature. First, research has overwhelmingly focused on high - resource, standardized languages, with a lot of scope left for under - resourced languages to be unexplored. This focus not only limits the generalizability of existing models but also risks perp etuating a technological bias where the nuances of smaller linguistic communities are overlooked. Models trained on standard corpora often fail to capture the unique lexical and semantic patterns of regional dialects or sociolects, leading to a superficial or even inaccurate understanding of the underlying discourse (Kamilo g lu, 2025) . Second, the predominant application domain has been structured or short - form text like news articles or social media posts (Egger et al., 2022; Schäfer et al., 2024), while the challenges of modeling unstructured, open - ended personal narratives have received less attention. Distinct from the short - form, often decontextualized nature of social media data, daily narratives provide granular, contextually - grounded accounts of lived experience.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.04)
- (7 more...)
- Information Technology > Security & Privacy (0.93)
- Health & Medicine (0.68)
Modeling Political Discourse with Sentence-BERT and BERTopic
Mendonca, Margarida, Figueira, Alvaro
Social media has reshaped political discourse, offering politicians a platform for direct engagement while reinforcing polarization and ideological divides. This study introduces a novel topic evolution framework that integrates BERTopic-based topic modeling with Moral Foundations Theory (MFT) to analyze the longevity and moral dimensions of political topics in Twitter activity during the 117th U.S. Congress. We propose a methodology for tracking dynamic topic shifts over time and measuring their association with moral values and quantifying topic persistence. Our findings reveal that while overarching themes remain stable, granular topics tend to dissolve rapidly, limiting their long-term influence. Moreover, moral foundations play a critical role in topic longevity, with Care and Loyalty dominating durable topics, while partisan differences manifest in distinct moral framing strategies. This work contributes to the field of social network analysis and computational political discourse by offering a scalable, interpretable approach to understanding moral-driven topic evolution on social media.
- Asia > Middle East > Israel (0.05)
- Europe > Ukraine (0.05)
- Europe > Russia (0.05)
- (7 more...)
- Information Technology > Services (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
- Government > Voting & Elections (0.93)
Improving Topic Modeling of Social Media Short Texts with Rephrasing: A Case Study of COVID-19 Related Tweets
Xin, Wangjiaxuan, Yin, Shuhua, Chen, Shi, Ge, Yaorong
Social media platforms such as Twitter (now X) provide rich data for analyzing public discourse, especially during crises such as the COVID-19 pandemic. However, the brevity, informality, and noise of social media short texts often hinder the effectiveness of traditional topic modeling, producing incoherent or redundant topics that are often difficult to interpret. To address these challenges, we have developed \emph{TM-Rephrase}, a model-agnostic framework that leverages large language models (LLMs) to rephrase raw tweets into more standardized and formal language prior to topic modeling. Using a dataset of 25,027 COVID-19-related Twitter posts, we investigate the effects of two rephrasing strategies, general- and colloquial-to-formal-rephrasing, on multiple topic modeling methods. Results demonstrate that \emph{TM-Rephrase} improves three metrics measuring topic modeling performance (i.e., topic coherence, topic uniqueness, and topic diversity) while reducing topic redundancy of most topic modeling algorithms, with the colloquial-to-formal strategy yielding the greatest performance gains and especially for the Latent Dirichlet Allocation (LDA) algorithm. This study contributes to a model-agnostic approach to enhancing topic modeling in public health related social media analysis, with broad implications for improved understanding of public discourse in health crisis as well as other important domains.
- Europe > Austria > Vienna (0.14)
- North America > United States > North Carolina > Mecklenburg County > Charlotte (0.04)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- (3 more...)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
- Health & Medicine > Therapeutic Area > Immunology (1.00)
- Health & Medicine > Epidemiology (1.00)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
AI and analytics in sports: Leveraging BERTopic to map the past and chart the future
Purpose: The purpose of this study is to map the body of scholarly literature at the intersection of artificial intelligence (AI), analytics and sports and thereafter, leverage the insights generated to chart guideposts for future research. Design/methodology/approach: The study carries out systematic literature review (SLR). Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) protocol is leveraged to identify 204 journal articles pertaining to utilization of AI and analytics in sports published during 2002 to 2024. We follow it up with extraction of the latent topics from sampled articles by leveraging the topic modelling technique of BERTopic. Findings: The study identifies the following as predominant areas of extant research on usage of AI and analytics in sports: performance modelling, physical and mental health, social media sentiment analysis, and tactical tracking. Each extracted topic is further examined in terms of its relative prominence, representative studies, and key term associations. Drawing on these insights, the study delineates promising avenues for future inquiry. Research limitations/implications: The study offers insights to academicians and sports administrators on transformational impact of AI and analytics in sports. Originality/value: The study introduces BERTopic as a novel approach for extracting latent structures in sports research, thereby advancing both scholarly understanding and the methodological toolkit of the field.
- Europe > Switzerland > Basel-City > Basel (0.04)
- Asia > India > Odisha (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- (4 more...)
- Research Report (1.00)
- Overview (1.00)
- Leisure & Entertainment > Sports > Soccer (1.00)
- Health & Medicine > Consumer Health (1.00)
- Education (0.94)
- (2 more...)
ProtoTopic: Prototypical Network for Few-Shot Medical Topic Modeling
Licht, Martin, Ketabi, Sara, Khalvati, Farzad
Topic modeling is a useful tool for analyzing large corpora of written documents, particularly academic papers. Despite a wide variety of proposed topic modeling techniques, these techniques do not perform well when applied to medical texts. This can be due to the low number of documents available for some topics in the healthcare domain. In this paper, we propose ProtoTopic, a prototypical network-based topic model used for topic generation for a set of medical paper abstracts. Prototypical networks are efficient, explainable models that make predictions by computing distances between input datapoints and a set of prototype representations, making them particularly effective in low-data or few-shot learning scenarios. With ProtoTopic, we demonstrate improved topic coherence and diversity compared to two topic modeling baselines used in the literature, demonstrating the ability of our model to generate medically relevant topics even with limited data.
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Question-Driven Analysis and Synthesis: Building Interpretable Thematic Trees with LLMs for Text Clustering and Controllable Generation
Unsupervised analysis of text corpora is challenging, especially in data-scarce domains where traditional topic models struggle. While these models offer a solution, they typically describe clusters with lists of keywords that require significant manual effort to interpret and often lack semantic coherence. To address this critical interpretability gap, we introduce Recursive Thematic Partitioning (RTP), a novel framework that leverages Large Language Models (LLMs) to interactively build a binary tree. Each node in the tree is a natural language question that semantically partitions the data, resulting in a fully interpretable taxonomy where the logic of each cluster is explicit. Our experiments demonstrate that RTP's question-driven hierarchy is more interpretable than the keyword-based topics from a strong baseline like BERTopic. Furthermore, we establish the quantitative utility of these clusters by showing they serve as powerful features in downstream classification tasks, particularly when the data's underlying themes correlate with the task labels. RTP introduces a new paradigm for data exploration, shifting the focus from statistical pattern discovery to knowledge-driven thematic analysis. Furthermore, we demonstrate that the thematic paths from the RTP tree can serve as structured, controllable prompts for generative models. This transforms our analytical framework into a powerful tool for synthesis, enabling the consistent imitation of specific characteristics discovered in the source corpus.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Middle East > Jordan (0.04)
- South America > Chile (0.04)
- (10 more...)
- Research Report > New Finding (0.68)
- Research Report > Experimental Study (0.46)
- Leisure & Entertainment (0.93)
- Media > Film (0.68)
Investigating Thematic Patterns and User Preferences in LLM Interactions using BERTopic
Bhandarkar, Abhay, Mishra, Gaurav, Juchani, Khushi, Singhal, Harsh
This study applies BERTopic, a transformer-based topic modeling technique, to the lmsys-chat-1m dataset, a multilingual conversational corpus built from head-to-head evaluations of large language models (LLMs). Each user prompt is paired with two anonymized LLM responses and a human preference label, used to assess user evaluation of competing model outputs. The main objective is uncovering thematic patterns in these conversations and examining their relation to user preferences, particularly if certain LLMs are consistently preferred within specific topics. A robust preprocessing pipeline was designed for multilingual variation, balancing dialogue turns, and cleaning noisy or redacted data. BERTopic extracted over 29 coherent topics including artificial intelligence, programming, ethics, and cloud infrastructure. We analysed relationships between topics and model preferences to identify trends in model-topic alignment. Visualization techniques included inter-topic distance maps, topic probability distributions, and model-versus-topic matrices. Our findings inform domain-specific fine-tuning and optimization strategies for improving real-world LLM performance and user satisfaction.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)
LLM-Assisted Topic Reduction for BERTopic on Social Media Data
Janssens, Wannes, Bogaert, Matthias, Poel, Dirk Van den
The BERTopic framework leverages transformer embeddings and hierarchical clustering to extract latent topics from unstructured text corpora. While effective, it often struggles with social media data, which tends to be noisy and sparse, resulting in an excessive number of overlapping topics. Recent work explored the use of large language models for end-to-end topic modelling. However, these approaches typically require significant computational overhead, limiting their scalability in big data contexts. In this work, we propose a framework that combines BERTopic for topic generation with large language models for topic reduction. The method first generates an initial set of topics and constructs a representation for each. These representations are then provided as input to the language model, which iteratively identifies and merges semantically similar topics. We evaluate the approach across three Twitter/X datasets and four different language models. Our method outperforms the baseline approach in enhancing topic diversity and, in many cases, coherence, with some sensitivity to dataset characteristics and initial parameter selection.
Uncovering AI Governance Themes in EU Policies using BERTopic and Thematic Analysis
Golpayegani, Delaram, Lasek-Markey, Marta, Younus, Arjumand, Kerr, Aphra, Lewis, Dave
The upsurge of policies and guidelines that aim to ensure Artificial Intelligence (AI) systems are safe and trustworthy has led to a fragmented landscape of AI governance. The European Union (EU) is a key actor in the development of such policies and guidelines. Its High-Level Expert Group (HLEG) issued an influential set of guidelines for trustworthy AI, followed in 2024 by the adoption of the EU AI Act. While the EU policies and guidelines are expected to be aligned, they may differ in their scope, areas of emphasis, degrees of normativity, and priorities in relation to AI. To gain a broad understanding of AI governance from the EU perspective, we leverage qualitative thematic analysis approaches to uncover prevalent themes in key EU documents, including the AI Act and the HLEG Ethics Guidelines. We further employ quantitative topic modelling approaches, specifically through the use of the BERTopic model, to enhance the results and increase the document sample to include EU AI policy documents published post-2018. We present a novel perspective on EU policies, tracking the evolution of its approach to addressing AI governance.
- Asia > China (0.15)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.14)
- North America > United States (0.14)
- (2 more...)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government > Regional Government > Europe Government (0.91)