Evaluating BERTopic on Open-Ended Data: A Case Study with Belgian Dutch Daily Narratives
Ratna Kandala, Niels Vanhasbroeck, Katie Hoemann
arXiv.org Artificial Intelligence
While traditional probabilistic models such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) have been foundational, their underlying bag-of-words assumption limits their ability to capture complex semantics. A recent paradigm shift towards models like BERTopic (Grootendorst, 2022), a state-of-the-art (SOTA) model that leverages contextualized embeddings from pre-trained transformers, has shown significant promise in generating more semantically coherent topics. These models can capture nuanced relationships, including domain-specific named entities and morphologically rich constructs, which are critical for linguistically complex data. Despite this progress, two significant gaps persist in the literature. First, research has overwhelmingly focused on high-resource, standardized languages, leaving under-resourced languages largely unexplored. This focus not only limits the generalizability of existing models but also risks perpetuating a technological bias in which the nuances of smaller linguistic communities are overlooked. Models trained on standard corpora often fail to capture the unique lexical and semantic patterns of regional dialects or sociolects, leading to a superficial or even inaccurate understanding of the underlying discourse (Kamiloğlu, 2025). Second, the predominant application domain has been structured or short-form text such as news articles or social media posts (Egger et al., 2022; Schäfer et al., 2024), while the challenges of modeling unstructured, open-ended personal narratives have received less attention. Unlike short-form, often decontextualized social media data, daily narratives provide granular, contextually grounded accounts of lived experience.
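The bag-of-words limitation mentioned above can be illustrated with a minimal sketch. It is not taken from the paper: the helper name `bag_of_words` and the two example sentences are assumptions chosen for illustration, and the whitespace tokenizer is deliberately naive. The point is only that a representation built from word counts discards word order, so two sentences with opposite meanings can become indistinguishable.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase, whitespace-separated tokens; word order is discarded."""
    return Counter(text.lower().split())


# Two illustrative sentences with the same words but opposite meanings.
sentence_a = "the dog bit the man"
sentence_b = "the man bit the dog"

# Both sentences reduce to the same multiset of tokens,
# so a pure bag-of-words model cannot tell them apart.
same_bag = bag_of_words(sentence_a) == bag_of_words(sentence_b)
print(same_bag)
```

Embedding-based approaches such as the one the text attributes to BERTopic encode each document as a whole, which is why they can, in principle, separate cases like these.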
Nov-12-2025
- Country:
- Asia > Middle East
- Jordan (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.14)
- Europe
- Belgium > Flanders
- Flemish Brabant > Leuven (0.04)
- Netherlands > North Holland
- Amsterdam (0.04)
- Sweden > Vaestra Goetaland
- Gothenburg (0.04)
- United Kingdom > England
- Cambridgeshire > Cambridge (0.04)
- North America > United States
- California (0.04)
- Kansas (0.04)
- Massachusetts > Suffolk County
- Boston (0.04)
- New Jersey > Bergen County
- Mahwah (0.04)
- Genre:
- Research Report > New Finding (0.94)
- Industry:
- Health & Medicine (0.68)
- Information Technology > Security & Privacy (0.93)
- Technology: