Hierarchical Level-Wise News Article Clustering via Multilingual Matryoshka Embeddings
Hanley, Hans W. A., Durumeric, Zakir
arXiv.org Artificial Intelligence
Contextual large language model embeddings are increasingly used for topic modeling and clustering. However, current methods often scale poorly, rely on opaque similarity metrics, and struggle in multilingual settings. In this work, we present a novel, scalable, interpretable, hierarchical, and multilingual approach to clustering news articles and social media data. To do this, we first train multilingual Matryoshka embeddings that can determine story similarity at varying levels of granularity based on which subset of the dimensions of the embeddings is examined. This embedding model achieves state-of-the-art performance on the SemEval 2022 Task 8 test dataset (Pearson $\rho$ = 0.816). With the embedding model trained, we then develop an efficient hierarchical clustering algorithm that leverages the hierarchical nature of Matryoshka embeddings to identify unique news stories, narratives, and themes. We conclude by illustrating how our approach can identify and cluster stories, narratives, and overarching themes within real-world news datasets.
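As a rough illustration of the idea in the abstract, the sketch below clusters items level by level, comparing only a prefix of each embedding at the coarse level and the full vector at the fine level. It is not the authors' algorithm: the helper names (`prefix_cosine`, `threshold_clusters`, `level_wise_clusters`), the thresholds, the connected-components grouping, and the toy 4-dimensional vectors are all assumptions made for this example.

```python
import numpy as np

def prefix_cosine(emb, dims):
    """Cosine similarity matrix using only the first `dims` dimensions.
    Assumes every truncated row has a nonzero norm."""
    sub = emb[:, :dims]
    sub = sub / np.linalg.norm(sub, axis=1, keepdims=True)
    return sub @ sub.T

def threshold_clusters(sim, threshold):
    """Label connected components of the graph with an edge where sim >= threshold."""
    n = sim.shape[0]
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i, j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

def level_wise_clusters(emb, levels):
    """`levels` is a list of (dims, threshold) pairs ordered coarse to fine.
    Returns, for each row, a tuple with one cluster label per level."""
    n = emb.shape[0]
    paths = [() for _ in range(n)]
    groups = [list(range(n))]
    for dims, threshold in levels:
        next_groups = []
        for group in groups:
            # Re-cluster each coarse group using a longer embedding prefix.
            labels = threshold_clusters(prefix_cosine(emb[group], dims), threshold)
            buckets = {}
            for idx, lab in zip(group, labels):
                paths[idx] = paths[idx] + (lab,)
                buckets.setdefault(lab, []).append(idx)
            next_groups.extend(buckets.values())
        groups = next_groups
    return paths

# Toy data (hypothetical): the first 2 dims act as a "theme" signal,
# the last 2 dims separate "stories" within a theme.
emb = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
])
paths = level_wise_clusters(emb, levels=[(2, 0.9), (4, 0.9)])
print(paths)
```

In this toy setup, rows 0 and 1 share a coarse cluster but split at the fine level, while rows 2 and 3 stay together at both levels. A real pipeline would replace the toy vectors with embeddings from a trained Matryoshka model and would likely need a more scalable grouping step than the quadratic loop used here.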
Jun-3-2025
- Country:
  - Africa > Nigeria
    - Kogi State (0.04)
  - Asia
    - China (0.04)
    - India (0.04)
    - Malaysia (0.04)
    - Middle East
      - Israel (0.04)
      - Jordan (0.04)
      - Palestine > Gaza Strip
        - Gaza Governorate > Gaza (0.04)
      - Republic of Türkiye (0.04)
      - Syria > Damascus Governorate
        - Damascus (0.04)
    - North Korea (0.28)
    - Russia (0.14)
  - Europe
    - Russia (0.04)
    - Ukraine (0.14)
    - United Kingdom (0.14)
  - North America > United States
    - California > Santa Clara County
      - Palo Alto (0.04)
    - Nevada > Clark County
      - Las Vegas (0.04)
    - New York > Rockland County
      - Monsey (0.14)
- Genre:
  - Overview (0.68)
  - Research Report (1.00)
- Industry:
  - Government
  - Information Technology (1.00)
  - Leisure & Entertainment (0.93)
  - Media > News (1.00)
- Technology:
  - Information Technology
    - Artificial Intelligence
      - Machine Learning
        - Neural Networks > Deep Learning (0.94)
        - Statistical Learning > Clustering (1.00)
      - Natural Language
        - Chatbot (0.94)
        - Large Language Model (1.00)
        - Text Processing (1.00)
      - Representation & Reasoning (1.00)
    - Communications (1.00)
    - Data Science > Data Mining (1.00)