German Text Embedding Clustering Benchmark

Wehrli, Silvan, Arnrich, Bert, Irrgang, Christopher

arXiv.org Artificial Intelligence, Jan-5-2024

This work introduces a benchmark for assessing how well German text embeddings cluster across different domains. The benchmark is motivated by the increasing use of clustered neural text embeddings in tasks that require grouping texts (such as topic modeling) and by the need for German resources in existing benchmarks. We provide an initial analysis of a range of pre-trained mono- and multilingual models, evaluated on the output of different clustering algorithms. The results identify strong-performing mono- and multilingual models and show that reducing the dimensionality of the embeddings can further improve clustering. Additionally, we conduct experiments with continued pre-training of German BERT models to estimate the benefits of this additional training. Our experiments suggest that significant performance improvements are possible for short texts. All code and datasets are publicly available.
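As a rough illustration of what "evaluating embeddings on the outcome of a clustering algorithm" can look like, the sketch below clusters a set of vectors with a plain k-means and scores the result against gold topic labels. Everything in it is an assumption made for illustration and is not taken from the paper: the vectors are synthetic stand-ins for real text embeddings, the topic names are made up, k-means is only one possible clustering algorithm, and purity is used as a simple score that may differ from the benchmark's actual metric.

```python
import math
import random
from collections import Counter


def kmeans(vectors, k, iterations=20):
    """Lloyd's k-means with a deterministic farthest-point initialisation."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Next centroid: the vector farthest from its nearest chosen centroid.
        centroids.append(
            max(vectors, key=lambda v: min(math.dist(v, c) for c in centroids))
        )
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach every vector to its closest centroid.
        assignment = [
            min(range(k), key=lambda j: math.dist(v, centroids[j])) for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def purity(assignment, labels):
    """Share of items that carry the majority gold label of their cluster."""
    clusters = {}
    for cluster_id, label in zip(assignment, labels):
        clusters.setdefault(cluster_id, []).append(label)
    majority_total = sum(Counter(ys).most_common(1)[0][1] for ys in clusters.values())
    return majority_total / len(labels)


# Synthetic stand-in for text embeddings (assumption): three hypothetical
# topics, each a noisy cloud around its own centre in a 3-dimensional space.
rng = random.Random(0)
topic_centres = {
    "sport": [5.0, 0.0, 0.0],
    "politik": [0.0, 5.0, 0.0],
    "wetter": [0.0, 0.0, 5.0],
}
vectors, labels = [], []
for topic, centre in topic_centres.items():
    for _ in range(30):
        vectors.append([c + rng.gauss(0, 0.5) for c in centre])
        labels.append(topic)

assignment = kmeans(vectors, k=3)
score = purity(assignment, labels)
print(f"purity = {score:.2f}")
```

In a real evaluation the synthetic vectors would be replaced by embeddings produced by the model under test, and the score would be compared across models and clustering algorithms.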
