A model and package for German ColBERT

Thuong Dang, Qiqi Chen

arXiv.org Artificial Intelligence 

The original ColBERT model was proposed by Khattab and Zaharia [8], introducing the MaxSim scoring function based on token-level interactions. The model was trained using a softmax cross-entropy loss over triplets derived from the MS MARCO Ranking [1] and TREC Complex Answer Retrieval (TREC CAR) [5] datasets, leveraging the English BERT model [4] as its backbone encoder. The ColBERT MaxSim score can be interpreted as a substitute for the BM25 score used in full-text search; consequently, there are similarities between the ColBERT retrieval method and BM25-based full-text search, which will be discussed in detail in Section 2. ColBERT is flexible and can be used either as a first-stage retrieval method or as a reranker. Because the ColBERT score is computed at the level of token similarities, it can be applied in contexts where keyword similarities are significant. A ColBERT model was also trained for Japanese [3], where the author also discussed different strategies for choosing hard negatives using the multilingual e5 embedding model and BM25.
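To make the token-level interaction concrete, the following is a minimal sketch of the MaxSim score: for each query token embedding, take the maximum cosine similarity against all document token embeddings, then sum over query tokens. The function name and the use of NumPy arrays are illustrative assumptions, not part of the original ColBERT codebase.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """MaxSim: sum over query tokens of the max similarity to any doc token.

    query_emb: (num_query_tokens, dim) token embeddings of the query
    doc_emb:   (num_doc_tokens, dim) token embeddings of the document
    """
    # Normalize rows so dot products become cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sim = q @ d.T                        # (num_query_tokens, num_doc_tokens)
    # Max over document tokens, then sum over query tokens.
    return float(sim.max(axis=1).sum())

# Toy usage with orthogonal token embeddings:
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0]])
print(maxsim_score(q, q))  # both query tokens match perfectly -> 2.0
print(maxsim_score(q, d))  # only the first query token matches -> 1.0
```

This per-query-token max is what allows ColBERT to behave like a soft keyword matcher: each query token independently seeks its best-matching document token, analogous to per-term contributions in BM25.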