A model and package for German ColBERT

Thuong Dang, Qiqi Chen

arXiv.org Artificial Intelligence 

The original ColBERT model was proposed by Khattab and Zaharia [8], introducing the MaxSim scoring function based on token-level interactions. The model was trained using a softmax cross-entropy loss over triplets derived from the MS MARCO Ranking [1] and TREC Complex Answer Retrieval (TREC CAR) [5] datasets, leveraging the English BERT model [4] as its backbone encoder. The ColBERT MaxSim score can be interpreted as a substitute for the BM25 score used in full-text search; consequently, there are similarities between the ColBERT retrieval method and BM25-based full-text search, which will be discussed in detail in Section 2. ColBERT is flexible and can be used either as a first-stage retrieval method or as a reranker. Because the ColBERT score is computed at the level of token similarities, it can be applied in contexts where keyword similarities are significant. A ColBERT model was also trained for Japanese [3], where the author also discussed different strategies for choosing hard negatives using the multilingual e5 embedding model and BM25.
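To make the token-level interaction concrete, the following is a minimal sketch of the MaxSim score: for each query token embedding, take the maximum cosine similarity against all document token embeddings, then sum over query tokens. The function name and the use of NumPy arrays are illustrative assumptions, not part of the original ColBERT codebase.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """MaxSim: sum over query tokens of the max similarity to any doc token.

    query_emb: (num_query_tokens, dim) token embeddings of the query
    doc_emb:   (num_doc_tokens, dim) token embeddings of the document
    """
    # Normalize rows so dot products become cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sim = q @ d.T                        # (num_query_tokens, num_doc_tokens)
    # Max over document tokens, then sum over query tokens.
    return float(sim.max(axis=1).sum())

# Toy usage with orthogonal token embeddings:
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0]])
print(maxsim_score(q, q))  # both query tokens match perfectly -> 2.0
print(maxsim_score(q, d))  # only the first query token matches -> 1.0
```

This per-query-token max is what allows ColBERT to behave like a soft keyword matcher: each query token independently seeks its best-matching document token, analogous to per-term contributions in BM25.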