FuDoBa: Fusing Document and Knowledge Graph-based Representations with Bayesian Optimisation

Boshko Koloski, Senja Pollak, Roberto Navigli, Blaž Škrlj

arXiv.org Artificial Intelligence 

Efficient and rich document representations are the building blocks for many natural language processing (NLP) tasks such as classification or clustering [1]. Contemporary methods for representing documents focus on distilling representations from either pre-trained language models (PLMs) such as BERT [2] or large language models (LLMs) such as Llama3 [3], exploiting the rich semantic knowledge acquired during pre-training on vast text corpora. For instance, Sentence-BERT [4] builds document representations by pooling over pre-trained BERT-based word embeddings, which are further refined through contrastive learning with Siamese networks. Similarly, LLM2Vec [5] converts the causal attention mask of an LLM into a bidirectional one, post-trains the model on a masked next-token prediction task, and finally applies contrastive training, as in Sentence-BERT, to refine the mean-pooled representations. Despite good performance on public benchmarks such as MTEB [1], contrastively trained models require a dataset of sentence triplets (i.e., query, positive answer, and negative answer), which is often costly or infeasible to acquire.
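To make the two ingredients mentioned above concrete, the following is a minimal sketch of mean pooling over token embeddings and of a triplet objective over (query, positive, negative) examples. It is an illustration only: the function names, the choice of cosine distance, and the margin value are assumptions made here, not the exact formulation used by Sentence-BERT or LLM2Vec.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padded positions (mask value 0)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m]
    if not kept:
        # No real tokens: fall back to a zero vector (illustrative choice).
        return [0.0] * dim
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def triplet_loss(query, positive, negative, margin=0.5):
    """Generic triplet margin loss (hypothetical margin, cosine distance).

    The loss is zero once the positive is closer to the query than the
    negative by at least `margin`.
    """
    gap = cosine_distance(query, positive) - cosine_distance(query, negative)
    return max(0.0, gap + margin)


# Toy document: two real tokens and one padding token.
doc_vector = mean_pool([[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]], [1, 1, 0])
```

The sketch also shows why triplets are needed: the loss cannot be computed for a query without both a matching and a non-matching example, which is the data requirement the paragraph above identifies as costly.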
