FuDoBa: Fusing Document and Knowledge Graph-based Representations with Bayesian Optimisation

Koloski, Boshko, Pollak, Senja, Navigli, Roberto, Škrlj, Blaž

Jul-10-2025–arXiv.org Artificial Intelligence

Efficient and rich document representations are the building blocks for many natural language processing (NLP) tasks such as classification or clustering [1]. Contemporary methods for representing documents focus on distilling representations from either pre-trained language models (PLMs) such as BERT [2] or large language models (LLMs) such as Llama3 [3], exploiting the rich semantic knowledge acquired during pre-training on vast text corpora. For instance, Sentence-BERT [4] builds document representations by pooling over pre-trained BERT-based word embeddings, which are further refined through contrastive learning with Siamese networks. Similarly, LLM2Vec [5] converts the causal attention masking of LLMs into bi-directional attention, post-trains the LLM on a masked next-token prediction task, and finally, as in Sentence-BERT, trains with a contrastive objective, obtaining the final representations via mean pooling. Despite good performance on public benchmarks such as MTEB [1], contrastively pre-trained models require a dataset of sentence triplets (i.e., query, positive answer, and negative answer), which is often costly or infeasible to acquire.
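As a rough illustration of the two ingredients mentioned above, mean pooling over token embeddings and a triplet-based contrastive objective, the following NumPy sketch may help. It is not the Sentence-BERT or LLM2Vec implementation: the function names, the use of cosine distance, and the margin value are illustrative assumptions, and small hand-written arrays stand in for real model outputs.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per document, ignoring padded positions.

    token_embeddings: array of shape (batch, tokens, dim)
    attention_mask:   array of shape (batch, tokens), 1 = real token, 0 = padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for fully padded rows.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


def cosine_distance(a, b):
    """Row-wise cosine distance (1 - cosine similarity)."""
    a_n = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return 1.0 - (a_n * b_n).sum(axis=-1)


def triplet_loss(query, positive, negative, margin=0.5):
    """Hinge-style triplet loss (margin is a hypothetical choice):
    the query should be closer to the positive than to the negative
    by at least `margin`."""
    d_pos = cosine_distance(query, positive)
    d_neg = cosine_distance(query, negative)
    return np.maximum(0.0, d_pos - d_neg + margin).mean()


# Toy usage: one document with two real tokens and one padded token.
tokens = np.array([[[1.0, 0.0], [3.0, 0.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
doc_vector = mean_pool(tokens, mask)  # the padded token is ignored
```

The triplet loss makes the data requirement concrete: every training example needs a query, a positive, and a negative, which is the annotation cost the paragraph above points to.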
