colbert
Rethinking the Role of Token Retrieval in Multi-Vector Retrieval
Multi-vector retrieval models such as ColBERT [Khattab et al., 2020] allow token-level interactions between queries and documents, and hence achieve state of the art on many information retrieval benchmarks. However, their non-linear scoring function cannot be scaled to millions of documents, necessitating a three-stage process for inference: retrieving initial candidates via token retrieval, accessing all token vectors, and scoring the initial candidate documents. The non-linear scoring function is applied over all token vectors of each candidate document, making the inference process complicated and slow. In this paper, we aim to simplify the multi-vector retrieval by rethinking the role of token retrieval. We present XTR, ConteXtualized Token Retriever, which introduces a simple, yet novel, objective function that encourages the model to retrieve the most important document tokens first. The improvement to token retrieval allows XTR to rank candidates only using the retrieved tokens rather than all tokens in the document, and enables a newly designed scoring stage that is two-to-three orders of magnitude cheaper than that of ColBERT.
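As a rough illustration of the idea in the abstract above (ranking candidates from the retrieved tokens alone, without loading every token vector of each document), here is a minimal Python sketch. It is not the XTR implementation: the function name `score_from_retrieved_tokens`, the input layout, the averaging over query tokens, and the choice to let a query token with no retrieved hit for a document contribute zero are all assumptions made for this example.

```python
from collections import defaultdict

def score_from_retrieved_tokens(retrieved):
    """Score documents using only token-retrieval hits.

    retrieved[i] is the list of (doc_id, similarity) pairs returned by
    token retrieval for query token i (hypothetical input layout).
    """
    # doc_id -> {query token index: best similarity seen for that token}
    best = defaultdict(dict)
    for i, hits in enumerate(retrieved):
        for doc_id, sim in hits:
            if sim > best[doc_id].get(i, float("-inf")):
                best[doc_id][i] = sim
    n = len(retrieved)
    # Assumption: a query token with no hit for a document adds nothing.
    return {doc: sum(per_tok.values()) / n for doc, per_tok in best.items()}

# Toy token-retrieval output for a two-token query.
retrieved = [
    [("A", 0.9), ("B", 0.8), ("A", 0.7)],  # hits for query token 0
    [("A", 0.6), ("C", 0.5)],              # hits for query token 1
]
scores = score_from_retrieved_tokens(retrieved)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy input, document A is hit by both query tokens and therefore ranks above B and C, which are each hit by only one.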
Appendices for Baleen A Data Details
Table 6: Sizes of the splits of the datasets used in this work. It contains approximately 5M passages (1.5 GiB uncompressed). We implement Baleen using Python 3.7 and PyTorch 1.6 and rely extensively on the HuggingFace We train and test with automatic mixed precision that is built into PyTorch. To train the single-hop retriever used to initiate the supervision procedure of 3.2, we follow the training strategy of Khattab et al. ColBERT model to create training triples, and then we train our retriever (in this case, FLIPR for first-hop) with these triples.
ModernBERT + ColBERT: Enhancing biomedical RAG through an advanced re-ranking retriever
Rivera, Eduardo Martínez, Menolascina, Filippo
Retrieval-Augmented Generation (RAG) is a powerful technique for enriching Large Language Models (LLMs) with external knowledge, allowing for factually grounded responses, a critical requirement in high-stakes domains such as healthcare. However, the efficacy of RAG systems is fundamentally restricted by the performance of their retrieval module, since irrelevant or semantically misaligned documents directly compromise the accuracy of the final generated response. General-purpose dense retrievers can struggle with the nuanced language of specialised domains, while the high accuracy of in-domain models is often achieved at prohibitive computational costs. In this work, we aim to address this trade-off by developing and evaluating a two-stage retrieval architecture that combines a lightweight ModernBERT bidirectional encoder for efficient initial candidate retrieval with a ColBERTv2 late-interaction model for fine-grained re-ranking. We conduct comprehensive evaluations of our retriever module performance and RAG system performance in the biomedical context, fine-tuning the IR module using 10k question-passage pairs from PubMedQA. Our analysis of the retriever module confirmed the positive impact of the ColBERT re-ranker, which improved Recall@3 by up to 4.2 percentage points compared to its retrieve-only counterpart. When integrated into the biomedical RAG, our IR module leads to a state-of-the-art average accuracy of 0.4448 on the five tasks of the MIRAGE question-answering benchmark, outperforming strong baselines such as MedCPT (0.4436). Our ablation studies reveal that this performance is critically dependent on a joint fine-tuning process that aligns the retriever and re-ranker; otherwise, the re-ranker might degrade the performance.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)
- Europe > Switzerland (0.04)
Provable Post-Training Quantization: Theoretical Analysis of OPTQ and Qronos
Zhang, Haoyu, Zhang, Shihao, Colbert, Ian, Saab, Rayan
Post-training quantization (PTQ) has become a crucial tool for reducing the memory and compute costs of modern deep neural networks, including large language models (LLMs). Among PTQ algorithms, the OPTQ framework, also known as GPTQ, has emerged as a leading method due to its computational efficiency and strong empirical performance. Despite its widespread adoption, however, OPTQ lacks rigorous quantitative theoretical guarantees. This paper presents the first quantitative error bounds for both deterministic and stochastic variants of OPTQ, as well as for Qronos, a recent related state-of-the-art PTQ algorithm. We analyze how OPTQ's iterative procedure induces quantization error and derive non-asymptotic 2-norm error bounds that depend explicitly on the calibration data and a regularization parameter that OPTQ uses. Our analysis provides theoretical justification for several practical design choices, including the widely used heuristic of ordering features by decreasing norm, as well as guidance for selecting the regularization parameter. For the stochastic variant, we establish stronger infinity-norm error bounds, which enable control over the required quantization alphabet and are particularly useful for downstream layers and nonlinearities. Finally, we extend our analysis to Qronos, providing new theoretical bounds, for both its deterministic and stochastic variants, that help explain its empirical advantages.
Extracting Document Relations from Search Corpus by Marginalizing over User Queries
Iwamoto, Yuki, Tsunoda, Kaoru, Kaneiwa, Ken
Understanding relationships between documents in large-scale corpora is essential for knowledge discovery and information organization. However, existing approaches rely heavily on manual annotation or predefined relationship taxonomies. We propose EDR-MQ (Extracting Document Relations by Marginalizing over User Queries), a novel framework that discovers document relationships through query marginalization. EDR-MQ is based on the insight that strongly related documents often co-occur in results across diverse user queries, enabling us to estimate joint probabilities between document pairs by marginalizing over a collection of queries. To enable this query marginalization approach, we develop Multiply Conditioned Retrieval-Augmented Generation (MC-RAG), which employs conditional retrieval where subsequent document retrievals depend on previously retrieved content. By observing co-occurrence patterns across diverse queries, EDR-MQ estimates joint probabilities between document pairs without requiring labeled training data or predefined taxonomies. Experimental results show that our query marginalization approach successfully identifies meaningful document relationships, revealing topical clusters, evidence chains, and cross-domain connections that are not apparent through traditional similarity-based methods. Our query-driven framework offers a practical approach to document organization that adapts to different user perspectives and information needs.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Asia > China (0.04)
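The EDR-MQ abstract above describes estimating joint probabilities between document pairs from their co-occurrence across many queries. The Python sketch below shows one very simplified way to read that idea; it is not the paper's MC-RAG estimator. It assumes every query is equally likely and that each query simply yields a set of retrieved document ids, so the joint probability of a pair is approximated by the fraction of queries whose results contain both documents. The function name `cooccurrence_probabilities` and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_probabilities(results_per_query):
    """Estimate P(d_i, d_j) as the share of queries retrieving both documents.

    Assumption: queries are weighted uniformly, which stands in for
    marginalizing over a query distribution.
    """
    pair_counts = Counter()
    for docs in results_per_query:
        # Count each unordered pair at most once per query.
        for pair in combinations(sorted(set(docs)), 2):
            pair_counts[pair] += 1
    n = len(results_per_query)
    return {pair: count / n for pair, count in pair_counts.items()}

# Toy retrieval results for four queries (document ids are made up).
results = [
    ["d1", "d2"],
    ["d2", "d1", "d3"],
    ["d3", "d4"],
    ["d1", "d2"],
]
probs = cooccurrence_probabilities(results)
```

Here `d1` and `d2` appear together in three of the four result lists, so their estimated joint probability is the highest of any pair.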
A model and package for German ColBERT
The original ColBERT model was proposed by Khattab and Zaharia [8], introducing the MaxSim scoring function based on token-level interactions. The model was trained using a softmax cross-entropy loss over triplets derived from the MS MARCO Ranking [1] and TREC Complex Answer Retrieval (TREC CAR) [5] datasets, leveraging the English BERT model [4] as its backbone encoder. The ColBERT MaxSim score can be interpreted as a substitute for the BM25 score used in full-text search; consequently, there are similarities between the ColBERT retrieval method and BM25-based full-text search. This will be discussed in detail in Section 2. ColBERT is flexible and can be used as a first-stage retriever or a reranker. The ColBERT score is computed at the token-similarity level and can be applied in contexts where keyword similarities are significant. A ColBERT model was also trained for Japanese [3], where the author also discussed different strategies for choosing hard negatives using the multilingual E5 embedding model and BM25.
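To make the MaxSim scoring mentioned above concrete, here is a small NumPy sketch: each query token is matched to its most similar document token, and those maxima are summed. The embeddings are toy values rather than the output of any real ColBERT checkpoint, and the function name `maxsim_score` is an assumption for this example.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Sum, over query tokens, of the max cosine similarity to any doc token."""
    # L2-normalize rows so the dot product equals cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T  # shape: (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy 2-dimensional token embeddings.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
doc_b = np.array([[1.0, 1.0]])

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

Document `doc_a` contains an exact directional match for each query token, so it scores higher than `doc_b`, whose single token only partially matches both.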
Towards Lossless Token Pruning in Late-Interaction Retrieval Models
Zong, Yuxuan, Piwowarski, Benjamin
Late interaction neural IR models like ColBERT offer a competitive effectiveness-efficiency trade-off across many benchmarks. However, they require a huge memory space to store the contextual representation for all the document tokens. Some works have proposed using either heuristics or statistical-based techniques to prune tokens from each document. This, however, doesn't guarantee that the removed tokens have no impact on the retrieval score. Our work uses a principled approach to define how to prune tokens without impacting the score between a document and a query. We introduce three regularization losses that induce a solution with high pruning ratios, as well as two pruning strategies. We study them experimentally (in-domain and out-of-domain), showing that we can preserve ColBERT's performance while using only 30% of the tokens.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.28)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > New York > New York County > New York City (0.05)
You've Seen This Bizarre Video Phenomenon. There's a Reason It's Suddenly Everywhere.
Imagine yourself strapped to a chair with your head held in place by some device. The only thing you're free to move is your eyes. You hear something to your left; you'd want to turn your head left to look, or at least take a sidelong glance. Your brain sends the necessary impulses to your muscles--only you can't move.
- Leisure & Entertainment (1.00)
- Media > Film (0.48)
Retrieval Augmented Spelling Correction for E-Commerce Applications
Guo, Xuan, Patki, Rohit, Everaert, Dante, Potts, Christopher
The rapid introduction of new brand names into everyday language poses a unique challenge for e-commerce spelling correction services, which must distinguish genuine misspellings from novel brand names that use unconventional spelling. We seek to address this challenge via Retrieval Augmented Generation (RAG). In this approach, product names are retrieved from a catalog and incorporated into the context used by a large language model (LLM) that has been fine-tuned to do contextual spelling correction. Through quantitative evaluation and qualitative error analyses, we find improvements in spelling correction utilizing the RAG framework beyond a stand-alone LLM. We also demonstrate the value of additional finetuning of the LLM to incorporate retrieved context.
- North America > United States > New York > New York County > New York City (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- North America > United States > Maryland > Baltimore (0.04)