Learning to Tokenize for Generative Retrieval

Weiwei Sun

Neural Information Processing Systems 

As a new paradigm in information retrieval, generative retrieval directly generates a ranked list of document identifiers (docids) for a given query using generative language models (LMs). How to assign each document a unique docid (referred to as document tokenization) is a critical problem, because it determines whether the generative retrieval model can precisely retrieve any document by simply decoding its docid. Most existing methods adopt rule-based tokenization, which is ad hoc and does not generalize well.
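To make "retrieving a document by decoding its docid" concrete, the following is a minimal sketch of decoding restricted to valid docids. It is not the method proposed in the paper. It assumes docids are fixed-length tuples of integer tokens, and `toy_score` is a hypothetical stand-in for the per-token log-probability a trained LM would assign given the query. The names `build_trie` and `constrained_beam_search` are likewise illustrative.

```python
from typing import Callable, Dict, List, Sequence, Tuple

DocId = Tuple[int, ...]


def build_trie(docids: Sequence[DocId]) -> Dict:
    """Build a nested-dict prefix trie over docid token sequences."""
    root: Dict = {}
    for docid in docids:
        node = root
        for tok in docid:
            node = node.setdefault(tok, {})
    return root


def constrained_beam_search(
    trie: Dict,
    score_fn: Callable[[DocId, int], float],
    length: int,
    beam_size: int,
) -> List[Tuple[DocId, float]]:
    """Beam search that only expands prefixes present in the trie,
    so every returned sequence is the docid of a real document."""
    beams: List[Tuple[DocId, float]] = [((), 0.0)]
    for _ in range(length):
        candidates: List[Tuple[DocId, float]] = []
        for prefix, score in beams:
            # Walk the trie to the node for this prefix.
            node = trie
            for tok in prefix:
                node = node[tok]
            # Extend only with tokens that continue some valid docid.
            for tok in node:
                candidates.append((prefix + (tok,), score + score_fn(prefix, tok)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams


# Assumption: three documents with hand-assigned two-token docids.
docids = [(0, 1), (0, 2), (1, 0)]

# Assumption: fixed toy log-probabilities standing in for an LM
# conditioned on a query.
_TOY_TABLE = {
    (): {0: -0.1, 1: -2.0},
    (0,): {1: -1.5, 2: -0.2},
    (1,): {0: -0.05},
}


def toy_score(prefix: DocId, tok: int) -> float:
    return _TOY_TABLE[prefix][tok]


ranked = constrained_beam_search(build_trie(docids), toy_score, length=2, beam_size=3)
for docid, score in ranked:
    print(docid, round(score, 2))
```

In this sketch the ranked list is simply the beam sorted by cumulative score. How the docid tokens are assigned to documents in the first place (the tokenization problem the abstract raises) is left as hand-picked input here.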
