Hudson, G Thomas
RAR-b: Reasoning as Retrieval Benchmark
Xiao, Chenghao, Hudson, G Thomas, Moubayed, Noura Al
Semantic textual similarity (STS) and information retrieval (IR) tasks have been the two major avenues for recording the progress of embedding models in the past few years. Under the emerging Retrieval-augmented Generation (RAG) paradigm, we envision the need to evaluate next-level language understanding abilities of embedding models, and take a conscious look at the reasoning abilities stored in them. Addressing this, we pose the question: Can retrievers solve reasoning problems? By transforming reasoning tasks into retrieval tasks, we find that, without being specifically trained for reasoning-level language understanding, current state-of-the-art retriever models may still be far from competent in assisting LLMs, especially on reasoning-intensive tasks. Moreover, although trained to be aware of instructions, instruction-aware IR models often perform better without instructions at inference time on reasoning tasks, exposing an overlooked retriever-LLM behavioral gap for the research community to align. However, recent decoder-based embedding models show great promise in narrowing this gap, highlighting the pathway for embedding models to achieve reasoning-level language understanding. We also show that, although current off-the-shelf re-ranker models fail on these tasks, injecting reasoning abilities into them through fine-tuning still appears easier than doing so for bi-encoders, and we achieve state-of-the-art performance across all tasks by fine-tuning a re-ranking model. We release the Reasoning as Retrieval Benchmark (RAR-b), a holistic suite of tasks and settings to evaluate the reasoning abilities stored in retriever models. RAR-b is available at https://github.com/gowitheflow-1998/RAR-b.
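A minimal sketch of the general idea of transforming a reasoning task into a retrieval task, assuming an off-the-shelf bi-encoder and a toy multiple-choice instance; the model name and example are illustrative and not the exact RAR-b pipeline:

```python
# Sketch: cast a multiple-choice reasoning problem as retrieval by using the
# question as the query and the candidate answers as the corpus, then ranking
# candidates by embedding similarity. Model and example are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any off-the-shelf bi-encoder

question = "If all squares are rectangles and this shape is a square, is it a rectangle?"
candidates = [
    "Yes, every square is a rectangle, so the shape must be a rectangle.",
    "No, squares and rectangles are unrelated shapes.",
    "Only if its sides have different lengths.",
]

q_emb = model.encode(question, convert_to_tensor=True)      # query embedding
c_embs = model.encode(candidates, convert_to_tensor=True)   # candidate embeddings
scores = util.cos_sim(q_emb, c_embs)[0]                     # one score per candidate

# The retriever "answers" by returning its top-ranked candidate; accuracy over
# many such instances measures how much reasoning the embeddings support.
best = int(scores.argmax())
print(candidates[best], float(scores[best]))
```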
Pixel Sentence Representation Learning
Xiao, Chenghao, Huang, Zhuoxu, Chen, Danlu, Hudson, G Thomas, Li, Yizhi, Duan, Haoran, Lin, Chenghua, Fu, Jie, Han, Jungong, Moubayed, Noura Al
Vanilla language models have long been known to produce subpar sentence-level representations (Reimers and Gurevych, 2019; Wang et al., 2023), even worse than averaging static word embeddings (Pennington et al., 2014); i.e., sentence representations obtained by pooling the sub-word embeddings encoded by language models do not closely reflect the relative semantics of sentences. Encouraged by the remarkable success of visual representation learning facilitated by unsupervised contrastive learning (Chen et al., 2020; He et al., 2020), efforts in NLP have been made to leverage unsupervised contrastive learning to recover sentence-level encoding abilities from these models (Fang et al., 2020; Wu et al., 2020; Gao et al., 2021; Meng et al., 2021). However, translating the advancements in visual representation learning to learning sentence-level textual semantics presents unique challenges: a single augmentation (Wu et al., 2020; Meng et al., 2021) might alter the meaning of a sentence, calling into question the validity of the augmented sentence as a positive pair. Such attempts are primarily bottlenecked by the discreteness of subword units introduced by tokenization (Sennrich et al., 2016), which impedes the creation of continuous unsupervised semantic pairs that preserve semantics under small perturbations of the input. Therefore, the most recognized unsupervised sentence representation learning method in NLP applies two dropout masks to the identical input to obtain two representations, which serve as positive pairs in contrastive learning (Gao et al., 2021). We argue that using identical inputs confines the method of Gao et al. (2021) to essentially only a way of improving uniformity (Wang and Isola, 2020) by distancing negative examples that are not identical to an instance itself, lacking a mechanism for pulling semantically similar yet non-identical pairs closer (i.e., alignment).
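A minimal sketch of the dropout-based unsupervised contrastive objective of Gao et al. (2021) referred to above: the same batch is encoded twice and dropout alone produces the two views of each sentence, which form positive pairs under an InfoNCE loss. The encoder choice, pooling, and temperature below are illustrative assumptions:

```python
# Sketch of SimCSE-style unsupervised contrastive learning (Gao et al., 2021):
# two forward passes over identical inputs, differing only in dropout masks.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.train()  # keep dropout active so the two passes differ

sentences = ["A man is playing a guitar.", "The weather is nice today."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

def embed(inputs):
    # [CLS] pooling; dropout makes each call a slightly different view.
    return encoder(**inputs).last_hidden_state[:, 0]

z1, z2 = embed(batch), embed(batch)  # two dropout-augmented views of the same input
sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / 0.05

# InfoNCE: each sentence's second view is its positive, all other sentences
# in the batch act as negatives.
labels = torch.arange(len(sentences))
loss = F.cross_entropy(sim, labels)
loss.backward()
```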
Length is a Curse and a Blessing for Document-level Semantics
Xiao, Chenghao, Li, Yizhi, Hudson, G Thomas, Lin, Chenghua, Moubayed, Noura Al
In recent years, contrastive learning (CL) has been extensively used to recover sentence- and document-level encoding capability from pre-trained language models. In this work, we question the length generalizability of CL-based models, i.e., their vulnerability to length-induced semantic shift. We verify not only that length vulnerability is a significant yet overlooked research gap, but also that unsupervised CL methods can be devised solely relying on the semantic signal provided by document length. We first derive the theoretical foundations underlying length attacks, showing that elongating a document intensifies the already high intra-document similarity brought by CL. Moreover, we find that the isotropy promised by CL is highly dependent on the length range of the text exposed during training. Inspired by these findings, we introduce a simple yet universal document representation learning framework, LA(SER)$^{3}$: length-agnostic self-reference for semantically robust sentence representation learning, which achieves state-of-the-art unsupervised performance on the standard information retrieval benchmark.
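A minimal sketch of the length-attack idea probed above: elongate a document by repeating it, then check how much its similarity to an unrelated document drifts even though the content is unchanged. The encoder, the documents, and the naive repetition scheme are illustrative assumptions, not the paper's exact protocol:

```python
# Sketch: probe length-induced semantic shift by elongating a document and
# measuring how its similarity to an unrelated document changes.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

doc = "The central bank raised interest rates to curb inflation."
unrelated = "The local football team won the championship last night."

for n_repeats in (1, 4, 16):
    elongated = " ".join([doc] * n_repeats)  # naive elongation by repetition
    e_doc, e_other = model.encode([elongated, unrelated], convert_to_tensor=True)
    sim = util.cos_sim(e_doc, e_other).item()
    print(f"repeats={n_repeats:<3d} cos(elongated doc, unrelated doc)={sim:.3f}")

# If the score drifts noticeably as n_repeats grows despite unchanged content,
# the encoder is vulnerable to length-induced semantic shift.
```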
MuLD: The Multitask Long Document Benchmark
Hudson, G Thomas, Moubayed, Noura Al
The impressive progress in NLP techniques has been driven by the development of multi-task benchmarks such as GLUE and SuperGLUE. While these benchmarks focus on tasks involving one or two input sentences, there has been exciting work on designing efficient techniques for processing much longer inputs. In this paper, we present MuLD: a new long document benchmark consisting only of documents over 10,000 tokens. By modifying existing NLP tasks, we create a diverse benchmark which requires models to successfully model long-term dependencies in the text. We evaluate how existing models perform and find that our benchmark is much more challenging than their 'short document' equivalents. Furthermore, by evaluating both regular and efficient transformers, we show that models with increased context length are better able to solve the tasks presented, suggesting that future improvements in these models are vital for solving similar long document problems. We release the data and code for baselines to encourage further research on efficient NLP models.
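A minimal sketch of the benchmark's headline selection criterion (keeping only documents longer than 10,000 tokens); the tokenizer and the toy corpus are illustrative assumptions and this is not the MuLD construction code:

```python
# Sketch: filter a corpus down to documents longer than 10,000 tokens,
# the selection criterion described above. Tokenizer and corpus are toy choices.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
MIN_TOKENS = 10_000

def is_long_document(text: str) -> bool:
    # Count tokens without truncation (the model-max-length warning can be ignored here).
    return len(tokenizer(text, truncation=False)["input_ids"]) > MIN_TOKENS

corpus = ["A short news article about the weather.", "word " * 12_000]  # toy stand-ins
long_docs = [doc for doc in corpus if is_long_document(doc)]
print(len(long_docs))  # -> 1
```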