Craswell, Nick
Conformer-Kernel with Query Term Independence at TREC 2020 Deep Learning Track
Mitra, Bhaskar, Hofstätter, Sebastian, Zamani, Hamed, Craswell, Nick
The Conformer-Kernel (CK) model [Mitra et al., 2020] builds upon the Transformer-Kernel (TK) [Hofstätter et al., 2019] architecture, which demonstrated competitive performance relative to BERT-based [Devlin et al., 2019] ranking methods at the TREC 2019 Deep Learning track [Craswell et al., 2020b], but at a fraction of the compute and GPU memory cost. Notwithstanding these strong results, the TK model suffers from two clear deficiencies. Firstly, because the TK model employs stacked Transformers for query and document encoding, it is challenging to incorporate long body text into the model, as the GPU memory requirement of the Transformers' self-attention layers grows quadratically with input sequence length. For example, increasing the maximum input sequence length by a factor of 4, from 128 to 512, requires 16× more GPU memory for each self-attention layer in the model. Considering that documents can contain thousands of terms, this limits the model to inspecting only a subset of the document text, which may have negative implications such as poorer retrieval quality and under-retrieval of longer documents [Hofstätter et al., 2020].
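As a rough illustration of this quadratic scaling (a sketch for this summary only, not code from the CK paper; the head count and float width below are arbitrary assumptions), the following estimates the size of a single layer's self-attention score matrix for inputs of 128 versus 512 terms:

```python
def attention_scores_bytes(seq_len: int, num_heads: int = 8, bytes_per_float: int = 4) -> int:
    # One layer's attention score matrix has shape [num_heads, seq_len, seq_len],
    # so its memory footprint grows quadratically with the input sequence length.
    return num_heads * seq_len * seq_len * bytes_per_float

short_input = attention_scores_bytes(128)  # 128-term input
long_input = attention_scores_bytes(512)   # 512-term input (4x longer)

print(short_input, long_input, long_input / short_input)
# 524288 8388608 16.0  -> a 4x longer input needs 16x the attention memory
```

The exact byte counts depend on the implementation, but the 16× ratio between the two settings follows directly from the quadratic dependence on sequence length described above.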