Lissard: Long and Simple Sequential Reasoning Datasets

Mirelle Bueno, Roberto Lotufo, Rodrigo Nogueira

arXiv.org Artificial Intelligence 

The efficacy of language models, particularly on reasoning tasks, degrades significantly when input texts are longer than those seen during training [19, 2, 15]. This phenomenon, referred to in the literature as "Length Generalization" or "Length Extrapolation" [25, 30], also affects models based on the Transformer architecture more broadly [20, 16, 8, 32]. Notably, even Large Language Models (LLMs), known for their strong performance across a wide range of tasks and domains, are not immune to this problem [2, 5]. Recent research has attempted to address this challenge through modifications to positional embeddings [25, 6, 7, 19, 13] or through prompting strategies such as scratchpad [23] and chain-of-thought reasoning [28]. Nevertheless, there remains a lack of datasets specifically designed for the systematic evaluation of this problem.