LongTail-Swap: benchmarking language models' abilities on rare words