Can Memory-Augmented Language Models Generalize on Reasoning-in-a-Haystack Tasks?