Out-of-distribution generalisation is hard: evidence from ARC-like tasks