Can LLMs reason over extended multilingual contexts? Towards long-context evaluation beyond retrieval and haystacks