BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
