Introducing Visual Scenes and Reasoning: A More Realistic Benchmark for Spoken Language Understanding