MARPLE: A Benchmark for Long-Horizon Inference
Emily Jin
Neural Information Processing Systems
Reconstructing past events requires reasoning across long time horizons. To figure out what happened, humans draw on prior knowledge about the world and human behavior and integrate insights from various sources of evidence including visual, language, and auditory cues. We introduce MARPLE, a benchmark for evaluating long-horizon inference capabilities using multi-modal evidence. Our benchmark features agents interacting with simulated households, supporting vision, language, and auditory stimuli, as well as procedurally generated environments and agent behaviors. Inspired by classic "whodunit" stories, we ask AI models and human participants to infer which agent caused a change in the environment based on a step-by-step replay of what actually happened.
Jun-2-2025, 14:16:38 GMT
- Country:
  - Europe > United Kingdom
    - England (0.14)
  - North America > United States (0.14)
- Genre:
  - Research Report
    - Experimental Study (0.46)
    - New Finding (0.67)
- Industry:
  - Leisure & Entertainment (0.67)
- Technology: