MARPLE: A Benchmark for Long-Horizon Inference
Emily Jin