Evaluating MLLMs with Multimodal Multi-image Reasoning Benchmark