Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?