A More Results

Neural Information Processing Systems 

The overall performance in MM-NIAH is shown in Tab. 2, which is obtained by averaging the performance across the six tasks in MM-NIAH. We also provide the performance of each task in Tab. The performance for each context length range is obtained by averaging the accuracy of that context length range across different needle depths. For samples containing multiple needles, we average the depths of all needles and use the result as the needle depth of the sample.

A.1 More findings

In addition to the findings discussed in Section 4.2, we provide more findings here.

Placing questions before context does NOT improve model performance. As shown in Figure 1, all models perform poorly in understanding image needles, which we attribute to the fact that models struggle to remember the details of each image in a long multimodal document. An intuitive remedy is to place the question before the context, which allows the model to see the options first and then read the document. However, as illustrated by the error cases (see the first row in Figure 1), this approach causes models like InternVL-1.5 to fail to follow the instructions in the questions. In fact, we observe that this phenomenon holds for all MLLMs, resulting in near-zero performance. Therefore, we do not report quantitative results and instead analyze this issue qualitatively.
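As an illustration of the aggregation described at the beginning of this appendix, the following is a minimal sketch and not the benchmark's actual evaluation code. The record fields (`context_length`, `needle_depths`, `correct`), the bin edges, and the function names are all assumptions introduced for this example. The sketch assigns each sample a depth equal to the mean of its needle depths, computes accuracy per (context length, depth) cell, averages the cells across depths for each context length range, and finally averages across tasks.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import mean


def bin_index(value, edges):
    """Index of the bin [edges[i], edges[i+1]) that contains value (hypothetical binning)."""
    i = bisect_right(edges, value) - 1
    # Clamp so that out-of-range values and value == edges[-1] land in a valid bin.
    return min(max(i, 0), len(edges) - 2)


def accuracy_by_length(samples, length_edges, depth_edges):
    """Per context-length range: accuracy averaged across needle-depth bins."""
    cells = defaultdict(list)
    for s in samples:
        # A multi-needle sample uses the mean of its needle depths as its depth.
        depth = mean(s["needle_depths"])
        key = (bin_index(s["context_length"], length_edges),
               bin_index(depth, depth_edges))
        cells[key].append(1.0 if s["correct"] else 0.0)

    per_length = defaultdict(list)
    for (length_bin, _depth_bin), hits in cells.items():
        per_length[length_bin].append(mean(hits))  # accuracy of one cell
    return {lb: mean(accs) for lb, accs in sorted(per_length.items())}


def overall_performance(per_task):
    """Average the per-length accuracies across tasks (six tasks in MM-NIAH)."""
    shared = set.intersection(*(set(t) for t in per_task.values()))
    return {lb: mean(t[lb] for t in per_task.values()) for lb in sorted(shared)}
```

Averaging cell accuracies rather than pooling all samples gives each depth bin equal weight within a context length range, which is one reading of "averaging the accuracy of that context length range across different needle depths"; pooling samples directly would be an alternative if depth bins are unevenly populated.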