A Appendix

[Figure: Latency vs. batch size]


[Figure: Perplexity vs. FLOP count of MIM compared to left-to-right baselines across model sizes.]

[Figure: Perplexity vs. training time of MIM compared to left-to-right baselines across model sizes.]

A.1 Model training details

To evaluate the effectiveness of "Meet in the Middle" (MIM) pre-training against left-to-right autoregressive and "Fill in the Middle" (FIM) pre-training baselines, we adopt the standard transformer-based autoregressive language models used in previous works [BMR+20]. For our bidirectional language models, we run the forward model and the backward model in parallel within a single decoder-only architecture, leveraging bidirectional context explicitly during pre-training. We use the sentinel token l2r to indicate that generation comes from the forward model and the sentinel token r2l to indicate that it comes from the backward model. During pre-training we observed that MIM, FIM, and autoregressive left-to-right pre-training have similar training wall-clock times, because the forward model and the backward model are executed in parallel in MIM pre-training.
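For concreteness, the sketch below shows one way the two directions can share a single decoder-only forward pass; it is a minimal illustration, not the implementation used in this work. The small PyTorch backbone, the model dimensions, and the sentinel ids L2R_ID / R2L_ID are assumptions made for the example. It illustrates why wall-clock time stays close to the left-to-right baseline: the l2r sequence and the reversed r2l sequence are stacked into one batch, so one causal pass serves both the forward and the backward model.

import torch
import torch.nn as nn

# Assumed sentinel token ids for l2r / r2l; in practice these come from the tokenizer.
L2R_ID, R2L_ID = 1, 2
VOCAB, D_MODEL = 100, 64

embed = nn.Embedding(VOCAB, D_MODEL)
layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)  # causal mask below makes this decoder-only
lm_head = nn.Linear(D_MODEL, VOCAB)

def mim_forward(tokens):
    # tokens: (batch, seq) token ids in the original left-to-right order.
    batch = tokens.size(0)
    l2r = torch.cat([torch.full((batch, 1), L2R_ID), tokens], dim=1)
    # The backward model reads the sequence right-to-left; reversing it lets the
    # same causal decoder predict it autoregressively from the other end.
    r2l = torch.cat([torch.full((batch, 1), R2L_ID), tokens.flip(dims=[1])], dim=1)
    both = torch.cat([l2r, r2l], dim=0)  # both directions stacked into one batch
    seq = both.size(1)
    causal = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
    h = backbone(embed(both), mask=causal)  # a single forward pass covers both directions
    return lm_head(h)                        # (2 * batch, seq + 1, vocab)

logits = mim_forward(torch.randint(3, VOCAB, (4, 16)))
print(logits.shape)  # torch.Size([8, 17, 100])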