Multi-layer Cross-Attention is Provably Optimal for Multi-modal In-context Learning