Multi-layer Cross-Attention is Provably Optimal for Multi-modal In-context Learning
