Cross-Modal Alignment via Variational Copula Modelling