mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
