Rethinking Multi-Modal Alignment in Video Question Answering from Feature and Sample Perspectives
