Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization

Neural Information Processing Systems 

Self-supervised pre-training has recently demonstrated success on large-scale multi-modal data, and state-of-the-art contrastive learning methods often enforce feature consistency across cross-modality inputs, such as video/audio or video/text pairs.
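As a concrete illustration of the cross-modality contrastive objective described above (a generic sketch, not this paper's specific method), a symmetric InfoNCE-style loss treats matched pairs from two modalities as positives and all other in-batch pairs as negatives. The function name and temperature value here are illustrative assumptions:

```python
import numpy as np

def cross_modal_info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE loss between paired embeddings from two
    modalities (e.g. video/audio or video/text). Row i of z_a and
    row i of z_b form a positive pair; all other rows in the batch
    act as negatives."""
    # L2-normalize so the dot product becomes cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature        # (N, N) similarity matrix
    idx = np.arange(len(z_a))                 # diagonal = positive pairs

    def ce(l):
        # Cross-entropy with the diagonal as the target class,
        # computed with the standard max-shift for numerical stability.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # Average the two directions: modality A -> B and B -> A.
    return 0.5 * (ce(logits) + ce(logits.T))
```

Enforcing feature consistency then amounts to minimizing this loss, which pulls matched cross-modal embeddings together and pushes mismatched ones apart.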
