Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization
–Neural Information Processing Systems
Self-supervised pre-training recently demonstrates success on large-scale multi-modal data, and state-of-the-art contrastive learning methods often enforce the feature consistency from cross-modality inputs, such as video/audio or video/text pairs.
Neural Information Processing Systems
Aug-19-2025, 16:22:12 GMT
- Country:
- North America > United States > Texas
- Brazos County > College Station (0.04)
- Travis County > Austin (0.04)
- North America > United States > Texas
- Technology: