TVLT: Textless Vision-Language Transformer
Neural Information Processing Systems
The challenge lies in the difference between text and acoustic signals; text is discrete and dense in information, while acoustic signals are continuous and sparse in information [26; 7]. Therefore, modality-specific architectures have been used to model data from different modalities.
Feb-8-2026, 12:17:46 GMT