TVLT: Textless Vision-Language Transformer

Neural Information Processing Systems 

The challenge lies in the difference between text and acoustic signals; text is discrete and dense in information, while acoustic signals are continuous and sparse in information [26; 7]. Therefore, modality-specific architectures have been used to model data from different modalities.
