Supplementary Materials for TVLT: Textless Vision-Language Transformer

Neural Information Processing Systems 

Language Input            CMU-MOSEI (A2)
                          HT100M    YTT-S
Audio                     75.3      76.8
Text (ASR-SpeechBrain)    76.5      76.6
Text (ASR-Google)         77.1      77.8
Text (GT Transcripts)     78.9      79.1

Table 2 shows the results of TVLT on CMU-MOSEI sentiment analysis with the following different inputs: audio, ASR-based text, and ground-truth text transcriptions. ASR-Google and ASR-SpeechBrain refer to Google Cloud API and SpeechBrain, respectively (see main paper Sec.

Example transcripts with sentiment annotations:
"He is under house arrest and his mother takes away his Xboxes and TVs is sort of a little bit of additional punishment."  0.0 -1.0 0.0 0.0
"And then last year we had 260 something come out to the dance."  1.0 2.0 2.0 1.0

We use the following configurations: (1) We set a single speech event to have a duration within [0.3s, 1.2s], so that an event is likely to cover a single word. If the silence gap is too large, it is usually a stop between two words.

Specifically, we construct a 4-layer transformer language model that attends to TVLT encoder outputs via cross-attention, and we jointly train the encoder and decoder.
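The speech-event heuristic above (events of 0.3s-1.2s, split at long silences) can be sketched in plain Python. This is an illustrative reconstruction, not the paper's code: the function name, the frame hop of 0.02s, and the 0.2s silence-gap threshold are all assumptions for the example.

```python
# Hypothetical sketch of silence-based speech-event segmentation over
# frame-level voice-activity flags. All names and thresholds other than
# the [0.3s, 1.2s] event duration are illustrative assumptions.

def speech_events(vad, hop=0.02, min_dur=0.3, max_dur=1.2, max_gap=0.2):
    """Merge voiced frames into events with duration in [min_dur, max_dur].

    vad: list of booleans, one per `hop`-second frame (True = voiced).
    max_gap: a silence longer than this is treated as a stop between words.
    Returns a list of (start_time, end_time) tuples in seconds.
    """
    events, start, last_voiced = [], None, None
    for i, voiced in enumerate(vad):
        t = i * hop
        if not voiced:
            continue
        if start is None:
            start, last_voiced = t, t
        elif t - last_voiced > max_gap or (t + hop) - start > max_dur:
            # Long silence gap, or event would exceed max_dur: close it,
            # keeping it only if it is long enough to cover a word.
            if last_voiced + hop - start >= min_dur:
                events.append((start, last_voiced + hop))
            start = t
            last_voiced = t
        else:
            last_voiced = t
    if start is not None and last_voiced + hop - start >= min_dur:
        events.append((start, last_voiced + hop))
    return events
```

For example, 20 voiced frames, a 0.3s silence, then 25 more voiced frames yield two events of 0.4s and 0.5s, each plausibly covering one word.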
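The cross-attention by which the 4-layer text decoder reads the TVLT encoder outputs can be illustrated with a minimal single-head numpy sketch: queries come from decoder states, while keys and values come from encoder outputs. Function names, shapes, and the single-head simplification are assumptions for exposition, not the paper's implementation.

```python
# Minimal single-head cross-attention sketch: decoder states query the
# TVLT encoder outputs. Weight matrices Wq/Wk/Wv are illustrative.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_states, enc_outputs, Wq, Wk, Wv):
    """dec_states: (T_dec, d), enc_outputs: (T_enc, d) -> (T_dec, d)."""
    q = dec_states @ Wq        # queries from the decoder side
    k = enc_outputs @ Wk       # keys from the encoder outputs
    v = enc_outputs @ Wv       # values from the encoder outputs
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v
```

In a full decoder layer this would sit between masked self-attention and the feed-forward block, and gradients flow through `enc_outputs`, which is what allows the encoder and decoder to be trained jointly.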