Supplementary Materials for TVLT: Textless Vision-Language Transformer

Neural Information Processing Systems 

Language Input            CMU-MOSEI (A2)
                          HT100M    YTT-S
Audio                     75.3      76.8
Text (ASR-SpeechBrain)    76.5      76.6
Text (ASR-Google)         77.1      77.8
Text (GT Transcripts)     78.9      79.1

Table 2 shows the results of TVLT on CMU-MOSEI sentiment analysis with the following different inputs: audio, ASR-based text, and ground-truth text transcriptions. ASR-Google and ASR-SpeechBrain refer to Google Cloud API and SpeechBrain, respectively (see main paper Sec.

Example transcripts with sentiment annotations:
"He is under house arrest and his mother takes away his Xboxes and TVs is sort of a little bit of additional punishment."  0.0 -1.0 0.0 0.0
"And then last year we had 260 something come out to the dance."  1.0 2.0 2.0 1.0

We use the following configurations: (1) We set a single speech event to have a duration within [0.3s, 1.2s], so that an event is likely to cover a single word. If the silence gap is too large, it is usually a stop between two words.

Specifically, we construct a 4-layer transformer language model that attends to TVLT encoder outputs via cross-attention, and we jointly train the encoder and decoder.
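The speech-event heuristic above (events of 0.3s-1.2s, split at long silences) can be sketched in plain Python. This is an illustrative reconstruction, not the paper's code: the function name, the frame hop of 0.02s, and the 0.2s silence-gap threshold are all assumptions for the example.

```python
# Hypothetical sketch of silence-based speech-event segmentation over
# frame-level voice-activity flags. All names and thresholds other than
# the [0.3s, 1.2s] event duration are illustrative assumptions.

def speech_events(vad, hop=0.02, min_dur=0.3, max_dur=1.2, max_gap=0.2):
    """Merge voiced frames into events with duration in [min_dur, max_dur].

    vad: list of booleans, one per `hop`-second frame (True = voiced).
    max_gap: a silence longer than this is treated as a stop between words.
    Returns a list of (start_time, end_time) tuples in seconds.
    """
    events, start, last_voiced = [], None, None
    for i, voiced in enumerate(vad):
        t = i * hop
        if not voiced:
            continue
        if start is None:
            start, last_voiced = t, t
        elif t - last_voiced > max_gap or (t + hop) - start > max_dur:
            # Long silence gap, or event would exceed max_dur: close it,
            # keeping it only if it is long enough to cover a word.
            if last_voiced + hop - start >= min_dur:
                events.append((start, last_voiced + hop))
            start = t
            last_voiced = t
        else:
            last_voiced = t
    if start is not None and last_voiced + hop - start >= min_dur:
        events.append((start, last_voiced + hop))
    return events
```

For example, 20 voiced frames, a 0.3s silence, then 25 more voiced frames yield two events of 0.4s and 0.5s, each plausibly covering one word.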
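The cross-attention by which the 4-layer text decoder reads the TVLT encoder outputs can be illustrated with a minimal single-head numpy sketch: queries come from decoder states, while keys and values come from encoder outputs. Function names, shapes, and the single-head simplification are assumptions for exposition, not the paper's implementation.

```python
# Minimal single-head cross-attention sketch: decoder states query the
# TVLT encoder outputs. Weight matrices Wq/Wk/Wv are illustrative.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_states, enc_outputs, Wq, Wk, Wv):
    """dec_states: (T_dec, d), enc_outputs: (T_enc, d) -> (T_dec, d)."""
    q = dec_states @ Wq        # queries from the decoder side
    k = enc_outputs @ Wk       # keys from the encoder outputs
    v = enc_outputs @ Wv       # values from the encoder outputs
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v
```

In a full decoder layer this would sit between masked self-attention and the feed-forward block, and gradients flow through `enc_outputs`, which is what allows the encoder and decoder to be trained jointly.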