TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation (Supplementary Materials)

Feb-8-2026, 16:04:36 GMT–Neural Information Processing Systems

Figure 1 shows the training scheme for the cross-modal retrieval module. Each multiple-choice option consists of the correct vision+audio fusion embedding along with a pose embedding.

Experimental results when one of the modalities is erased.

Type of masking                                           SDR     SIR     SAR
Masking is used for the visual modality                   7.82    14.39   10.65
Masking is used for the pose modality                     12.06   18.34   14.17
15% random masking for both visual and pose modalities    12.34   18.76   14.37

In this paper, sound separation is our primary task. Therefore, we do not consider masking for the audio modality.
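The masking settings compared here can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`mask_tokens`, `mask_modalities`), the list-of-feature-vectors token layout, and the choice of zeroing out masked tokens are all assumptions made for illustration. Setting a modality's probability to 1.0 erases it entirely, 0.15 corresponds to the 15% random masking setting, and the audio input is never masked.

```python
import random


def mask_tokens(tokens, mask_prob, rng, mask_value=0.0):
    """Mask each token independently with probability mask_prob.

    tokens is a list of feature vectors (lists of floats). Masked tokens are
    replaced by vectors filled with mask_value (an assumed convention).
    Returns (masked_tokens, flags), where flags[i] is True if token i was masked.
    """
    masked, flags = [], []
    for tok in tokens:
        hit = rng.random() < mask_prob
        flags.append(hit)
        masked.append([mask_value] * len(tok) if hit else list(tok))
    return masked, flags


def mask_modalities(visual, pose, visual_prob=0.15, pose_prob=0.15, seed=None):
    """Apply masking to the visual and pose token sequences only.

    Audio is intentionally not an argument: it is left unmasked because
    sound separation is the primary task.
    """
    rng = random.Random(seed)
    v, v_flags = mask_tokens(visual, visual_prob, rng)
    p, p_flags = mask_tokens(pose, pose_prob, rng)
    return {"visual": v, "pose": p, "visual_mask": v_flags, "pose_mask": p_flags}


if __name__ == "__main__":
    # Hypothetical toy inputs: 10 visual tokens and 10 pose tokens.
    visual = [[1.0, 2.0] for _ in range(10)]
    pose = [[3.0] for _ in range(10)]

    # 15% random masking for both modalities.
    both = mask_modalities(visual, pose, seed=0)
    print(sum(both["visual_mask"]), "visual tokens masked")

    # Erase the visual modality entirely, keep pose intact.
    no_visual = mask_modalities(visual, pose, visual_prob=1.0, pose_prob=0.0, seed=0)
    print(all(no_visual["visual_mask"]), any(no_visual["pose_mask"]))
```

In this sketch the three table rows map to `(visual_prob=1.0, pose_prob=0.0)`, `(visual_prob=0.0, pose_prob=1.0)`, and `(visual_prob=0.15, pose_prob=0.15)` respectively, under the assumption that "erased" means every token of that modality is masked.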



Country:
- North America > Canada
  - Ontario > Toronto (0.15)
  - British Columbia (0.05)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (0.49)
  - Machine Learning (0.48)
