TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation (Supplementary Materials)
Neural Information Processing Systems
Figure 1 shows a diagram of the training scheme for the cross-modal retrieval module. Each multiple-choice candidate consists of the correct vision+audio fusion embedding along with a pose embedding.

Table: Experimental results when one of the modalities is erased.

Type of Masking                                          SDR     SIR     SAR
Masking is used for visual modality                      7.82    14.39   10.65
Masking is used for pose modality                        12.06   18.34   14.17
15% random masking for both visual and pose modality     12.34   18.76   14.37

In this paper, we use sound separation as our primary task. Therefore, we do not consider masking for the audio modality.
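The masking settings compared above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `random_mask`, the choice to replace masked tokens with zeros, and the list-of-vectors token layout are all assumptions made for illustration. A ratio of 0.15 corresponds to the 15% random masking row, and a ratio of 1.0 corresponds to erasing a whole modality.

```python
import random


def random_mask(tokens, mask_ratio=0.15, seed=None):
    """Zero out a random subset of token vectors (hypothetical helper).

    tokens: list of equal-length feature vectors (lists of floats).
    mask_ratio: fraction of tokens to mask; 1.0 erases the whole modality.
    Returns (masked_tokens, flags), where flags[i] is True if token i was masked.
    """
    rng = random.Random(seed)
    num_tokens = len(tokens)
    num_masked = round(mask_ratio * num_tokens)
    # Sample distinct token positions to mask.
    masked_idx = set(rng.sample(range(num_tokens), num_masked))
    flags = [i in masked_idx for i in range(num_tokens)]
    # Assumption: a masked token is replaced by an all-zero vector.
    masked = [
        [0.0] * len(tok) if flag else list(tok)
        for tok, flag in zip(tokens, flags)
    ]
    return masked, flags


# Toy inputs: 20 tokens of dimension 4 per modality.
visual = [[1.0] * 4 for _ in range(20)]
pose = [[2.0] * 4 for _ in range(20)]

# 15% random masking applied to both visual and pose tokens.
masked_visual, visual_flags = random_mask(visual, 0.15, seed=0)
masked_pose, pose_flags = random_mask(pose, 0.15, seed=1)

# Erasing an entire modality (here: visual).
erased_visual, erased_flags = random_mask(visual, 1.0, seed=0)
```

The audio stream is left untouched in this sketch, in line with the statement that masking is not considered for the audio modality.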
Feb-8-2026, 16:04:36 GMT
- Country:
- North America > Canada
- British Columbia (0.05)
- Ontario > Toronto (0.15)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (0.48)
- Vision (0.49)