TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation (Supplementary Materials)
Mengyu Yang, Leonid Sigal
University of British Columbia
–Neural Information Processing Systems
Recall that in the n-way multiple-choice setting, n - 1 choices are negative pairs and only one is positive. Accordingly, for n = 4, three distractors are sampled, each with an incorrect pose embedding, while the fourth choice contains the pose embedding that matches the given vision and audio embeddings. In other words, the fusion embedding formed from the vision and audio embeddings is kept as the anchor, and negatives are sampled from the pose embeddings only. Of the three negative pose embeddings, two are "easy" negatives, sampled randomly from the entire training set, and the third is a "hard" negative, sampled randomly from a pool of 25 embeddings corresponding to the 25 nearest neighbours of the anchor vision embedding. In the n = 3 case, two hard negatives and no easy negatives are used, drawn with the same nearest-neighbour sampling method based on the anchor embedding.
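The sampling scheme described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `sample_pose_choices`, the array layout (one row per training item), the use of Euclidean distance for the nearest-neighbour search, the exclusion of the anchor from its own pool, and sampling without replacement are all assumptions that the text does not specify.

```python
import numpy as np


def sample_pose_choices(anchor_idx, vision_emb, pose_emb, n_way=4, pool_size=25, rng=None):
    """Return (choices, indices): n_way pose embeddings with the positive last.

    Assumptions (not stated in the text): Euclidean distance for the
    nearest-neighbour search, the anchor is excluded from its own pool,
    and negatives are drawn without replacement.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_items = vision_emb.shape[0]

    # Hard-negative pool: items whose vision embeddings lie closest to the anchor's.
    dists = np.linalg.norm(vision_emb - vision_emb[anchor_idx], axis=1)
    dists[anchor_idx] = np.inf  # the anchor must never be its own negative
    pool = np.argsort(dists)[: min(pool_size, num_items - 1)]

    # Split between hard and easy negatives, as described for n = 4 and n = 3.
    if n_way == 4:
        n_hard, n_easy = 1, 2
    elif n_way == 3:
        n_hard, n_easy = 2, 0
    else:
        raise ValueError("only n_way = 3 or n_way = 4 is described in the text")

    hard = rng.choice(pool, size=n_hard, replace=False)

    # Easy negatives: uniform over the whole set, minus the anchor and the
    # hard negatives already chosen (the exclusion is an assumption).
    excluded = np.concatenate(([anchor_idx], hard))
    candidates = np.setdiff1d(np.arange(num_items), excluded)
    easy = rng.choice(candidates, size=n_easy, replace=False)

    # Order: easy negatives, hard negatives, then the matching (positive) pose.
    indices = np.concatenate((easy, hard, [anchor_idx])).astype(int)
    return pose_emb[indices], indices
```

Calling `sample_pose_choices(7, vision_emb, pose_emb, n_way=4)` on arrays with at least 26 rows returns four pose embeddings whose last row is the positive for item 7. In practice the choices would likely be shuffled before being paired with the fusion embedding, so that the position of the positive carries no signal.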
May-28-2025, 21:34:08 GMT