TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation (Supplementary Materials)

Apr-25-2026, 21:40:27 GMT–Neural Information Processing Systems

Recall that for the n-way multiple choice setting, n − 1 choices are negative pairs and only one pair is positive. Accordingly, for n = 4, 3 distractors are sampled, each with an incorrect pose embedding, while the 4th choice contains the matching pose embedding for the given vision and audio embeddings. In other words, the fusion embedding consisting of the vision and audio embeddings is kept as the anchor, while negatives are sampled from the pose embeddings only. Of the 3 negative pose embeddings, 2 are "easy" negatives, sampled randomly from the entire training set, while the last one is a "hard" negative, sampled randomly from a pool of 25 embeddings corresponding to the 25 nearest neighbours of the anchor vision embedding. In the n = 3 case, 2 hard negatives and no easy negatives are used, with the same nearest neighbour sampling method based on the anchor embedding.
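The sampling scheme described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_pose_choices`, the use of Euclidean distance for the nearest-neighbour search, and the exclusion of already chosen items from the easy-negative candidates are all hypothetical choices that the text does not specify. The function returns indices into the training set, where each index identifies the pose embedding to pair with the anchor.

```python
import math
import random


def sample_pose_choices(vision_emb, anchor_idx, n_way=4, pool_size=25, rng=None):
    """Return training-set indices of the pose embeddings for one n-way question.

    Negatives come first and the matching (positive) pose index is last.
    """
    rng = rng or random.Random()

    # Split of negatives as described in the text: 2 easy + 1 hard for n = 4,
    # 2 hard + 0 easy for n = 3.
    if n_way == 4:
        n_hard, n_easy = 1, 2
    elif n_way == 3:
        n_hard, n_easy = 2, 0
    else:
        raise ValueError("only n_way = 3 or n_way = 4 is described in the text")

    # Hard-negative pool: the pool_size nearest neighbours of the anchor vision
    # embedding. Euclidean distance is an assumption; the anchor is excluded.
    anchor = vision_emb[anchor_idx]
    others = [i for i in range(len(vision_emb)) if i != anchor_idx]
    others.sort(key=lambda i: math.dist(vision_emb[i], anchor))
    pool = others[:pool_size]
    hard = rng.sample(pool, n_hard)

    # Easy negatives: uniform over the rest of the training set. Excluding the
    # hard negatives already drawn is an assumption made to avoid duplicates.
    candidates = [i for i in others if i not in hard]
    easy = rng.sample(candidates, n_easy)

    return easy + hard + [anchor_idx]
```

For example, `sample_pose_choices(vision_emb, anchor_idx=7, n_way=3)` returns two indices drawn from the 25 nearest neighbours of item 7, followed by 7 itself. In practice the choice order would presumably be shuffled before being shown to the model; the positive is kept last here only to mirror the wording of the text.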



