Supplementary Material: Learning Representations from Audio-Visual Spatial Alignment
Neural Information Processing Systems
These are transformer networks with base dimension 512 and expansion ratio 4. All models were trained using the Adam optimizer. Pre-training hyper-parameters are summarized in Table 2. For semantic segmentation, we used a lightweight FPN segmentation head. Semantic segmentation predictions are then computed from the features at all levels. This shows that the use of spatial negatives is complementary to AVC.
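As a rough illustration of the transformer sizing mentioned above, the sketch below derives the feed-forward width implied by a base dimension of 512 and an expansion ratio of 4. It assumes the common convention that the expansion ratio multiplies the base dimension to give the hidden width of each feed-forward block; the source text does not spell this out. The class name `TransformerConfig` and its methods are hypothetical and are not taken from the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerConfig:
    """Hypothetical container for the two sizes quoted in the text."""

    base_dim: int = 512        # base (model) dimension, from the text
    expansion_ratio: int = 4   # expansion ratio, from the text

    @property
    def ffn_hidden_dim(self) -> int:
        # Assumption: hidden width of the feed-forward block is
        # base_dim * expansion_ratio.
        return self.base_dim * self.expansion_ratio

    def ffn_param_count(self) -> int:
        # Parameters of one two-layer feed-forward block with biases:
        # (base_dim -> hidden) followed by (hidden -> base_dim).
        d, h = self.base_dim, self.ffn_hidden_dim
        return (d * h + h) + (h * d + d)


cfg = TransformerConfig()
print(cfg.ffn_hidden_dim)
print(cfg.ffn_param_count())
```

Under this assumption, each feed-forward block expands 512-dimensional features to 2048 dimensions before projecting back to 512.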
May-28-2025, 22:34:11 GMT