–Neural Information Processing Systems
We follow most of the settings in Uni-Perceiver [93]: cross-entropy loss with label smoothing of 0.1 is adopted for all tasks, and the negative samples for retrieval tasks come only from the local batch on the current GPU. We also apply the same data augmentation techniques as Uni-Perceiver [93] to the image and video modalities to avoid overfitting. A few settings are changed to improve the training stability of the original Uni-Perceiver. Following [102], a uniform drop rate for stochastic depth is used across all encoder layers, and this rate is adapted to the model size. Additionally, LayerScale [101] is used to facilitate the convergence of Transformer training, and the same initialization of $10^{-3}$ is used for all models for simplicity.
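As a rough illustration of the settings above, the following pure-Python sketch shows a label-smoothed cross-entropy, its use over in-batch negatives for retrieval, and a residual branch combining LayerScale with stochastic depth. All function and parameter names are hypothetical and are not taken from the Uni-Perceiver code. The smoothing convention (spreading the 0.1 mass uniformly over all classes, including the target), the per-sample rescaling in stochastic depth, and the example drop rate are assumptions made for illustration only.

```python
import math
import random


def label_smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy of one example against a smoothed target distribution."""
    num_classes = len(logits)
    # Numerically stable log-softmax.
    max_logit = max(logits)
    log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    log_probs = [x - log_norm for x in logits]
    loss = 0.0
    for k, log_p in enumerate(log_probs):
        # Assumed convention: smoothing mass is spread over all classes.
        q = smoothing / num_classes
        if k == target:
            q += 1.0 - smoothing
        loss -= q * log_p
    return loss


def in_batch_retrieval_loss(similarity, smoothing=0.1):
    """Mean loss over a local batch; row i's positive is column i,
    and every other column in the same batch acts as a negative."""
    losses = [
        label_smoothed_cross_entropy(row, i, smoothing)
        for i, row in enumerate(similarity)
    ]
    return sum(losses) / len(losses)


def residual_branch(x, branch_out, layer_scale, drop_rate, training, rng):
    """Return x + layer_scale * branch_out, randomly skipping the branch
    during training (stochastic depth with a single, uniform drop rate)."""
    if training and drop_rate > 0.0:
        if rng.random() < drop_rate:
            return list(x)  # branch dropped for this sample
        keep_prob = 1.0 - drop_rate
        branch_out = [b / keep_prob for b in branch_out]  # keep the expectation
    return [xi + g * bi for xi, g, bi in zip(x, layer_scale, branch_out)]


# Usage: LayerScale weights start at 1e-3, so the branch contributes little early on.
hidden = [1.0, 2.0]
gamma = [1e-3] * len(hidden)
out = residual_branch(hidden, [10.0, 20.0], gamma, drop_rate=0.1,
                      training=False, rng=random.Random(0))
loss = in_batch_retrieval_loss([[4.0, 0.5], [0.2, 3.0]])
```

Initializing the LayerScale weights to a small value keeps each residual block close to the identity at the start of training, which is the usual motivation for using it to ease convergence.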
Feb-7-2026, 13:05:57 GMT