Goto

Collaborating Authors

 concatenation








Invertible DenseNets with Concatenated LipSwish

Neural Information Processing Systems

We introduce Invertible Dense Networks (i-DenseNets), a more parameter efficient extension of Residual Flows. The method relies on an analysis of the Lipschitz continuity of the concatenation in DenseNets, where we enforce invertibility of the network by satisfying the Lipschitz constant. Furthermore, we propose a learnable weighted concatenation, which not only improves the model performance but also indicates the importance of the concatenated weighted representation. Additionally, we introduce the Concatenated LipSwish as activation function, for which we show how to enforce the Lipschitz condition and which boosts performance. The new architecture, i-DenseNet, out-performs Residual Flow and other flow-based models on density estimation evaluated in bits per dimension, where we utilize an equal parameter budget. Moreover, we show that the proposed model out-performs Residual Flows when trained as a hybrid model where the model is both a generative and a discriminative model.




TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation (Supplementary Materials)

Neural Information Processing Systems

Figure 1 shows a diagram of the training scheme for the cross-modal retrieval module. Each multiple choice consists of the correct vision+audio fusion embedding along with a pose embedding. Experimental results if one of the modality is erased. Type of Masking SDR () SIR () SAR () Masking is used for visual modality 7.82 14.39 10.65 Masking is used for pose modality 12.06 18.34 14.17 15% random masking for both visual and pose modality 12.34 18.76 14.37 In this paper, we are using sound separation as our primary task. Therefore, we do not consider masking for the audio modality.