Unveiling Unexpected Training Data in Internet Video

Communications of the ACM 

During training, the network is optimized with a loss function defined as the squared L2 error between the clean spectrogram and the predicted spectrogram. At inference time, our separation model can be applied to arbitrarily long segments of video and to varying numbers of speakers. The latter is achieved either by directly training the model with multiple input visual streams (one per speaker), or simply by feeding the visual features of the desired speaker to the visual stream. For full details about the architecture and training process, see our full paper.15
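
As a rough illustration of the loss described above, the following is a minimal sketch in plain Python, not the authors' implementation. The function name `squared_l2_loss`, the representation of a spectrogram as a list of rows (time frames) of real or complex bin values, and the choice to sum rather than average over bins are all assumptions made for this example.

```python
def squared_l2_loss(clean, predicted):
    """Squared L2 error between two spectrograms of equal shape.

    Each spectrogram is a list of rows (time frames); each row is a list
    of real or complex frequency-bin values. Summing over all bins is an
    assumption here; a mean reduction would differ only by a constant.
    """
    if len(clean) != len(predicted):
        raise ValueError("spectrograms must have the same number of frames")
    total = 0.0
    for clean_row, predicted_row in zip(clean, predicted):
        if len(clean_row) != len(predicted_row):
            raise ValueError("frames must have the same number of bins")
        for c, p in zip(clean_row, predicted_row):
            # abs() gives the magnitude, so complex bins are handled too.
            total += abs(p - c) ** 2
    return total


# Hypothetical 2-frame, 3-bin spectrograms, for illustration only.
clean = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
predicted = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
loss = squared_l2_loss(clean, predicted)
```

In an actual training setup this quantity would be computed by a deep learning framework on batched tensors so that gradients can flow back through the network; the loop above only spells out the arithmetic.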
