Unveiling Unexpected Training Data in Internet Video
During training, the squared L2 error between the clean spectrogram and the predicted spectrogram serves as the loss function. At inference time, our separation model can be applied to arbitrarily long segments of video and to varying numbers of speakers. The latter is achieved either by directly training the model with multiple input visual streams (one per speaker) or simply by feeding the visual features of the desired speaker to the visual stream. For full details about the architecture and training process, see our full paper.15
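As a rough illustration of the loss described above, the following minimal sketch computes a squared L2 error between two spectrograms. It is not the authors' implementation: the function name `squared_l2_loss`, the representation of a spectrogram as a list of rows (time frames by frequency bins), and the choice to sum rather than average over bins are all assumptions made for this example. Complex-valued bins are handled by taking the squared magnitude of the difference.

```python
def squared_l2_loss(clean, predicted):
    """Sum of squared magnitudes of the element-wise difference.

    Hypothetical helper: `clean` and `predicted` are assumed to be
    equally shaped lists of rows holding real or complex bin values.
    """
    if len(clean) != len(predicted):
        raise ValueError("spectrograms must have the same number of rows")
    total = 0.0
    for clean_row, pred_row in zip(clean, predicted):
        if len(clean_row) != len(pred_row):
            raise ValueError("spectrogram rows must have the same length")
        for c, p in zip(clean_row, pred_row):
            # abs() gives the magnitude for both real and complex values.
            total += abs(p - c) ** 2
    return total


# Toy example: only one bin differs between the two spectrograms.
clean = [[1.0, 2.0], [3.0, 4.0]]
predicted = [[1.0, 2.0], [3.0, 6.0]]
loss = squared_l2_loss(clean, predicted)
print(loss)
```

In practice a training framework would compute this on tensors and often average over bins or batch elements; the sum is used here only because the text specifies a squared L2 error without mentioning normalization.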
Jul-27-2021, 21:00:03 GMT
- Country:
- North America > United States
- New York > New York County > New York City (0.04)
- Asia
- Middle East > Israel (0.04)
- Japan > Honshū
- Chūbu > Toyama Prefecture > Toyama (0.04)
- Industry:
- Leisure & Entertainment (0.93)
- Media
- Television (0.68)
- Film (0.68)
- Photography (0.46)
- Technology: