Speaker Diarization of Scripted Audiovisual Content
Yogesh Virkar, Brian Thompson, Rohit Paturi, Sundararajan Srinivasan, Marcello Federico
arXiv.org Artificial Intelligence
The media localization industry usually requires a verbatim script of the final film or TV production in order to create subtitles or dubbing scripts in a foreign language. In particular, the verbatim script (i.e., the as-broadcast script) must be structured into a sequence of dialogue lines, each including time codes, a speaker name, and a transcript. Current speech recognition technology alleviates the transcription step. However, state-of-the-art speaker diarization models still fall short on TV shows for two main reasons: (i) their inability to track a large number of speakers, and (ii) their low accuracy in detecting frequent speaker changes. To mitigate this problem, we present a novel approach that leverages production scripts used during the shooting process to extract pseudo-labeled data for the speaker diarization task. We propose a novel semi-supervised approach and demonstrate improvements of 51.7% relative to two unsupervised baseline models on our metrics on a 66-show test set.
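The dialogue-line structure described in the abstract (time codes, speaker name, transcript) could be represented roughly as follows. This is only an illustrative sketch: the class name `DialogueLine`, its field names, the use of seconds for time codes, and the `HH:MM:SS.mmm` display format are all assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class DialogueLine:
    """One line of an as-broadcast script (hypothetical field names)."""
    start_s: float    # start time code, in seconds from the start of the video
    end_s: float      # end time code, in seconds
    speaker: str      # speaker (character) name
    transcript: str   # verbatim text spoken in this line


def format_timecode(seconds: float) -> str:
    """Render seconds as HH:MM:SS.mmm (an assumed display format)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def render(line: DialogueLine) -> str:
    """Format one dialogue line as '[start - end] SPEAKER: transcript'."""
    start = format_timecode(line.start_s)
    end = format_timecode(line.end_s)
    return f"[{start} - {end}] {line.speaker}: {line.transcript}"
```

A script is then simply a list of such lines in time order. Speaker diarization supplies the `speaker` field and the line boundaries, while speech recognition supplies the `transcript`.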
Aug-4-2023
- Country:
  - Europe > Czechia (0.14)
  - North America > United States (0.14)
- Genre:
  - Research Report (1.00)
- Industry:
  - Leisure & Entertainment (1.00)
  - Media
    - Film (1.00)
    - Television (0.90)
- Technology: