Reviews: Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
–Neural Information Processing Systems
This work offers a clearly defined extension to TTS systems that allows building good-quality voices (even for speakers unseen during training of either component) from a few adaptation data points. The authors do not offer any truly new theoretical extension to the "building blocks" of their system, which is based on known components proposed elsewhere (the speaker encoder, synthesizer, and vocoder are based on previously published models). However, their combination is clever and well-engineered, and allows the building blocks to be independently estimated, on different corpora, in either an unsupervised (speaker encoder, where audio transcripts are not needed) or supervised (speech synthesizer) manner. This allows for greater flexibility while reducing the requirement for large amounts of transcribed data for each component.

Good points:
- clear, fair and convincing experiments
- trained and evaluated on public corpora, which greatly increases reproducibility (a portion of the experiments is carried out on proprietary data, but all have equivalent experiments constrained to publicly available data)

Weak points:
- it would make sense to investigate additional adaptability when more data per speaker is available; it seems the system cannot easily leverage more than about 10 seconds of reference speech

Summary: this is a very good study on building multi-speaker TTS systems from small amounts of target-speaker data.
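To make the reviewed architecture concrete, the three independently trained components compose as a simple pipeline: a speaker encoder maps a few seconds of reference audio to a fixed-size embedding, the synthesizer conditions on that embedding and the input text to produce mel spectrogram frames, and the vocoder turns those frames into a waveform. The sketch below uses toy stand-in functions (the names, dimensions, and hop size are illustrative assumptions, not the paper's implementation) purely to show the interface between the stages.

```python
import zlib

import numpy as np


def speaker_encoder(reference_audio: np.ndarray, dim: int = 256) -> np.ndarray:
    """Stand-in for the speaker encoder: reference speech -> fixed-size embedding.

    Trained without transcripts in the reviewed system; here we just derive a
    deterministic pseudo-embedding from the audio bytes for illustration.
    """
    seed = zlib.crc32(reference_audio.tobytes())
    e = np.random.default_rng(seed).standard_normal(dim)
    return e / np.linalg.norm(e)  # speaker embeddings are L2-normalized


def synthesizer(text: str, speaker_embedding: np.ndarray) -> np.ndarray:
    """Stand-in for the synthesizer: text + speaker embedding -> mel frames.

    Returns one (toy) 80-bin mel frame per input character.
    """
    n_frames, n_mels = max(1, len(text)), 80
    return np.tile(speaker_embedding[:n_mels], (n_frames, 1))


def vocoder(mel: np.ndarray, hop: int = 200) -> np.ndarray:
    """Stand-in for the vocoder: mel frames -> waveform samples."""
    return np.zeros(mel.shape[0] * hop)


# Compose the pipeline: a few seconds of (fake) reference audio from an
# unseen speaker is enough to condition synthesis of arbitrary text.
ref = np.random.default_rng(0).standard_normal(16000 * 5)  # ~5 s at 16 kHz
embedding = speaker_encoder(ref)
wav = vocoder(synthesizer("Hello world", embedding))
```

The key property the review highlights is visible in the interface: only the embedding crosses from the encoder to the synthesizer, so each stage can be trained on a different corpus, with transcripts needed only for the synthesizer's training data.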