Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning

Chien, Chung-Ming, Tjandra, Andros, Vyas, Apoorv, Le, Matt, Shi, Bowen, Hsu, Wei-Ning

arXiv.org Artificial Intelligence 

In this work, we propose Voicebox Adapter, a novel approach that integrates fine-grained conditions into a pre-trained Voicebox speech generation model using a cross-attention module. To ensure a smooth integration of newly added modules with pre-trained ones, we explore various efficient fine-tuning approaches. Our experiment shows that the LoRA with bias-tuning configuration yields the best performance, enhancing controllability without compromising speech quality. Across three fine-grained conditional generation tasks, we demonstrate the effectiveness and resource efficiency of Voicebox Adapter. Follow-up experiments further highlight the robustness of Voicebox Adapter across diverse data setups.

Our contributions are as follows: (1) we propose Voicebox Adapter, which augments Voicebox, a pre-trained speech generation model, with fine-grained controllability; (2) we explore different efficient fine-tuning methods to bridge the gap between pre-trained parameters and new fine-grained conditioning modules; (3) we show that Voicebox Adapter can generalize across various fine-grained conditions, attaining performance comparable to that achieved by fine-tuning the entire model with significantly fewer fine-tuned parameters; (4) we conduct experiments using varying amounts of fine-tuning data and different hidden dimension sizes, analyzing the performance of ...
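To make the "LoRA with bias-tuning" idea concrete, the following is a minimal sketch of a single linear layer in which the pre-trained weight stays frozen while a low-rank update and the bias are the only trainable parts. It is an illustration under stated assumptions, not the configuration used in the paper: the class name `LoRALinear`, the rank, the `alpha` scaling, the initialization, and the toy dimensions are all hypothetical, and the text above does not specify which layers of Voicebox are adapted.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update and trainable bias.

    Hypothetical illustration: rank, alpha, and initialization are assumptions.
    """

    def __init__(self, W, b, rank=2, alpha=2.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W        # frozen pre-trained weight, shape (d_out, d_in)
        self.b = list(b)  # bias, treated as trainable under bias-tuning
        # A gets a small random init and B starts at zero, so the low-rank
        # update B @ A is zero before any fine-tuning step.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                   # frozen path: W x
        delta = matvec(self.B, matvec(self.A, x))  # low-rank path: B (A x)
        return [y + self.scale * d + bi for y, d, bi in zip(base, delta, self.b)]

    def trainable_params(self):
        rank, d_in, d_out = len(self.A), len(self.A[0]), len(self.B)
        return rank * (d_in + d_out) + d_out       # A, B, and the bias

    def total_params(self):
        return len(self.W) * len(self.W[0]) + len(self.b)


# Toy usage with made-up numbers.
W = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 3.0], [1.0, 1.0, 1.0, 1.0]]
b = [0.5, -0.5, 0.0]
layer = LoRALinear(W, b, rank=1)
out = layer.forward([1.0, 2.0, 3.0, 4.0])
print(out, layer.trainable_params(), layer.total_params())
```

Because `B` starts at zero, the layer initially reproduces the pre-trained output, which is one common way to add new trainable parameters without disturbing a pre-trained model at the start of fine-tuning. The trainable count grows as rank × (d_in + d_out) rather than d_in × d_out, so the saving becomes substantial only at realistic layer sizes, not at the toy dimensions used here.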
