Latent-Domain Predictive Neural Speech Coding
Xue Jiang, Xiulian Peng, Huaying Xue, Yuan Zhang, Yan Lu
This article has been accepted for publication in IEEE/ACM Transactions on Audio, Speech and Language Processing. This is the author's version, which has not been fully edited; content may change prior to final publication.

Abstract: Neural audio/speech coding has recently demonstrated its capability to deliver high quality at much lower bitrates than traditional methods. However, existing neural audio/speech codecs employ either acoustic features or learned blind features with a convolutional neural network for encoding, which leaves temporal redundancies within the encoded features. This paper introduces latent-domain predictive coding for low-latency neural speech coding, yielding the proposed TF-Codec. Specifically, the extracted features are encoded conditioned on a prediction from past quantized latent frames, so that temporal correlations are further removed. Moreover, we introduce a learnable compression on the time-frequency input to adaptively adjust the attention paid to main frequencies and details at different bitrates. A differentiable vector quantization scheme based on distance-to-soft mapping and Gumbel-Softmax is proposed to better model the latent distributions under a rate constraint. Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus at 9 kbps, and TF-Codec at 3 kbps outperforms EVS at 9.6 kbps. Numerous studies demonstrate the effectiveness of these techniques.
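To make the quantization idea concrete, here is a minimal sketch of a distance-to-soft mapping with Gumbel-Softmax over a small codebook, in plain Python. This is an illustrative assumption of how such a scheme can work, not the paper's actual implementation: the function name `gumbel_softmax_vq`, the temperature value, and the toy codebook are all hypothetical, and a real codec would run this over learned latent frames inside a training loop.

```python
import math
import random

def gumbel_softmax_vq(z, codebook, tau=1.0, noise=True, rng=random):
    """Distance-to-soft mapping with Gumbel-Softmax over a codebook (sketch).

    z: latent frame as a list of floats; codebook: list of codevectors.
    Returns (probs, soft): soft assignment probabilities over the codebook
    and the probability-weighted mixture of codevectors.
    """
    # Negative squared Euclidean distance to each codevector, scaled by the
    # temperature tau, serves as the logit: closer codes get larger logits.
    logits = [-sum((zi - ci) ** 2 for zi, ci in zip(z, c)) / tau
              for c in codebook]
    if noise:
        # Add Gumbel(0, 1) noise so sampling a code stays differentiable
        # through the softmax (the reparameterization trick).
        logits = [l - math.log(-math.log(rng.random() + 1e-20) + 1e-20)
                  for l in logits]
    # Numerically stable softmax turns logits into soft assignments.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Soft code: probability-weighted sum of codevectors. As tau -> 0 this
    # collapses onto the nearest codevector, recovering hard VQ.
    dim = len(z)
    soft = [sum(p * c[d] for p, c in zip(probs, codebook)) for d in range(dim)]
    return probs, soft
```

With a small temperature and noise disabled, the soft assignment concentrates on the nearest codevector, which is what makes the scheme a differentiable relaxation of ordinary vector quantization.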
arXiv.org Artificial Intelligence
May-25-2023
- Country:
- Asia (0.28)
- North America (0.28)
- Genre:
- Research Report (0.64)
- Industry:
- Telecommunications (0.46)