Towards Diverse and Efficient Audio Captioning via Diffusion Models

Manjie Xu, Chenxing Li, Xinyi Tu, Yong Ren, Ruibo Fu, Wei Liang, Dong Yu

arXiv.org Artificial Intelligence 

Audio captioning involves detecting sound events and describing acoustic scenes using natural language. The community has witnessed remarkable achievements in audio captioning through Autoregressive (AR) models. Traditional encoder-decoder architectures [1]-[5] use audio encoders to extract audio features and leverage language decoders to generate coherent descriptions. More recently, Large Language Model (LLM)-based multimodal models [6]-[8] have emerged, driven by their superior captioning quality and diversity, thanks to the …

Initially, text descriptions are tokenized and transformed into discrete token embeddings, which are subsequently mapped into the continuous latent space in the form of word vectors using an embedding function. The audio content is converted into a Mel spectrogram and encoded through a pre-trained audio encoder, followed by projection into the feature space via a lightweight projection module. The forward process adds noise to the text latent. In the backward process, the diffusion model predicts the noise added at each step, with the projected audio features supplied as the cross-attention condition.
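To make the pipeline above concrete, below is a minimal PyTorch sketch of a single training step. The module names, dimensions, linear noise schedule, and the use of a Transformer decoder layer as the denoiser are illustrative assumptions rather than the authors' released implementation; only the overall flow (token embedding into a continuous latent, projection of pre-trained audio-encoder features, noising of the text latent, and noise prediction conditioned through cross-attention) follows the description.

```python
# Minimal sketch of the described training step; all names and sizes are assumptions.
import torch
import torch.nn as nn

class LatentDiffusionCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, latent_dim=256, audio_feat_dim=512, num_steps=1000):
        super().__init__()
        self.num_steps = num_steps
        # Embedding function: discrete token ids -> continuous word-vector latents.
        self.embed = nn.Embedding(vocab_size, latent_dim)
        # Lightweight projection of (pre-trained) audio-encoder features into the
        # conditioning space attended to by cross-attention.
        self.audio_proj = nn.Linear(audio_feat_dim, latent_dim)
        # Denoiser: Transformer decoder layers whose cross-attention attends to the
        # projected audio features and predicts the noise added at step t (assumption).
        self.denoiser = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=latent_dim, nhead=4, batch_first=True),
            num_layers=4,
        )
        self.time_embed = nn.Embedding(num_steps, latent_dim)
        # Simple linear noise schedule (assumption).
        betas = torch.linspace(1e-4, 0.02, num_steps)
        self.register_buffer("alpha_bar", torch.cumprod(1.0 - betas, dim=0))

    def forward(self, token_ids, audio_feats):
        """token_ids: (B, L) caption tokens; audio_feats: (B, T, audio_feat_dim)
        features from a pre-trained audio encoder applied to the Mel spectrogram."""
        x0 = self.embed(token_ids)                      # continuous text latent
        cond = self.audio_proj(audio_feats)             # projected audio condition
        t = torch.randint(0, self.num_steps, (x0.size(0),), device=x0.device)
        noise = torch.randn_like(x0)
        a = self.alpha_bar[t].view(-1, 1, 1)
        x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise  # forward process: noise the latent
        h = x_t + self.time_embed(t).unsqueeze(1)       # inject the timestep
        pred_noise = self.denoiser(tgt=h, memory=cond)  # backward process: predict the noise
        return nn.functional.mse_loss(pred_noise, noise)

# Usage with a dummy batch: 2 captions of length 12, 20 frames of audio features.
model = LatentDiffusionCaptioner()
loss = model(torch.randint(0, 10000, (2, 12)), torch.randn(2, 20, 512))
loss.backward()
```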
