Towards Diverse and Efficient Audio Captioning via Diffusion Models
Xu, Manjie, Li, Chenxing, Tu, Xinyi, Ren, Yong, Fu, Ruibo, Liang, Wei, Yu, Dong
Audio captioning involves detecting sound events and describing acoustic scenes using natural language. The community has witnessed remarkable achievements in audio captioning through Autoregressive (AR) models. Traditional encoder-decoder architectures [1]-[5] use audio encoders to extract audio features and leverage language decoders to generate coherent descriptions. More recently, Large Language Model (LLM)-based multimodal models [6]-[8] have emerged, driven by their superior captioning quality and diversity.

Initially, text descriptions are tokenized and transformed into discrete token embeddings, which are subsequently mapped into the continuous latent space in the form of word vectors using an embedding function. The audio content is converted into a Mel spectrogram and encoded through a pre-trained audio encoder, followed by projection into the feature space via a lightweight projection module. The forward process adds noise to the text latent. In the backward process, the diffusion model predicts the noise added at each step, with the projected audio features injected as cross-attention conditioning.
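A minimal sketch of this pipeline in PyTorch, assuming a standard DDPM-style noise schedule. Every name and dimension below (CaptionDenoiser, the embedding and projection layers, latent size 256) is an illustrative placeholder chosen for this sketch, not the paper's released implementation:

```python
import torch
import torch.nn as nn

class CaptionDenoiser(nn.Module):
    """Predicts the noise added to the text latent, conditioned on
    projected audio features via cross-attention (hypothetical module)."""
    def __init__(self, dim=256, heads=4, max_steps=1000):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))
        self.t_embed = nn.Embedding(max_steps, dim)  # diffusion timestep embedding

    def forward(self, z_t, t, audio_feats):
        h = z_t + self.t_embed(t).unsqueeze(1)
        h = h + self.self_attn(h, h, h)[0]
        # Projected audio features enter as keys/values of cross-attention.
        h = h + self.cross_attn(h, audio_feats, audio_feats)[0]
        return self.ff(h)  # predicted noise, same shape as z_t

# Toy shapes: batch, caption length, latent dim, diffusion steps.
B, L, D, T_steps = 2, 20, 256, 1000
embed = nn.Embedding(5000, D)  # maps discrete tokens to continuous word vectors
proj = nn.Linear(128, D)       # lightweight projection of audio-encoder output

tokens = torch.randint(0, 5000, (B, L))       # tokenized caption
z0 = embed(tokens)                            # continuous text latent
audio_feats = proj(torch.randn(B, 50, 128))   # stand-in for pre-trained encoder output

# Forward process: add noise to the text latent at a random step t.
betas = torch.linspace(1e-4, 0.02, T_steps)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)
t = torch.randint(0, T_steps, (B,))
noise = torch.randn_like(z0)
a = alpha_bar[t].view(B, 1, 1)
z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * noise

# Backward process: train the model to predict the added noise.
model = CaptionDenoiser(D)
loss = nn.functional.mse_loss(model(z_t, t, audio_feats), noise)
loss.backward()
```

At inference, one would instead start from pure noise and iteratively denoise the text latent under the audio condition, then map the recovered word vectors back to discrete tokens.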
Sep-14-2024