Towards Diverse and Efficient Audio Captioning via Diffusion Models
Xu, Manjie, Li, Chenxing, Tu, Xinyi, Ren, Yong, Fu, Ruibo, Liang, Wei, Yu, Dong
Audio captioning involves detecting sound events and describing acoustic scenes using natural language. The community has witnessed remarkable achievements in audio captioning through Autoregressive (AR) models. Traditional encoder-decoder architectures [1]-[5] use audio encoders to extract audio features and leverage language decoders to generate coherent descriptions. More recently, Large Language Model (LLM)-based multimodal models [6]-[8] have emerged, driven by their superior captioning quality and diversity.

Initially, text descriptions are tokenized and transformed into discrete token embeddings, which are subsequently mapped into the continuous latent space in the form of word vectors using an embedding function. The audio content is converted into a Mel spectrogram and encoded through a pre-trained audio encoder, followed by projection into the feature space via a lightweight projection module. The forward process adds noise to the text latent. In the backward process, the diffusion model predicts the noise added at each step, with the projected audio features injected as cross-attention conditioning.
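A minimal sketch of this pipeline in PyTorch, assuming a standard DDPM-style noise schedule. Every name and dimension below (CaptionDenoiser, the embedding and projection layers, latent size 256) is an illustrative placeholder chosen for this sketch, not the paper's released implementation:

```python
import torch
import torch.nn as nn

class CaptionDenoiser(nn.Module):
    """Predicts the noise added to the text latent, conditioned on
    projected audio features via cross-attention (hypothetical module)."""
    def __init__(self, dim=256, heads=4, max_steps=1000):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))
        self.t_embed = nn.Embedding(max_steps, dim)  # diffusion timestep embedding

    def forward(self, z_t, t, audio_feats):
        h = z_t + self.t_embed(t).unsqueeze(1)
        h = h + self.self_attn(h, h, h)[0]
        # Projected audio features enter as keys/values of cross-attention.
        h = h + self.cross_attn(h, audio_feats, audio_feats)[0]
        return self.ff(h)  # predicted noise, same shape as z_t

# Toy shapes: batch, caption length, latent dim, diffusion steps.
B, L, D, T_steps = 2, 20, 256, 1000
embed = nn.Embedding(5000, D)  # maps discrete tokens to continuous word vectors
proj = nn.Linear(128, D)       # lightweight projection of audio-encoder output

tokens = torch.randint(0, 5000, (B, L))       # tokenized caption
z0 = embed(tokens)                            # continuous text latent
audio_feats = proj(torch.randn(B, 50, 128))   # stand-in for pre-trained encoder output

# Forward process: add noise to the text latent at a random step t.
betas = torch.linspace(1e-4, 0.02, T_steps)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)
t = torch.randint(0, T_steps, (B,))
noise = torch.randn_like(z0)
a = alpha_bar[t].view(B, 1, 1)
z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * noise

# Backward process: train the model to predict the added noise.
model = CaptionDenoiser(D)
loss = nn.functional.mse_loss(model(z_t, t, audio_feats), noise)
loss.backward()
```

At inference, one would instead start from pure noise and iteratively denoise the text latent under the audio condition, then map the recovered word vectors back to discrete tokens.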
Sep-14-2024