EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance

Jaeyeon Kim, Minjeon Jeon, Jaeyoon Jung, Sang Hoon Woo, Jinjoo Lee

Sep-2-2024–arXiv.org Artificial Intelligence

In this work, we aim to analyze and optimize the EnCLAP framework, a state-of-the-art model in automated audio captioning. We investigate the impact of modifying the acoustic encoder components, explore pretraining with different dataset scales, and study the effectiveness of a reranking scheme. Through extensive experimentation and quantitative analysis of generated captions, we develop EnCLAP++, an enhanced version that significantly surpasses the original.

Although EnCLAP exhibits impressive performance, the study by Kim et al. lacks sufficient experimental evaluation for determining the optimal models for the model components. Notably, the authors do not investigate alternative sequence-level acoustic features beyond CLAP. Furthermore, for timestep-level acoustic features, while they demonstrate that discrete codec input outperforms continuous input, their analysis is restricted to a single setup using EnCodec, without exploring other options or configurations. Additionally, Kim et al. acknowledge the issue of overfitting in ...
