Aligning Audio-Visual Joint Representations with an Agentic Workflow
Neural Information Processing Systems
Visual content and its accompanying audio signals naturally form a joint representation that can improve audio-visual (AV) applications. While prior studies have developed various AV representation learning frameworks, the importance of AV data alignment for achieving high-quality representations is usually underestimated. We observe that an audio signal may contain background noise interference, and that the audio and video streams may be out of sync.
May-29-2025, 21:43:32 GMT
- Genre:
- Research Report > Experimental Study (1.00)
- Workflow (1.00)
- Industry:
- Education (0.67)
- Information Technology > Security & Privacy (0.46)
- Technology:
- Information Technology
- Artificial Intelligence
- Machine Learning > Neural Networks
- Deep Learning (1.00)
- Natural Language
- Chatbot (0.93)
- Large Language Model (1.00)
- Vision (1.00)
- Data Science (1.00)