Aligning Audio-Visual Joint Representations with an Agentic Workflow
