Aligning Audio-Visual Joint Representations with an Agentic Workflow