Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising