Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation