A Comprehensive Multi-scale Approach for Speech and Dynamics Synchrony in Talking Head Generation