From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech