Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech