Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech