Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts