VoxRep: Enhancing 3D Spatial Understanding in 2D Vision-Language Models via Voxel Representation