Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model