World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering