X-VILA: Cross-Modality Alignment for Large Language Model