Phantom: Subject-consistent video generation via cross-modal alignment