Predicting Density of States via Multi-modal Transformer