MolVision: Molecular Property Prediction with Vision Language Models (Supplementary Material)

Neural Information Processing Systems 

The ViT-L/14 encoder processes images into visual tokens, which the LLaMA-2-7B decoder converts into text.
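As a rough illustration of this data flow, the sketch below tracks only tensor shapes and loads no model. The patch size of 14 follows from the name ViT-L/14, and the widths 1024 and 4096 are those of the standard ViT-L and LLaMA-2-7B configurations. The input resolutions (224 and 336) and the projection from encoder width to decoder width are assumptions made for illustration and are not stated in the text above. The names `num_visual_tokens` and `token_shapes` are hypothetical helpers.

```python
# Illustrative shape bookkeeping only; no model weights are loaded.
VIT_L_PATCH = 14        # patch size implied by the name "ViT-L/14"
VIT_L_WIDTH = 1024      # hidden width of the standard ViT-L configuration
LLAMA2_7B_WIDTH = 4096  # hidden width of LLaMA-2-7B


def num_visual_tokens(image_size: int, patch_size: int = VIT_L_PATCH) -> int:
    """Number of patch tokens for a square image (class token not counted)."""
    if image_size <= 0 or image_size % patch_size != 0:
        raise ValueError("image_size must be a positive multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def token_shapes(image_size: int) -> dict:
    """Shapes of the visual tokens before and after an assumed projection."""
    n = num_visual_tokens(image_size)
    return {
        # One vector of encoder width per image patch.
        "encoder_output": (n, VIT_L_WIDTH),
        # Assumption: a learned projection maps each visual token to the
        # decoder's width so it can sit alongside text token embeddings.
        "decoder_input": (n, LLAMA2_7B_WIDTH),
    }


if __name__ == "__main__":
    for size in (224, 336):  # assumed resolutions, for illustration only
        print(size, token_shapes(size))
```

Under these assumptions, the number of visual tokens grows quadratically with image side length, so the choice of input resolution directly affects how much of the decoder's context the image occupies.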