MolVision: Molecular Property Prediction with Vision Language Models (Supplementary Material)
–Neural Information Processing Systems
The ViT-L/14 encoder converts each input image into a sequence of visual tokens, which the LLaMA-2-7B decoder takes as input to generate text.
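
To make the token flow concrete, the following is a minimal sketch of the shape bookkeeping between the two components, not the authors' implementation. It relies on two well-known facts (ViT-L/14 uses 14x14-pixel patches with a 1024-dimensional width, and LLaMA-2-7B has a 4096-dimensional hidden state) and on two assumptions the text does not state: a 224x224 input resolution, and a projection step that maps visual tokens to the decoder's width. The names `VisionConfig`, `num_visual_tokens`, and `projected_shape` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass

LLM_HIDDEN_DIM = 4096  # hidden size of LLaMA-2-7B


@dataclass(frozen=True)
class VisionConfig:
    image_size: int = 224   # ASSUMPTION: the input resolution is not given in the text
    patch_size: int = 14    # the "/14" in ViT-L/14
    hidden_dim: int = 1024  # width of ViT-L


def num_visual_tokens(cfg: VisionConfig) -> int:
    """Count patch tokens for a square image (class token excluded)."""
    if cfg.image_size % cfg.patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = cfg.image_size // cfg.patch_size
    return patches_per_side * patches_per_side


def projected_shape(cfg: VisionConfig) -> tuple[int, int]:
    """Shape of the visual tokens after mapping them to the decoder width.

    ASSUMPTION: some projection (for example a linear layer) takes each
    token from cfg.hidden_dim to LLM_HIDDEN_DIM; the text does not say
    how this bridging is done.
    """
    return (num_visual_tokens(cfg), LLM_HIDDEN_DIM)


if __name__ == "__main__":
    cfg = VisionConfig()
    print(num_visual_tokens(cfg))  # 256 patch tokens at 224x224
    print(projected_shape(cfg))    # (256, 4096)
```

Under these assumptions, each image contributes 256 tokens to the decoder's input sequence; at a 336x336 resolution the same arithmetic gives 576 tokens.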
Jun-22-2026, 17:05:04 GMT