Image2Struct: Benchmarking Structure Extraction for Vision-Language Models

Neural Information Processing Systems 

We introduce Image2Struct, a benchmark to evaluate vision-language models (VLMs) on extracting structure from images. Our benchmark 1) captures real-world use cases, 2) is fully automatic and does not require human judgment, and 3) is based on a renewable stream of fresh data. In Image2Struct, VLMs are prompted to generate the underlying structure (e.g., LaTeX code or HTML) from an input image (e.g., webpage screenshot). The structure is then rendered to produce an output image (e.g., rendered webpage), which is compared against the input image to produce a similarity score. This round-trip evaluation allows us to quantitatively evaluate VLMs on tasks with multiple valid structures. We create a pipeline that downloads fresh data from active online communities upon execution and evaluates the VLMs without human intervention. We introduce three domains (Webpages, LaTeX, and Musical Scores) and use five image metrics (pixel similarity, cosine similarity between the Inception vectors, learned perceptual image patch similarity, structural similarity index measure, and earth mover similarity) that allow efficient and automatic comparison between pairs of images.
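The round-trip evaluation described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not the benchmark's actual code: `round_trip_score`, `pixel_similarity`, and `toy_render` are hypothetical names, the renderer is a toy stand-in for a real LaTeX or HTML renderer, and pixel similarity is assumed to be the fraction of pixel positions that agree, which may differ from the benchmark's exact definition.

```python
from typing import Callable, List

# Grayscale image as rows of 0-255 intensities (an assumption for this sketch).
Image = List[List[int]]


def pixel_similarity(a: Image, b: Image, tolerance: int = 0) -> float:
    """Fraction of pixel positions whose intensities differ by at most `tolerance`."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in a)
    if total == 0:
        return 1.0  # two empty images are trivially identical
    matches = sum(
        1
        for ra, rb in zip(a, b)
        for pa, pb in zip(ra, rb)
        if abs(pa - pb) <= tolerance
    )
    return matches / total


def round_trip_score(
    input_image: Image,
    predicted_structure: str,
    render: Callable[[str], Image],
    metric: Callable[[Image, Image], float] = pixel_similarity,
) -> float:
    """Render the predicted structure, then compare the result with the input image."""
    output_image = render(predicted_structure)
    return metric(input_image, output_image)


def toy_render(structure: str) -> Image:
    """Hypothetical stand-in renderer: '#' becomes black (0), anything else white (255)."""
    return [[0 if ch == "#" else 255 for ch in line] for line in structure.splitlines()]


if __name__ == "__main__":
    target = toy_render("##..\n..##")
    # Two different structures that render to the same image receive the same score.
    print(round_trip_score(target, "##..\n..##", toy_render))
    print(round_trip_score(target, "##xx\nxx##", toy_render))
    # One of the eight pixels is wrong here, so the score drops.
    print(round_trip_score(target, "##..\n...#", toy_render))
```

Because the score is computed on rendered images and not on the generated text, two structurally different outputs that render identically are scored the same, which is what makes the approach usable for tasks with multiple valid structures.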