Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset

Laurençon, Hugo, Tronchon, Léo, Sanh, Victor

arXiv.org Artificial Intelligence 

Current advancements in vision-language models (VLMs) have significantly improved their capabilities, enabling them to master a variety of tasks including image captioning, question answering, and optical character recognition (OCR) (OpenAI et al., 2023; Team et al., 2023; Hong et al., 2023; Liu et al., 2024a). Despite these achievements, the task of converting screenshots of websites or web components into usable HTML code, a capability highly valuable to web developers, remains relatively unexplored, particularly in the open-source community. The development and open-source release of a model capable of such a conversion could unlock new AI-powered tools for UI developers, facilitating the creation of no-code modules and plugins for design tools like Figma. For instance, the ability to rapidly transform a design sketch into a functional UI component and code could significantly increase the iteration pace for UI developers. We posit that the primary challenge for VLMs to achieve proficiency in this specific task does not stem from the inherent difficulty of the task itself. Rather, it is the lack of a large, high-quality dataset of pairs of HTML code and associated screenshots that poses the primary obstacle.
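The dataset premise described above, pairs of HTML code and rendered screenshots, can be illustrated with a toy sketch. The `generate_pages` function below is hypothetical and template-based, not the paper's actual generation pipeline; it merely shows the kind of simple, self-contained HTML that could be fed to a headless browser to produce the paired screenshots.

```python
import random

# Toy templates standing in for real synthetic web pages (illustrative only).
TEMPLATES = [
    "<html><body><h1>{title}</h1><p>{text}</p></body></html>",
    "<html><body><button>{title}</button><p>{text}</p></body></html>",
]

TITLES = ["Welcome", "Sign up", "Pricing"]
TEXTS = ["Lorem ipsum dolor sit amet.", "Build faster with AI."]

def generate_pages(n: int, seed: int = 0) -> list[str]:
    """Produce n simple synthetic HTML pages, deterministically for a given seed."""
    rng = random.Random(seed)
    return [
        rng.choice(TEMPLATES).format(title=rng.choice(TITLES), text=rng.choice(TEXTS))
        for _ in range(n)
    ]
```

Each generated page would then be rendered in a headless browser and screenshotted, yielding one (screenshot, HTML) training pair per page.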
