LEGO Co-builder: Exploring Fine-Grained Vision-Language Modeling for Multimodal LEGO Assembly Assistants

Haochen Huang, Jiahuan Pei, Mohammad Aliannejadi, Xin Sun, Moonisa Ahsan, Chuang Yu, Zhaochun Ren, Pablo Cesar, Junxiao Wang

arXiv.org Artificial Intelligence 

Vision-language models (VLMs) face challenges in understanding and following multimodal assembly instructions, particularly when fine-grained spatial reasoning and precise object state detection are required. In this work, we present LEGO Co-builder, a hybrid benchmark combining real-world LEGO assembly logic with programmatically generated multimodal scenes. The dataset captures stepwise visual states and procedural instructions, enabling controlled evaluation of instruction following, object detection, and state detection. We introduce a unified framework and assess leading VLMs such as GPT-4o, Gemini, and Qwen-VL under zero-shot and fine-tuned settings. Our results reveal that even advanced models like GPT-4o struggle with fine-grained assembly tasks, reaching a maximum F1 score of only 40.54% on state detection, which highlights gaps in fine-grained visual understanding. We release the benchmark, codebase, and generation pipeline to support future research on multimodal assembly assistants grounded in real-world workflows.
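
As a reading aid for the headline metric, the sketch below shows one plausible way a per-step state-detection F1 could be computed. It is a minimal illustration, not the paper's confirmed protocol: the label format (`"object:state"` strings), the step-keyed dictionaries, and the micro-averaging choice are all assumptions introduced here.

```python
# Minimal sketch of micro-averaged F1 for per-step state detection.
# Assumptions (not confirmed by the paper): each assembly step has a set of
# gold object-state labels and a set of labels predicted by the VLM; a
# prediction counts as correct only on exact string match.

from typing import Dict, Set


def state_detection_f1(gold: Dict[str, Set[str]],
                       pred: Dict[str, Set[str]]) -> float:
    """Micro-averaged F1 over all assembly steps.

    gold/pred map a step id to a set of object-state labels,
    e.g. {"step_03": {"brick_2x4:attached", "plate_1x2:loose"}}.
    """
    tp = fp = fn = 0
    for step_id, gold_states in gold.items():
        pred_states = pred.get(step_id, set())
        tp += len(gold_states & pred_states)  # correctly detected states
        fp += len(pred_states - gold_states)  # spurious (hallucinated) states
        fn += len(gold_states - pred_states)  # missed states
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)


if __name__ == "__main__":
    gold = {"step_01": {"brick_2x4:attached", "plate_1x2:loose"}}
    pred = {"step_01": {"brick_2x4:attached", "tile_1x1:attached"}}
    print(f"F1 = {state_detection_f1(gold, pred):.4f}")  # F1 = 0.5000
```

Under this exact-match scoring, a model that correctly identifies most objects but misjudges a few attachment states is penalized on both precision and recall, which is consistent with low F1 scores even for otherwise strong VLMs.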