3Dify: a Framework for Procedural 3D-CG Generation Assisted by LLMs Using MCP and RAG

Hayashi, Shun-ichiro, Mukunoki, Daichi, Hoshino, Tetsuya, Ohshima, Satoshi, Katagiri, Takahiro

arXiv.org Artificial Intelligence 

Abstract--This paper proposes "3Dify," a procedural 3D computer graphics (3D-CG) generation framework utilizing Large Language Models (LLMs). The framework enables users to generate 3D-CG content solely through natural language instructions. To support 3D-CG generation, 3Dify automates the operation of various Digital Content Creation (DCC) tools via the Model Context Protocol (MCP). When a DCC tool does not support MCP-based interaction, the framework employs the Computer-Using Agent (CUA) method to automate Graphical User Interface (GUI) operations. Moreover, to enhance image generation quality, 3Dify allows users to provide feedback by selecting preferred images from multiple candidates. The LLM then learns variable patterns from these selections and applies them to subsequent generations. Furthermore, 3Dify supports the integration of locally deployed LLMs, enabling users to utilize custom-developed models and to reduce both the time and monetary costs associated with external API calls by leveraging their own computational resources. Its applications extend beyond entertainment industries such as movies and games to areas including product design in manufacturing, surgical simulation in healthcare, education, and digital-twin technologies that replicate the real world within virtual spaces.
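The abstract describes a two-tier automation strategy: prefer a DCC tool's MCP interface when one exists, and fall back to CUA-driven GUI automation otherwise. A minimal sketch of that dispatch logic, with hypothetical names and stubbed tool calls (none of these identifiers come from the paper), might look like:

```python
# Illustrative sketch only: route a generation step to a DCC tool's MCP
# interface when available, otherwise fall back to CUA-style GUI automation.
# All names (DCCTool, dispatch, add_mesh) are hypothetical, not from 3Dify.

from dataclasses import dataclass, field


@dataclass
class DCCTool:
    name: str
    supports_mcp: bool
    # Actions exposed over MCP, e.g. {"add_mesh": <callable>}; empty if none.
    mcp_tools: dict = field(default_factory=dict)


def dispatch(tool: DCCTool, action: str, params: dict) -> str:
    """Prefer an MCP tool call; fall back to GUI automation otherwise."""
    if tool.supports_mcp and action in tool.mcp_tools:
        return tool.mcp_tools[action](**params)
    # CUA fallback: an agent operates the tool's GUI (stubbed here).
    return f"CUA: performed '{action}' on {tool.name} via GUI"


blender = DCCTool(
    name="Blender",
    supports_mcp=True,
    mcp_tools={"add_mesh": lambda kind: f"MCP: added {kind} mesh"},
)
legacy = DCCTool(name="LegacyDCC", supports_mcp=False)

print(dispatch(blender, "add_mesh", {"kind": "cube"}))   # MCP path
print(dispatch(legacy, "add_mesh", {"kind": "cube"}))    # CUA fallback
```

In a real deployment the MCP path would invoke a tool server (e.g. one wrapping Blender's scripting API) rather than a lambda, and the CUA branch would hand the action to a vision-capable agent that clicks through the GUI.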