instruction manual
Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals
Wu, Yue, Fan, Yewen, Liang, Paul Pu, Azaria, Amos, Li, Yuanzhi, Mitchell, Tom M.
High sample complexity has long been a challenge for RL. Humans, by contrast, learn to perform tasks not only from interaction or demonstrations, but also by reading unstructured text documents, e.g., instruction manuals. Instruction manuals and wiki pages are among the most abundant data that could inform agents of valuable features and policies or task-specific environmental dynamics and reward structures. Therefore, we hypothesize that the ability to utilize human-written instruction manuals to assist learning policies for specific tasks should lead to a more efficient and better-performing agent. We propose the Read and Reward framework. Read and Reward speeds up RL algorithms on Atari games by reading manuals released by the Atari game developers. Our framework consists of a QA Extraction module that extracts and summarizes relevant information from the manual and a Reasoning module that evaluates object-agent interactions based on information from the manual. An auxiliary reward is then provided to a standard A2C RL agent when interaction is detected. Experimentally, various RL algorithms obtain significant improvements in performance and training speed when assisted by our design.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
- Leisure & Entertainment > Sports (1.00)
- Leisure & Entertainment > Games > Computer Games (1.00)
- Education (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.72)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
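Read and Reward's auxiliary-reward idea is easy to picture as a reward-shaping wrapper. Below is a minimal sketch, assuming a Gymnasium-style environment; `interaction_fn` is a hypothetical stand-in for the paper's Reasoning module (which judges whether a detected object-agent interaction is beneficial according to the manual), not the authors' actual code.

```python
import gymnasium as gym

class AuxiliaryRewardWrapper(gym.Wrapper):
    """Add a bonus to the environment reward whenever a
    manual-derived interaction rule fires."""

    def __init__(self, env, interaction_fn, bonus: float = 1.0):
        super().__init__(env)
        self.interaction_fn = interaction_fn  # obs -> bool (hypothetical hook)
        self.bonus = bonus

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if self.interaction_fn(obs):  # interaction detected and judged beneficial
            reward += self.bonus      # auxiliary reward on top of the game score
        return obs, reward, terminated, truncated, info

# e.g. env = AuxiliaryRewardWrapper(gym.make("ALE/MsPacman-v5"), my_rule)
```

Because the wrapper only shapes rewards, the underlying agent (A2C in the paper) needs no modification.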
Manual2Skill++: Connector-Aware General Robotic Assembly from Instruction Manuals via Vision-Language Models
Tie, Chenrui, Sun, Shengxiang, Lin, Yudi, Wang, Yanbo, Li, Zhongrui, Zhong, Zhouhan, Zhu, Jinxuan, Pang, Yiman, Chen, Haonan, Chen, Junting, Wu, Ruihai, Shao, Lin
Assembly hinges on reliably forming connections between parts, yet most robotic approaches plan assembly sequences and part poses while treating connectors as an afterthought. Connections represent the critical "last mile" of assembly execution: while task planning may sequence operations and motion planning may position parts, the precise establishment of physical connections ultimately determines assembly success or failure. In this paper, we treat connections as first-class primitives in assembly representation, including connector types, specifications, quantities, and placement locations. Drawing inspiration from how humans learn assembly tasks through step-by-step instruction manuals, we present Manual2Skill++, a vision-language framework that automatically extracts structured connection information from assembly manuals. We encode assembly tasks as hierarchical graphs where nodes represent parts and sub-assemblies, and edges explicitly model connection relationships between components. A large-scale vision-language model parses symbolic diagrams and annotations in manuals to instantiate these graphs, leveraging the rich connection knowledge embedded in human-designed instructions. We curate a dataset containing over 20 assembly tasks with diverse connector types to validate our representation extraction approach, and evaluate the complete task understanding-to-execution pipeline across four complex assembly scenarios in simulation, spanning furniture, toys, and manufacturing components with real-world correspondence.
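The hierarchical graph with connection-carrying edges is the core representation here. A minimal sketch of what such a structure could look like in code follows; all field and variable names are illustrative, and the paper's actual schema may differ.

```python
from dataclasses import dataclass, field

@dataclass
class Connection:
    """A connection as a first-class primitive."""
    connector_type: str   # e.g. "cam lock", "wood dowel", "screw"
    spec: str             # e.g. "M6 x 30 mm"
    quantity: int
    placement: str        # where the connectors go on the mating parts

@dataclass
class Node:
    """A part or a sub-assembly; sub-assemblies own child nodes."""
    name: str
    children: list["Node"] = field(default_factory=list)

# Edges explicitly model connection relationships between components.
leg = Node("table_leg")
top = Node("table_top")
edges = [(leg, top, Connection("cam lock", "M6 x 30 mm", 2,
                               "pre-drilled holes in leg flange"))]
```

Making the connector data part of the edge, rather than an afterthought of the pose planner, is what lets downstream execution reason about the "last mile" directly.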
LEGO Co-builder: Exploring Fine-Grained Vision-Language Modeling for Multimodal LEGO Assembly Assistants
Huang, Haochen, Pei, Jiahuan, Aliannejadi, Mohammad, Sun, Xin, Ahsan, Moonisa, Yu, Chuang, Ren, Zhaochun, Cesar, Pablo, Wang, Junxiao
Vision-language models (VLMs) are facing the challenges of understanding and following multimodal assembly instructions, particularly when fine-grained spatial reasoning and precise object state detection are required. In this work, we explore LEGO Co-builder, a hybrid benchmark combining real-world LEGO assembly logic with programmatically generated multimodal scenes. The dataset captures stepwise visual states and procedural instructions, allowing controlled evaluation of instruction-following, object detection, and state detection. We introduce a unified framework and assess leading VLMs such as GPT-4o, Gemini, and Qwen-VL, under zero-shot and fine-tuned settings. Our results reveal that even advanced models like GPT-4o struggle with fine-grained assembly tasks, with a maximum F1 score of just 40.54% on state detection, highlighting gaps in fine-grained visual understanding. We release the benchmark, codebase, and generation pipeline to support future research on multimodal assembly assistants grounded in real-world workflows.
- Europe > Netherlands > North Holland > Amsterdam (0.05)
- Europe > Netherlands > South Holland > Leiden (0.04)
- Europe > Netherlands > South Holland > Delft (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)
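For a concrete sense of the zero-shot state-detection setting the benchmark evaluates, here is one plausible probe of a VLM using the OpenAI chat API. This is not the paper's evaluation harness; the file name, prompt wording, and label set are hypothetical.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def detect_state(image_path: str, instruction: str) -> str:
    """Ask a VLM whether the pictured build matches the current step."""
    b64 = base64.b64encode(open(image_path, "rb").read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Instruction: {instruction}\n"
                         "Does the pictured assembly match this step? "
                         "Answer 'complete', 'incomplete', or 'incorrect'."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

# e.g. detect_state("step_07.png", "Attach the 2x4 red brick to the base plate.")
```

The low reported F1 suggests that even with a clean probe like this, current VLMs often misjudge fine-grained brick states.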
Autonomous Workflow for Multimodal Fine-Grained Training Assistants Towards Mixed Reality
Pei, Jiahuan, Viola, Irene, Huang, Haochen, Wang, Junxiao, Ahsan, Moonisa, Ye, Fanghua, Yiming, Jiang, Sai, Yao, Wang, Di, Chen, Zhumin, Ren, Pengjie, Cesar, Pablo
Autonomous artificial intelligence (AI) agents have emerged as promising protocols for automatically understanding language-based environments, particularly with the rapid development of large language models (LLMs). However, a fine-grained, comprehensive understanding of multimodal environments remains under-explored. This work designs an autonomous workflow tailored for integrating AI agents seamlessly into extended reality (XR) applications for fine-grained training. We present a demonstration of a multimodal fine-grained training assistant for LEGO brick assembly in a pilot XR environment. Specifically, we design a cerebral language agent that integrates an LLM with memory, planning, and interaction with XR tools and a vision-language agent, enabling the agent to decide its actions based on past experiences. Furthermore, we introduce LEGO-MRTA, a multimodal fine-grained assembly dialogue dataset synthesized automatically in the workflow served by a commercial LLM. This dataset comprises multimodal instruction manuals, conversations, XR responses, and vision question answering. Lastly, we benchmark several prevailing open-source LLMs, assessing their performance with and without fine-tuning on the proposed dataset. We anticipate that the broader impact of this workflow will advance the development of smarter assistants for seamless user interaction in XR environments, fostering research in both AI and HCI communities.
- Europe > Netherlands > South Holland > Delft (0.04)
- North America > United States (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- (4 more...)
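The "cerebral language agent" pattern (recall memory, plan with the LLM, act, remember) can be sketched in a few lines. The class, method, and prompt below are illustrative stand-ins under that reading of the abstract, not the paper's API.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class CerebralAgent:
    """Toy loop: recall recent memory, ask the LLM for the next step,
    remember the outcome for future decisions."""
    llm: Callable[[str], str]              # any text-in/text-out model
    memory: list[str] = field(default_factory=list)

    def act(self, observation: str) -> str:
        recalled = "\n".join(self.memory[-5:])  # past experiences inform the plan
        step = self.llm(
            f"Recent memory:\n{recalled}\n"
            f"XR observation: {observation}\n"
            "Plan the next LEGO assembly action:"
        )
        self.memory.append(f"{observation} -> {step}")
        return step
```

In the full system this loop would also route calls to XR tools and a vision-language agent; the sketch keeps only the memory-conditioned decision step.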
AgentKit: Flow Engineering with Graphs, not Coding
Wu, Yue, Fan, Yewen, Min, So Yeon, Prabhumoye, Shrimai, McAleer, Stephen, Bisk, Yonatan, Salakhutdinov, Ruslan, Li, Yuanzhi, Mitchell, Tom
We propose an intuitive LLM prompting framework (AgentKit) for multifunctional agents. AgentKit offers a unified framework for explicitly constructing a complex "thought process" from simple natural language prompts. The basic building block in AgentKit is a node, containing a natural language prompt for a specific subtask. The user then puts together chains of nodes, like stacking LEGO pieces. The chains of nodes can be designed to explicitly enforce a naturally structured "thought process". For example, for the task of writing a paper, one may start with the thought process of 1) identify a core message, 2) identify prior research gaps, etc. The nodes in AgentKit can be designed and combined in different ways to implement multiple advanced capabilities, including on-the-fly hierarchical planning, reflection, and learning from interactions. In addition, because of its modular nature and an intuitive design that mirrors an explicit human thought process, a basic agent can be implemented as simply as a list of prompts for the subtasks, and can therefore be designed and tuned by someone without any programming experience. Quantitatively, we show that agents designed through AgentKit achieve SOTA performance on WebShop and Crafter. These advances underscore AgentKit's potential in making LLM agents effective and accessible for a wider range of applications. https://github.com/holmeswww/AgentKit
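The node-and-chain idea is concrete enough to sketch. The following is a concept sketch only, assuming a generic text-in/text-out `llm` callable; see the AgentKit repo linked above for the real node API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Node:
    """One AgentKit-style building block: a natural-language prompt
    for one specific subtask."""
    prompt: str

def run_chain(nodes: list[Node], llm: Callable[[str], str], task: str) -> str:
    """Evaluate nodes in order, feeding each one the outputs so far,
    like stacking LEGO pieces into an explicit 'thought process'."""
    outputs: list[str] = []
    for node in nodes:
        context = "\n".join(outputs)
        outputs.append(llm(f"Task: {task}\nSo far:\n{context}\n{node.prompt}"))
    return outputs[-1]

# e.g. the paper-writing thought process from the abstract
chain = [Node("Identify a core message."),
         Node("Identify prior research gaps.")]
```

Because a chain is just an ordered list of prompts, it can be edited by a non-programmer, which is the accessibility claim the abstract makes.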
The revenge of the video game manual
Players of a certain age will no doubt have fond memories of the paper instruction manuals that once came with every video game. Dan Marshall, creator of The Swindle and Lair of the Clockwork God, certainly does. He remembers the ritual of poring over the manual for a new game on the bus ride home from the shops, trying to absorb all of its information in preparation for playing the game itself. He vividly recalls receiving Bullfrog's 1993 game Syndicate via mail order early one morning, then impatiently waiting hours for his brother to wake up so he could play it on the PC in his room. "And for that solid time I did nothing but read the manual over and over and over again," Marshall says.