instruction manual
Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals
Wu, Yue, Fan, Yewen, Liang, Paul Pu, Azaria, Amos, Li, Yuanzhi, Mitchell, Tom M.
High sample complexity has long been a challenge for RL. Humans, by contrast, learn to perform tasks not only from interaction or demonstrations, but also by reading unstructured text documents, e.g., instruction manuals. Instruction manuals and wiki pages are among the most abundant data that could inform agents of valuable features and policies or task-specific environmental dynamics and reward structures. Therefore, we hypothesize that the ability to utilize human-written instruction manuals to assist learning policies for specific tasks should lead to a more efficient and better-performing agent. We propose the Read and Reward framework. Read and Reward speeds up RL algorithms on Atari games by reading manuals released by the Atari game developers. Our framework consists of a QA Extraction module that extracts and summarizes relevant information from the manual and a Reasoning module that evaluates object-agent interactions based on information from the manual. An auxiliary reward is then provided to a standard A2C RL agent when interaction is detected. Experimentally, various RL algorithms obtain significant improvements in performance and training speed when assisted by our design.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
- Leisure & Entertainment > Sports (1.00)
- Leisure & Entertainment > Games > Computer Games (1.00)
- Education (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.72)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
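Read and Reward's auxiliary-reward idea is easy to picture as a reward-shaping wrapper. Below is a minimal sketch, assuming a Gymnasium-style environment; `interaction_fn` is a hypothetical stand-in for the paper's Reasoning module (which judges whether a detected object-agent interaction is beneficial according to the manual), not the authors' actual code.

```python
import gymnasium as gym

class AuxiliaryRewardWrapper(gym.Wrapper):
    """Add a bonus to the environment reward whenever a
    manual-derived interaction rule fires."""

    def __init__(self, env, interaction_fn, bonus: float = 1.0):
        super().__init__(env)
        self.interaction_fn = interaction_fn  # obs -> bool (hypothetical hook)
        self.bonus = bonus

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if self.interaction_fn(obs):  # interaction detected and judged beneficial
            reward += self.bonus      # auxiliary reward on top of the game score
        return obs, reward, terminated, truncated, info

# e.g. env = AuxiliaryRewardWrapper(gym.make("ALE/MsPacman-v5"), my_rule)
```

Because the wrapper only shapes rewards, the underlying agent (A2C in the paper) needs no modification.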
Manual2Skill++: Connector-Aware General Robotic Assembly from Instruction Manuals via Vision-Language Models
Tie, Chenrui, Sun, Shengxiang, Lin, Yudi, Wang, Yanbo, Li, Zhongrui, Zhong, Zhouhan, Zhu, Jinxuan, Pang, Yiman, Chen, Haonan, Chen, Junting, Wu, Ruihai, Shao, Lin
Assembly hinges on reliably forming connections between parts, yet most robotic approaches plan assembly sequences and part poses while treating connectors as an afterthought. Connections represent the critical "last mile" of assembly execution: while task planning may sequence operations and motion planning may position parts, the precise establishment of physical connections ultimately determines assembly success or failure. In this paper, we treat connections as first-class primitives in assembly representation, including connector types, specifications, quantities, and placement locations. Drawing inspiration from how humans learn assembly tasks through step-by-step instruction manuals, we present Manual2Skill++, a vision-language framework that automatically extracts structured connection information from assembly manuals. We encode assembly tasks as hierarchical graphs where nodes represent parts and sub-assemblies, and edges explicitly model connection relationships between components. A large-scale vision-language model parses symbolic diagrams and annotations in manuals to instantiate these graphs, leveraging the rich connection knowledge embedded in human-designed instructions. We curate a dataset containing over 20 assembly tasks with diverse connector types to validate our representation extraction approach, and evaluate the complete task understanding-to-execution pipeline across four complex assembly scenarios in simulation, spanning furniture, toys, and manufacturing components with real-world correspondence.
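The hierarchical graph with connection-carrying edges is the core representation here. A minimal sketch of what such a structure could look like in code follows; all field and variable names are illustrative, and the paper's actual schema may differ.

```python
from dataclasses import dataclass, field

@dataclass
class Connection:
    """A connection as a first-class primitive."""
    connector_type: str   # e.g. "cam lock", "wood dowel", "screw"
    spec: str             # e.g. "M6 x 30 mm"
    quantity: int
    placement: str        # where the connectors go on the mating parts

@dataclass
class Node:
    """A part or a sub-assembly; sub-assemblies own child nodes."""
    name: str
    children: list["Node"] = field(default_factory=list)

# Edges explicitly model connection relationships between components.
leg = Node("table_leg")
top = Node("table_top")
edges = [(leg, top, Connection("cam lock", "M6 x 30 mm", 2,
                               "pre-drilled holes in leg flange"))]
```

Making the connector data part of the edge, rather than an afterthought of the pose planner, is what lets downstream execution reason about the "last mile" directly.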
LEGO Co-builder: Exploring Fine-Grained Vision-Language Modeling for Multimodal LEGO Assembly Assistants
Huang, Haochen, Pei, Jiahuan, Aliannejadi, Mohammad, Sun, Xin, Ahsan, Moonisa, Yu, Chuang, Ren, Zhaochun, Cesar, Pablo, Wang, Junxiao
Vision-language models (VLMs) are facing the challenges of understanding and following multimodal assembly instructions, particularly when fine-grained spatial reasoning and precise object state detection are required. In this work, we explore LEGO Co-builder, a hybrid benchmark combining real-world LEGO assembly logic with programmatically generated multimodal scenes. The dataset captures stepwise visual states and procedural instructions, allowing controlled evaluation of instruction-following, object detection, and state detection. We introduce a unified framework and assess leading VLMs such as GPT-4o, Gemini, and Qwen-VL, under zero-shot and fine-tuned settings. Our results reveal that even advanced models like GPT-4o struggle with fine-grained assembly tasks, with a maximum F1 score of just 40.54% on state detection, highlighting gaps in fine-grained visual understanding. We release the benchmark, codebase, and generation pipeline to support future research on multimodal assembly assistants grounded in real-world workflows.
- Europe > Netherlands > North Holland > Amsterdam (0.05)
- Europe > Netherlands > South Holland > Leiden (0.04)
- Europe > Netherlands > South Holland > Delft (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)
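For a concrete sense of the zero-shot state-detection setting the benchmark evaluates, here is one plausible probe of a VLM using the OpenAI chat API. This is not the paper's evaluation harness; the file name, prompt wording, and label set are hypothetical.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def detect_state(image_path: str, instruction: str) -> str:
    """Ask a VLM whether the pictured build matches the current step."""
    b64 = base64.b64encode(open(image_path, "rb").read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Instruction: {instruction}\n"
                         "Does the pictured assembly match this step? "
                         "Answer 'complete', 'incomplete', or 'incorrect'."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

# e.g. detect_state("step_07.png", "Attach the 2x4 red brick to the base plate.")
```

The low reported F1 suggests that even with a clean probe like this, current VLMs often misjudge fine-grained brick states.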
Autonomous Workflow for Multimodal Fine-Grained Training Assistants Towards Mixed Reality
Pei, Jiahuan, Viola, Irene, Huang, Haochen, Wang, Junxiao, Ahsan, Moonisa, Ye, Fanghua, Yiming, Jiang, Sai, Yao, Wang, Di, Chen, Zhumin, Ren, Pengjie, Cesar, Pablo
Autonomous artificial intelligence (AI) agents have emerged as promising protocols for automatically understanding language-based environments, particularly with the rapid development of large language models (LLMs). However, a fine-grained, comprehensive understanding of multimodal environments remains under-explored. This work designs an autonomous workflow tailored for integrating AI agents seamlessly into extended reality (XR) applications for fine-grained training. We present a demonstration of a multimodal fine-grained training assistant for LEGO brick assembly in a pilot XR environment. Specifically, we design a cerebral language agent that integrates an LLM with memory, planning, and interaction with XR tools and a vision-language agent, enabling the agent to decide its actions based on past experiences. Furthermore, we introduce LEGO-MRTA, a multimodal fine-grained assembly dialogue dataset synthesized automatically in the workflow served by a commercial LLM. This dataset comprises multimodal instruction manuals, conversations, XR responses, and vision question answering. Lastly, we benchmark several prevailing open-source LLMs, assessing their performance with and without fine-tuning on the proposed dataset. We anticipate that the broader impact of this workflow will advance the development of smarter assistants for seamless user interaction in XR environments, fostering research in both AI and HCI communities.
- Europe > Netherlands > South Holland > Delft (0.04)
- North America > United States (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- (4 more...)
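The "cerebral language agent" pattern (recall memory, plan with the LLM, act, remember) can be sketched in a few lines. The class, method, and prompt below are illustrative stand-ins under that reading of the abstract, not the paper's API.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class CerebralAgent:
    """Toy loop: recall recent memory, ask the LLM for the next step,
    remember the outcome for future decisions."""
    llm: Callable[[str], str]              # any text-in/text-out model
    memory: list[str] = field(default_factory=list)

    def act(self, observation: str) -> str:
        recalled = "\n".join(self.memory[-5:])  # past experiences inform the plan
        step = self.llm(
            f"Recent memory:\n{recalled}\n"
            f"XR observation: {observation}\n"
            "Plan the next LEGO assembly action:"
        )
        self.memory.append(f"{observation} -> {step}")
        return step
```

In the full system this loop would also route calls to XR tools and a vision-language agent; the sketch keeps only the memory-conditioned decision step.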
AgentKit: Flow Engineering with Graphs, not Coding
Wu, Yue, Fan, Yewen, Min, So Yeon, Prabhumoye, Shrimai, McAleer, Stephen, Bisk, Yonatan, Salakhutdinov, Ruslan, Li, Yuanzhi, Mitchell, Tom
We propose an intuitive LLM prompting framework (AgentKit) for multifunctional agents. AgentKit offers a unified framework for explicitly constructing a complex "thought process" from simple natural language prompts. The basic building block in AgentKit is a node, containing a natural language prompt for a specific subtask. The user then puts together chains of nodes, like stacking LEGO pieces. The chains of nodes can be designed to explicitly enforce a naturally structured "thought process". For example, for the task of writing a paper, one may start with the thought process of 1) identify a core message, 2) identify prior research gaps, etc. The nodes in AgentKit can be designed and combined in different ways to implement multiple advanced capabilities, including on-the-fly hierarchical planning, reflection, and learning from interactions. In addition, because of its modular nature and an intuitive design that mirrors an explicit human thought process, a basic agent can be implemented as simply as a list of prompts for the subtasks, and can therefore be designed and tuned by someone without any programming experience. Quantitatively, we show that agents designed through AgentKit achieve SOTA performance on WebShop and Crafter. These advances underscore AgentKit's potential in making LLM agents effective and accessible for a wider range of applications. https://github.com/holmeswww/AgentKit
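The node-and-chain idea is concrete enough to sketch. The following is a concept sketch only, assuming a generic text-in/text-out `llm` callable; see the AgentKit repo linked above for the real node API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Node:
    """One AgentKit-style building block: a natural-language prompt
    for one specific subtask."""
    prompt: str

def run_chain(nodes: list[Node], llm: Callable[[str], str], task: str) -> str:
    """Evaluate nodes in order, feeding each one the outputs so far,
    like stacking LEGO pieces into an explicit 'thought process'."""
    outputs: list[str] = []
    for node in nodes:
        context = "\n".join(outputs)
        outputs.append(llm(f"Task: {task}\nSo far:\n{context}\n{node.prompt}"))
    return outputs[-1]

# e.g. the paper-writing thought process from the abstract
chain = [Node("Identify a core message."),
         Node("Identify prior research gaps.")]
```

Because a chain is just an ordered list of prompts, it can be edited by a non-programmer, which is the accessibility claim the abstract makes.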
The revenge of the video game manual
Players of a certain age will no doubt have fond memories of the paper instruction manuals that once came with every video game. Dan Marshall, creator of The Swindle and Lair of the Clockwork God, certainly does. He remembers the ritual of poring over the manual for a new game on the bus ride home from the shops, trying to absorb all of its information in preparation for playing the game itself. He vividly recalls receiving Bullfrog's 1993 game Syndicate via mail order early one morning, then impatiently waiting hours for his brother to wake up so he could play it on the PC in his room. "And for that solid time I did nothing but read the manual over and over and over again," Marshall says.