natural language instruction
Supplementary Materials for On the Effects of Data Scale on Computer Control Agents
For completeness, in the following we include a datasheet based on the format of [1]. For what purpose was the dataset created? Was there a specific task in mind? Who created the dataset (e.g., which team, research group) and on behalf of which entity What do the instances that comprise the dataset represent (e.g., documents, photos, people, The dataset contains episodes of human demonstrations for mobile device control. How many instances are there in total (of each type, if appropriate)?
- Information Technology > Artificial Intelligence (1.00)
- Information Technology > Communications > Mobile (0.72)
- North America > Mexico > Mexico City > Mexico City (0.04)
- Asia > Japan > Honshū > Chūbu > Toyama Prefecture > Toyama (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.04)
- Information Technology > Communications > Mobile (0.71)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
- Information Technology > Human Computer Interaction (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Oceania > Australia > New South Wales > Sydney (0.04)
- North America > Canada (0.04)
- Europe > Sweden > Stockholm > Stockholm (0.04)
- Leisure & Entertainment > Games > Computer Games (0.71)
- Education (0.68)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > United States > Iowa (0.04)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > United States > Iowa (0.04)
- Asia > Middle East > Israel (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Colorado > Denver County > Denver (0.04)
- Asia > China > Jiangsu Province > Nanjing (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Asia > China > Hong Kong (0.04)
AndroidInTheWild: A Large-Scale Dataset For Android Device Control
There is a growing interest in device-control systems that can interpret human natural language instructions and execute them on a digital device by directly controlling its user interface. We present a dataset for device-control research, Android in the Wild (AitW), which is orders of magnitude larger than current datasets. The dataset contains human demonstrations of device interactions, including the screens and actions, and corresponding natural language instructions. It consists of 715k episodes spanning 30k unique instructions, four versions of Android (v10-13), and eight device types (Pixel 2 XL to Pixel 6) with varying screen resolutions. It contains multi-step tasks that require semantic understanding of language and visual context. This dataset poses a new challenge: actions available through the user interface must be inferred from their visual appearance, and, instead of simple UI element-based actions, the action space consists of precise gestures (e.g., horizontal scrolls to operate carousel widgets). We organize our dataset to encourage robustness analysis of device-control systems, i.e., how well a system performs in the presence of new task descriptions, new applications, or new platform versions. We develop two agents and report performance across the dataset.
- Telecommunications (0.43)
- Information Technology > Software (0.43)
Introspective Planning: Aligning Robots' Uncertainty with Inherent Task Ambiguity
Large language models (LLMs) exhibit advanced reasoning skills, enabling robots to comprehend natural language instructions and strategically plan high-level actions through proper grounding. However, LLM hallucination may result in robots confidently executing plans that are misaligned with user goals or even unsafe in critical scenarios. Additionally, inherent ambiguity in natural language instructions can introduce uncertainty into the LLM's reasoning and planning. We propose introspective planning, a systematic approach that guides LLMs to refine their own uncertainty in alignment with inherent task ambiguity. Our approach constructs a knowledge base containing introspective reasoning examples as post-hoc rationalizations of human-selected safe and compliant plans, which are retrieved during deployment. Evaluations on three tasks, including a new safe mobile manipulation benchmark, indicate that introspection substantially improves both compliance and safety over state-of-the-art LLM-based planning methods. Additionally, we empirically show that introspective planning, in combination with conformal prediction, achieves tighter confidence bounds, maintaining statistical success guarantees while minimizing unnecessary user clarification requests.