webshop
1 Details about the observation formats Figure 1: Example of the observation of WebShop The observation of WebShop is simplified based on the text_rich
The observation of WikiHow is represented in exactly the same way with Zhang et al. [2023]. Table 1: Patterns of WebShop pages Pattern Description search The page to search for an item itemlisting The page listing the search results item The information page of a specific item others The item description page, item feature page, and review pageThe similarity lookup table is defined in Table 2. 1 Table 2: Lookup table of the page similarity of WebShop search itemlisting item others search 1 0 0 0 itemlisting 0 1 0 0 item 0 0 1 0.3 others 0 0 0.3 1 2.2 Lookup table of the instruction similarity function of WikiHow Table 3. Table 3: Patterns of WikiHow instructions Pattern Name Pattern Template search Search an article to learn . . . Owing to the limit of budgets, a subset of only 20 tasks is sampled from the full test set. The visualization is available in Figure 2. It can be seen that the performance of R However, there seems to be a saturation for the performance, which may be attributed to the limited number of the active exemplars and training tasks. The saturation of the average reward comes later than that of the success rate. Double Q-Learning [van Hasselt, 2010] is usually leveraged to ameliorate over-estimation for lookup-based Q-Learning.
- Information Technology > Security & Privacy (0.67)
- Leisure & Entertainment > Games > Computer Games (0.46)
WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
Most existing benchmarks for grounding language in interactive environments either lack realistic linguistic elements, or prove difficult to scale up due to substantial human involvement in the collection of data or feedback signals. We develop WebShop - a simulated e-commerce website environment with 1.18 million real-world products and 12,087 crowd-sourced text instructions. In this environment, an agent needs to navigate multiple types of webpages and issue diverse actions to find, customize, and purchase a product given an instruction. WebShop provides several challenges including understanding compositional instructions, query (re-)formulation, dealing with noisy text in webpages, and performing strategic exploration. We collect over 1,600 human trajectories to first validate the benchmark, then train and evaluate a diverse range of agents using reinforcement learning, imitation learning, and pre-trained image and language models. Our best model achieves a task success rate of 29%, which significantly outperforms rule heuristics but is far lower than expert human performance (59%). We also analyze agent and human trajectories and ablate various model components to provide insights for developing future agents with stronger language understanding and decision making abilities. Finally, we show our agent trained on WebShop exhibits non-trivial sim-to-real transfer when evaluated on amazon.com
- Europe > Middle East > Malta (0.04)
- Asia > Japan > Honshū > Chūbu > Toyama Prefecture > Toyama (0.04)
- Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
- North America > United States > Michigan (0.04)
- Asia > Singapore (0.04)
- (2 more...)
- Education > Educational Setting > Online (0.46)
- Information Technology > Services > e-Commerce Services (0.46)
DEPO: Dual-Efficiency Preference Optimization for LLM Agents
Chen, Sirui, Zhao, Mengshi, Xu, Lei, Zhao, Yuying, Zhu, Beier, Zhang, Hanwang, Zhao, Shengjie, Lu, Chaochao
Recent advances in large language models (LLMs) have greatly improved their reasoning and decision-making abilities when deployed as agents. Richer reasoning, however, often comes at the cost of longer chain of thought (CoT), hampering interaction efficiency in real-world scenarios. Nevertheless, there still lacks systematic definition of LLM agent efficiency, hindering targeted improvements. To this end, we introduce dual-efficiency, comprising (i) step-level efficiency, which minimizes tokens per step, and (ii) trajectory-level efficiency, which minimizes the number of steps to complete a task. Building on this definition, we propose DEPO, a dual-efficiency preference optimization method that jointly rewards succinct responses and fewer action steps. Experiments on WebShop and BabyAI show that DEPO cuts token usage by up to 60.9% and steps by up to 26.9%, while achieving up to a 29.3% improvement in performance. DEPO also generalizes to three out-of-domain math benchmarks and retains its efficiency gains when trained on only 25% of the data. Our project page is at https://opencausalab.github.io/DEPO.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
- Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
- North America > United States > Michigan (0.04)
- Asia > Singapore (0.04)
- (2 more...)
- Education > Educational Setting > Online (0.68)
- Information Technology > Services (0.46)
- Leisure & Entertainment > Games (0.68)
- Information Technology > Security & Privacy (0.46)
Fine-tuning with RAG for Improving LLM Learning of New Skills
Ibrahim, Humaid, Rozanov, Nikolai, Rei, Marek
Large language model (LLM) agents deployed for multi-step tasks frequently fail in predictable ways: attempting actions with unmet preconditions, issuing redundant commands, or mishandling environment constraints. While retrieval-augmented generation (RAG) can improve performance by providing runtime guidance, it requires maintaining external knowledge databases and adds computational overhead at every deployment. We propose a simple pipeline that converts inference-time retrieval into learned competence through distillation. Our approach: (1) extracts compact, reusable hints from agent failures, (2) uses these hints to generate improved teacher trajectories via one-shot retrieval at episode start, and (3) trains student models on these trajectories with hint strings removed, forcing internalization rather than memorization. Across two interactive benchmarks, ALFWorld (household tasks) and WebShop (online shopping), distilled students consistently outperform baseline agents, achieving up to 91% success on ALFWorld (vs. The approach generalizes across model scales (7B/14B parameters) and agent architectures (ReAct/StateAct), demonstrating that retrieval benefits can be effectively internalized through targeted fine-tuning without permanent runtime dependencies. Large language models are increasingly deployed as agents that interact with environments to complete multi-step tasks. Success requires not just generating plausible text but maintaining goals across extended interactions, managing state and preconditions, and recovering from errors. Prior work has explored multiple approaches to improve agent performance. Structured prompting methods like ReAct (Y ao et al., 2023b) and StateAct (Rozanov & Rei, 2025) provide scaffolding for reasoning and state tracking. Self-reflection approaches such as Reflexion (Shinn et al., 2023) enable learning from mistakes across multiple attempts. Retrieval-augmented methods (Lewis et al., 2021; Zhao et al., 2024; Fu et al., 2024) inject external knowledge to guide decisions.