Can LLM Agents Simulate Multi-Turn Human Behavior? Evidence from Real Online Customer Behavior Data
Lu, Yuxuan, Huang, Jing, Han, Yan, Yao, Bingsheng, Bei, Sisong, Gesi, Jiri, Xie, Yaochen, Wang, Zheshen, He, Qi, Wang, Dakuo
Recent research shows that LLM Agents can generate "believable" human behaviors via prompt-only methods, and such agents have been increasingly adopted in downstream applications. However, existing evaluation of these agents only focuses on qualitative believability (whether human raters think they are accurate), leaving open questions of whether LLM agents can accurately generate step-by-step actions mimicking a particular human's behavior in a multi-turn interaction task. In this work, we take shopping as a case study and present the first large-scale quantitative evaluation of state-of-the-art LLMs' ability to accurately simulate human behavior. Using real-world data from 31,865 online shopping sessions containing 230,965 user actions, our evaluation reveals that prompt-based LLMs (DeepSeek-R1, Llama, Claude) achieve only 11.86% accuracy in generating human actions, highlighting a substantial gap in actual behavioral accuracy. Through experiments, we also showcase that strategies as simple as fine-tuning LLMs on real human click-through data augmented with synthesized reasoning traces can greatly enhance models' performance. The fine-tuned Qwen2.5-7B achieves 17.26% action generation accuracy and 33.86% F1 score on final purchase prediction, representing substantial improvements of 5.4% and 13.85% over prompt-only baselines. This work establishes the first rigorous benchmark for human behavior simulation and provides actionable insights for developing more accurate LLM agents for future downstream applications.
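The two headline metrics are simple to compute; here is a minimal sketch assuming exact-match scoring per action and standard binary F1 on the final purchase label (the function names are ours, not the paper's):

```python
def action_accuracy(pred, gold):
    """Fraction of steps where the generated action exactly matches the human's."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def purchase_f1(pred, gold):
    """Binary F1 for final purchase prediction (1 = purchased)."""
    tp = sum(p == 1 and g == 1 for p, g in zip(pred, gold))
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gold))
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With metrics this strict, a single mismatched action in a session counts against the model, which is part of why exact-match accuracy numbers are so low.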
Dynamic Planning for LLM-based Graphical User Interface Automation
Zhang, Shaoqing, Zhang, Zhuosheng, Chen, Kehai, Ma, Xinbei, Yang, Muyun, Zhao, Tiejun, Zhang, Min
The advent of large language models (LLMs) has spurred considerable interest in advancing autonomous LLM-based agents, particularly in intriguing applications within smartphone graphical user interfaces (GUIs). When presented with a task goal, these agents typically emulate human actions within a GUI environment until the task is completed. However, a key challenge lies in devising effective plans to guide action prediction in GUI tasks, even though planning has been widely recognized as effective for decomposing complex tasks into a series of steps. Specifically, given the dynamic nature of environmental GUIs following action execution, it is crucial to dynamically adapt plans based on environmental feedback and action history. We show that the widely used ReAct approach fails due to excessively long historical dialogues. To address this challenge, we propose a novel approach called Dynamic Planning of Thoughts (D-PoT) for LLM-based GUI agents. D-PoT dynamically adjusts planning based on environmental feedback and execution history. Experimental results reveal that the proposed D-PoT significantly surpasses the strong GPT-4V baseline by +12.7% (34.66% → 47.36%) in accuracy. The analysis highlights the generality of dynamic planning across different backbone LLMs, as well as its benefits in mitigating hallucinations and adapting to unseen tasks. Code is available at https://github.com/sqzhang-lazy/D-PoT.
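The core idea of revising a plan from environmental feedback can be illustrated with a toy loop. This is our own sketch of the general pattern, not the D-PoT algorithm itself:

```python
def revise_plan(plan, feedback):
    """Toy plan update: advance on success, prepend a recovery step on failure."""
    if feedback == "ok":
        return plan[1:]                   # current step succeeded; move on
    return ["re-locate target"] + plan    # step failed; adapt before retrying

plan = ["open app", "tap search", "type query"]
history = []
for fb in ["ok", "fail", "ok", "ok", "ok"]:   # simulated environment feedback
    history.append((plan[0], fb))             # execution history drives revision
    plan = revise_plan(plan, fb)
```

The point of the sketch is that the plan is a mutable object updated after every action, rather than a fixed sequence committed to up front.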
ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data
Shen, Junhong, Jain, Atishay, Xiao, Zedian, Amlekar, Ishan, Hadji, Mouad, Podolny, Aaron, Talwalkar, Ameet
Large Language Model (LLM) agents are rapidly improving to handle increasingly complex web-based tasks. Most of these agents rely on general-purpose, proprietary models like GPT-4 and focus on designing better prompts to improve their planning abilities. However, general-purpose LLMs are not specifically trained to understand specialized web contexts such as HTML, and they often struggle with long-horizon planning. We explore an alternative approach that fine-tunes open-source LLMs using production-scale workflow data collected from over 250 domains corresponding to 6 billion tokens. This simple yet effective approach shows substantial gains over prompting-based agents on existing benchmarks -- ScribeAgent achieves state-of-the-art direct generation performance on Mind2Web and improves the task success rate by 7.3% over the previous best text-only web agents on WebArena. We further perform detailed ablation studies on various fine-tuning design choices and provide insights into LLM selection, training recipes, context window optimization, and effect of dataset sizes.
POS-tagging to highlight the skeletal structure of sentences
The article describes the development of a model for applying partial annotation to text via transfer learning with BERT, covering data preparation and evaluation of the obtained results. The proposed method is found to achieve good results in marking up text.
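Sequence-labeling work of this kind is typically evaluated with entity-level (span-based) F1, as implemented by libraries such as seqeval. A minimal stdlib sketch of that metric for BIO-tagged sequences (our own illustration):

```python
def bio_spans(tags):
    """Extract (start, end, type) entity spans from a BIO tag sequence."""
    spans, start, label = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        if start is not None and tag != f"I-{label}":
            spans.append((start, i, label))
            start = None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans

def span_f1(gold_tags, pred_tags):
    """Entity-level F1: a span counts only if boundaries and type both match."""
    g, p = set(bio_spans(gold_tags)), set(bio_spans(pred_tags))
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)
```

Unlike token-level accuracy, this metric gives no credit for partially overlapping spans, which makes it the stricter and more common choice for NER-style tasks.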
ODRL: A Benchmark for Off-Dynamics Reinforcement Learning
Lyu, Jiafei, Xu, Kang, Xu, Jiacheng, Yan, Mengbei, Yang, Jingwen, Zhang, Zongzhang, Bai, Chenjia, Lu, Zongqing, Li, Xiu
We consider off-dynamics reinforcement learning (RL) where one needs to transfer policies across different domains with dynamics mismatch. Despite the focus on developing dynamics-aware algorithms, this field is hindered due to the lack of a standard benchmark. To bridge this gap, we introduce ODRL, the first benchmark tailored for evaluating off-dynamics RL methods. ODRL contains four experimental settings where the source and target domains can be either online or offline, and provides diverse tasks and a broad spectrum of dynamics shifts, making it a reliable platform to comprehensively evaluate the agent's adaptation ability to the target domain. Furthermore, ODRL includes recent off-dynamics RL algorithms in a unified framework and introduces some extra baselines for different settings, all implemented in a single-file manner. To unpack the true adaptation capability of existing methods, we conduct extensive benchmarking experiments, which show that no method has universal advantages across varied dynamics shifts. We hope this benchmark can serve as a cornerstone for future research endeavors.
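What a dynamics mismatch means in practice can be shown with a toy example (ours, not an ODRL task): the same policy produces different trajectories when a single transition parameter changes between source and target domains.

```python
class LineWorld:
    """Toy 1-D environment; `push` controls the transition dynamics."""
    def __init__(self, push=1.0):
        self.push, self.x = push, 0.0

    def step(self, action):               # action in {-1, +1}
        self.x += self.push * action      # a different `push` = dynamics mismatch
        return self.x, -abs(self.x - 5.0)  # reward: negative distance to goal x = 5

source, target = LineWorld(push=1.0), LineWorld(push=0.5)
for _ in range(5):        # a policy tuned for the source domain: always push right
    source.step(+1)
    target.step(+1)
```

After five steps the source agent sits exactly at the goal while the target agent is only halfway there, which is the adaptation gap off-dynamics methods aim to close.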
MindBench: A Comprehensive Benchmark for Mind Map Structure Recognition and Analysis
Chen, Lei, Yan, Feng, Zhong, Yujie, Chen, Shaoxiang, Jie, Zequn, Ma, Lin
Multimodal Large Language Models (MLLM) have made significant progress in the field of document analysis. Despite this, existing benchmarks typically focus only on extracting text and simple layout information, neglecting the complex interactions between elements in structured documents such as mind maps and flowcharts. To address this issue, we introduce the new benchmark named MindBench, which not only includes meticulously constructed bilingual authentic or synthetic images, detailed annotations, evaluation metrics and baseline models, but also specifically designs five types of structured understanding and parsing tasks. These tasks include full parsing, partial parsing, position-related parsing, structured Visual Question Answering (VQA), and position-related VQA, covering key areas such as text recognition, spatial awareness, relationship discernment, and structured parsing. Extensive experimental results demonstrate the substantial potential and significant room for improvement in current models' ability to handle structured document information. We anticipate that the launch of MindBench will significantly advance research and application development in structured document analysis technology. MindBench is available at: https://miasanlei.github.io/MindBench.github.io/.
A Multimodal Foundation Agent for Financial Trading: Tool-Augmented, Diversified, and Generalist
Zhang, Wentao, Zhao, Lingxuan, Xia, Haochong, Sun, Shuo, Sun, Jiaze, Qin, Molei, Li, Xinyi, Zhao, Yuqing, Zhao, Yilei, Cai, Xinyu, Zheng, Longtao, Wang, Xinrun, An, Bo
Financial trading is a crucial component of the markets, informed by a multimodal information landscape encompassing news, prices, and Kline charts, and spans diverse tasks such as quantitative trading and high-frequency trading with various assets. While advanced AI techniques like deep learning and reinforcement learning are extensively utilized in finance, their application in financial trading tasks often faces challenges due to inadequate handling of multimodal data and limited generalizability across various tasks. To address these challenges, we present FinAgent, a multimodal foundational agent with tool augmentation for financial trading. FinAgent's market intelligence module processes a diverse range of data (numerical, textual, and visual) to accurately analyze the financial market. Its unique dual-level reflection module not only enables rapid adaptation to market dynamics but also incorporates a diversified memory retrieval system, enhancing the agent's ability to learn from historical data and improve decision-making processes. The agent's emphasis on reasoning for actions fosters trust in its financial decisions. Moreover, FinAgent integrates established trading strategies and expert insights, ensuring that its trading approaches are both data-driven and rooted in sound financial principles. With comprehensive experiments on 6 financial datasets, including stocks and crypto, FinAgent significantly outperforms 9 state-of-the-art baselines in terms of 6 financial metrics, with over 36% average improvement on profit. Specifically, a 92.27% return (an 84.39% relative improvement) is achieved on one dataset. Notably, FinAgent is the first advanced multimodal foundation agent designed for financial trading tasks.
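For readers checking the numbers: assuming the standard definition of relative improvement, the reported figures imply a baseline return of roughly 50% on that dataset.

```python
reported_return = 92.27   # % return achieved by FinAgent on the best dataset
relative_gain = 0.8439    # the stated 84.39% relative improvement

# relative improvement = (improved - baseline) / baseline, so:
implied_baseline = reported_return / (1 + relative_gain)
```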
The Good Robot Podcast: Featuring Heather Zheng
Hosted by Eleanor Drage and Kerry Mackereth, The Good Robot is a podcast which explores the many complex intersections between gender, feminism and technology. In this episode, we talk to Heather Zheng, who makes technologies that stop everyday surveillance. These range from bracelets that stop devices from listening in on you to more secure biometric technologies that can protect us by identifying us by, for example, our dance moves. Most famously, Zheng is one of the computer scientists behind Nightshade, which helps artists protect their work by 'poisoning' AI training data sets. Heather is the Neubauer Professor of Computer Science at the University of Chicago.
Machine learning LEGO image recognition: Using virtual data and YOLOv3
I have been working a lot with LEGO and 3D models lately. For my current project I am looking to build a LEGO image recognition program. My ideal scenario is to grab a handful of LEGO, toss them on the table, take a picture, and have the program catalog the pieces. The biggest challenge I encounter with any machine learning project is collecting and formatting the training data. I am pretty sure this is the biggest challenge everyone encounters with machine learning.
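One practical detail for the synthetic-data route: YOLO-family detectors expect one label file per image with normalized bounding boxes, and a 3D renderer can emit these directly because it already knows each brick's pixel box. A minimal sketch of that conversion (the function name is ours):

```python
def yolo_line(cls_id, x0, y0, x1, y1, img_w, img_h):
    """Convert a pixel box (x0, y0)-(x1, y1) to a YOLO label line:
    'class x_center y_center width height', all coordinates normalized to [0, 1]."""
    xc = (x0 + x1) / 2 / img_w
    yc = (y0 + y1) / 2 / img_h
    w = (x1 - x0) / img_w
    h = (y1 - y0) / img_h
    return f"{cls_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"
```

For example, a 200×100-pixel brick rendered in a 640×480 frame becomes a single line in the image's `.txt` label file, so generating thousands of annotated training images is just a loop over renders.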