MMWOZ: Building Multimodal Agent for Task-oriented Dialogue

Yang, Pu-Hai; Huang, Heyan; Xu, Heng-Da; Sun, Fanshu; Mao, Xian-Ling; Mu, Chaoxu

arXiv.org Artificial Intelligence 

Task-oriented dialogue systems aim to accomplish user goals through natural language communication; these goals are often complex and require multiple dialogue turns to complete [1-3]. For instance, when helping a user book air tickets, a task-oriented dialogue system converses with the user to gather information such as the departure place, destination, and departure time. Once sufficient information is obtained, the system automatically handles the booking process. The convenience of this natural language interaction has led to growing interest in task-oriented dialogue systems in recent years [4-6].

Traditionally, task-oriented dialogue systems are modeled as intelligent agents that access back-end APIs to acquire knowledge from a database [7-9] and use this knowledge to help users complete various tasks. These agents follow a pipeline in their dialogue with users: predict the user's intention, extract slot values from the user's utterance, call an API to access the database, and respond to the user [10-15]. For example, as shown in Figure 1, when a user wants to book a restaurant, the user spends the first 4 turns seeking a restaurant that meets specific requirements, prompting the agent to call the "find_restaurant" API. In the last 2 turns, the user provides detailed reservation information, leading the agent to call the "book_restaurant" API. However, in real-world scenarios, the availability of customized APIs for building practical task-oriented dialogue systems is limited, primarily due to two reasons.
