 dialogpt


Fine-Tuning DialoGPT on Common Diseases in Rural Nepal for Medical Conversations

Poudel, Birat, Ghimire, Satyam, Prasad, Er. Prakash Chandra

arXiv.org Artificial Intelligence

Conversational agents are increasingly being explored to support healthcare delivery, particularly in resource-constrained settings such as rural Nepal. Large-scale conversational models typically rely on internet connectivity and cloud infrastructure, which may not be accessible in rural areas. In this study, we fine-tuned DialoGPT, a lightweight generative dialogue model that can operate offline, on a synthetically constructed dataset of doctor-patient interactions covering ten common diseases prevalent in rural Nepal, including common cold, seasonal fever, diarrhea, typhoid fever, gastritis, food poisoning, malaria, dengue fever, tuberculosis, and pneumonia. Despite being trained on a limited, domain-specific dataset, the fine-tuned model produced coherent, contextually relevant, and medically appropriate responses, demonstrating an understanding of symptoms, disease context, and empathetic communication. These results highlight the adaptability of compact, offline-capable dialogue models and the effectiveness of targeted datasets for domain adaptation in low-resource healthcare environments, offering promising directions for future rural medical conversational AI.


ECO Decoding: Entropy-Based Control for Controllability and Fluency in Controllable Dialogue Generation

Shin, Seungmin, Kim, Dooyoung, Ko, Youngjoong

arXiv.org Artificial Intelligence

Controllable Dialogue Generation (CDG) enables chatbots to generate responses with desired attributes, and weighted decoding methods have achieved significant success in the CDG task. However, using a fixed constant value to manage the bias of attribute probabilities makes it challenging to find an ideal control strength that satisfies both controllability and fluency. To address this issue, we propose ECO decoding (Entropy-based COntrol), which dynamically adjusts the control strength at each generation step according to the model's entropy in both the language model and attribute classifier probability distributions. Experiments on the DailyDialog and MultiWOZ datasets demonstrate that ECO decoding consistently improves controllability while maintaining fluency and grammaticality, outperforming prior decoding methods across various models and settings. Furthermore, ECO decoding alleviates probability interpolation issues in multi-attribute generation and consequently demonstrates strong performance in both single and multi-attribute scenarios.
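To make the idea of a step-wise control strength concrete, here is a minimal sketch of weighted decoding in which the strength is recomputed at every step from two entropies. The abstract does not give the ECO formula, so the rule used here (strong control when the language model is uncertain and the attribute scores are decisive) is purely an illustrative assumption, as are the names `entropy_controlled_step`, `attr_scores`, and `max_strength`.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a discrete distribution, scaled to [0, 1]."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def entropy_controlled_step(lm_probs, attr_scores, max_strength=2.0):
    """One weighted-decoding step with an entropy-dependent control strength.

    lm_probs:    language-model probability of each candidate token.
    attr_scores: classifier probability that the desired attribute holds
                 if that token is chosen (assumed to have a positive sum).
    The strength rule below is a hypothetical heuristic, not the ECO formula.
    """
    # Turn the per-token attribute scores into a distribution over candidates.
    total_attr = sum(attr_scores)
    attr_dist = [s / total_attr for s in attr_scores]

    h_lm = normalized_entropy(lm_probs)
    h_attr = normalized_entropy(attr_dist)

    # Assumed rule: control more when the LM is unsure (many fluent options)
    # and the classifier clearly prefers some tokens over others.
    strength = max_strength * h_lm * (1.0 - h_attr)

    # Standard weighted decoding: p(token) is proportional to
    # P_LM(token) * P(attribute | token) ** strength.
    weighted = [p * (s ** strength) for p, s in zip(lm_probs, attr_scores)]
    total = sum(weighted)
    return [w / total for w in weighted], strength
```

With a fixed strength, the exponent would be the same constant at every step; in this sketch a confident language model (entropy near zero) drives the strength toward zero, so the step falls back to the unmodified language-model distribution.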


Review for NeurIPS paper: Zero-Resource Knowledge-Grounded Dialogue Generation

Neural Information Processing Systems

Weaknesses: - It is hard to judge whether the proposed method achieves good results because of the proposed learning method or because of the strong pretrained UniLM model. Even though the authors compare it with DialoGPT in the appendix, I would also like to see the model's performance without UniLM initialization, or a DialoGPT fine-tuned on the proposed dataset (e.g., Reddit conversations with top-1 retrieved knowledge). How do you select knowledge for ITDD? (ii) All the examples and details of the human evaluation say that the authors use ground-truth knowledge. Do all the models use GT knowledge at test time, or do they use the top-10 retrieved knowledge from the Lucene knowledge retriever? If so, the performance of some baselines would be revised.


Dialogue Language Model with Large-Scale Persona Data Engineering

Hong, Mengze, Zhang, Chen, Chen, Chaotao, Lian, Rongzhong, Jiang, Di

arXiv.org Artificial Intelligence

Maintaining persona consistency is paramount in the application of open-domain dialogue systems, as exemplified by models like ChatGPT. Despite significant advancements, the limited scale and diversity of current persona dialogue datasets remain challenges to achieving robust persona-consistent dialogue models. In this study, drawing inspiration from the success of large-scale pre-training, we introduce PPDS, an open-domain persona dialogue system that employs extensive generative pre-training on a persona dialogue dataset to enhance persona consistency. Specifically, we present a persona extraction model designed to autonomously and precisely generate vast persona dialogue datasets. Additionally, we unveil a pioneering persona augmentation technique to address the invalid persona bias inherent in the constructed dataset. Both quantitative and human evaluations consistently highlight the superior response quality and persona consistency of our proposed model, underscoring its effectiveness.


Experimental Evaluation of Machine Learning Models for Goal-oriented Customer Service Chatbot with Pipeline Architecture

Isa, Nurul Ain Nabilah Mohd, Jawaddi, Siti Nuraishah Agos, Ismail, Azlan

arXiv.org Artificial Intelligence

Integrating machine learning (ML) into customer service chatbots enhances their ability to understand and respond to user queries, ultimately improving service performance. However, such chatbots may appear artificial to some users, which can affect customer experience. Hence, meticulous evaluation of ML models for each pipeline component is crucial for optimizing performance, though differences in functionalities can lead to unfair comparisons. In this paper, we present a tailored experimental evaluation approach for goal-oriented customer service chatbots with pipeline architecture, focusing on three key components: Natural Language Understanding (NLU), dialogue management (DM), and Natural Language Generation (NLG). Our methodology emphasizes individual assessment to determine optimal ML models. Specifically, we focus on optimizing hyperparameters and evaluating candidate models for NLU (utilizing BERT and LSTM), DM (employing DQN and DDQN), and NLG (leveraging GPT-2 and DialoGPT). The results show that for the NLU component, BERT excelled in intent detection whereas LSTM was superior for slot filling. For the DM component, the DDQN model outperformed DQN by achieving fewer turns, higher rewards, and greater success rates. For NLG, the large language model GPT-2 surpassed DialoGPT in BLEU, METEOR, and ROUGE metrics. These findings aim to provide a benchmark for future research in developing and optimizing customer service chatbots, offering valuable insights into model performance and optimal hyperparameters.
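For readers unfamiliar with the two dialogue-management candidates, the difference between DQN and DDQN lies in how the bootstrap target is computed. The sketch below shows the standard textbook formulations; the function names, the `gamma` default, and the example values are illustrative assumptions, and the paper's actual training setup is not described in this abstract.

```python
def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """Standard DQN target: the target network both selects and
    evaluates the best next action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def ddqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Standard Double DQN target: the online network selects the next
    action, and the target network evaluates that action."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


# Hypothetical Q-values for three dialogue actions in the next state.
online_q = [1.0, 2.0, 0.5]
target_q = [3.0, 1.5, 0.2]
dqn_value = dqn_target(1.0, target_q, gamma=0.9)
ddqn_value = ddqn_target(1.0, online_q, target_q, gamma=0.9)
```

Decoupling action selection from action evaluation is the usual explanation for why Double DQN tends to overestimate Q-values less than DQN.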


StyEmp: Stylizing Empathetic Response Generation via Multi-Grained Prefix Encoder and Personality Reinforcement

Fu, Yahui, Chu, Chenhui, Kawahara, Tatsuya

arXiv.org Artificial Intelligence

Recent approaches for empathetic response generation mainly focus on emotional resonance and user understanding, without considering the system's personality. Consistent personality is evident in real human expression and is important for creating trustworthy systems. To address this problem, we propose StyEmp, which aims to stylize the empathetic response generation with a consistent personality. Specifically, it incorporates a multi-grained prefix mechanism designed to capture the intricate relationship between a system's personality and its empathetic expressions. Furthermore, we introduce a personality reinforcement module that leverages contrastive learning to calibrate the generation model, ensuring that responses are both empathetic and reflective of a distinct personality. Automatic and human evaluations on the EMPATHETICDIALOGUES benchmark show that StyEmp outperforms competitive baselines in terms of both empathy and personality expressions.


On Zero-Shot Counterspeech Generation by LLMs

Saha, Punyajoy, Agrawal, Aalok, Jana, Abhik, Biemann, Chris, Mukherjee, Animesh

arXiv.org Artificial Intelligence

With the emergence of numerous Large Language Models (LLMs), the usage of such models in various Natural Language Processing (NLP) applications is increasing extensively. Counterspeech generation is one such key task where efforts are made to develop generative models by fine-tuning LLMs with hatespeech-counterspeech pairs, but none of these attempts explores the intrinsic properties of large language models in zero-shot settings. In this work, we present a comprehensive analysis of the performance of four LLMs, namely GPT-2, DialoGPT, ChatGPT, and FlanT5, in zero-shot settings for counterspeech generation, which is the first of its kind. For GPT-2 and DialoGPT, we further investigate the deviation in performance with respect to the sizes (small, medium, large) of the models. In addition, we propose three different prompting strategies for generating different types of counterspeech and analyse the impact of such strategies on the performance of the models. Our analysis shows that generation quality improves for two datasets (17%); however, toxicity also increases (25%) as model size increases. Considering the type of model, GPT-2 and FlanT5 are significantly better in terms of counterspeech quality but also have higher toxicity than DialoGPT. ChatGPT is much better at generating counterspeech than the other models across all metrics. In terms of prompting, we find that our proposed strategies help improve counterspeech generation across all the models.


An Empirical Bayes Framework for Open-Domain Dialogue Generation

Lee, Jing Yang, Lee, Kong Aik, Gan, Woon-Seng

arXiv.org Artificial Intelligence

To engage human users in meaningful conversation, open-domain dialogue agents are required to generate diverse and contextually coherent dialogue. Despite recent advancements, which can be attributed to the usage of pretrained language models, the generation of diverse and coherent dialogue remains an open research problem. A popular approach to address this issue involves the adaptation of variational frameworks. However, while these approaches successfully improve diversity, they tend to compromise on contextual coherence. Hence, we propose the Bayesian Open-domain Dialogue with Empirical Bayes (BODEB) framework, an empirical Bayes framework for constructing a Bayesian open-domain dialogue agent by leveraging pretrained parameters to inform the prior and posterior parameter distributions. Empirical results show that BODEB achieves better results in terms of both diversity and coherence compared to variational frameworks.


Towards a Unified Conversational Recommendation System: Multi-task Learning via Contextualized Knowledge Distillation

Jung, Yeongseo, Jung, Eunseo, Chen, Lei

arXiv.org Artificial Intelligence

In a Conversational Recommendation System (CRS), an agent is asked to recommend a set of items to users within natural language conversations. To address the need for both conversational capability and personalized recommendations, prior works have utilized separate recommendation and dialogue modules. However, such an approach inevitably results in a discrepancy between recommendation results and generated responses. To bridge the gap, we propose a multi-task learning approach for a unified CRS, where a single model jointly learns both tasks via Contextualized Knowledge Distillation (ConKD). We introduce two versions of ConKD: hard gate and soft gate. The former selectively gates between two task-specific teachers, while the latter integrates knowledge from both teachers. Our gates are computed on-the-fly in a context-specific manner, facilitating flexible integration of relevant knowledge. Extensive experiments demonstrate that our single model significantly improves recommendation performance while enhancing fluency, and achieves comparable results in terms of diversity.
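The contrast between the hard and soft gates can be sketched as two ways of building a distillation target from two teacher distributions. This is only an illustration under stated assumptions: the abstract does not say how the gate value is computed, so `gate` is taken as a given number in [0, 1], and the function names, the 0.5 threshold, and the KL-based loss are hypothetical choices rather than the authors' implementation.

```python
import math


def hard_gate(rec_teacher, dial_teacher, gate):
    """Pick exactly one teacher's distribution (assumed threshold: 0.5)."""
    return list(rec_teacher) if gate >= 0.5 else list(dial_teacher)


def soft_gate(rec_teacher, dial_teacher, gate):
    """Blend both teachers' distributions with weight `gate`."""
    return [gate * r + (1.0 - gate) * d
            for r, d in zip(rec_teacher, dial_teacher)]


def kl_divergence(target, student, eps=1e-12):
    """KL(target || student), used here as a generic distillation loss."""
    return sum(t * math.log(t / max(s, eps))
               for t, s in zip(target, student) if t > 0.0)


# Hypothetical next-token distributions over a three-word vocabulary.
rec_teacher = [0.7, 0.2, 0.1]    # recommendation teacher
dial_teacher = [0.1, 0.3, 0.6]   # dialogue teacher
student = [0.3, 0.3, 0.4]

hard_target = hard_gate(rec_teacher, dial_teacher, gate=0.8)
soft_target = soft_gate(rec_teacher, dial_teacher, gate=0.5)
hard_loss = kl_divergence(hard_target, student)
soft_loss = kl_divergence(soft_target, student)
```

In this sketch the hard gate commits the student to one teacher at a given step, while the soft gate lets it learn from a mixture of the two.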


MDDial: A Multi-turn Differential Diagnosis Dialogue Dataset with Reliability Evaluation

Macherla, Srija, Luo, Man, Parmar, Mihir, Baral, Chitta

arXiv.org Artificial Intelligence

Dialogue systems for Automatic Differential Diagnosis (ADD) have a wide range of real-life applications. These dialogue systems are promising for providing easy access and reducing medical costs. Building end-to-end ADD dialogue systems requires dialogue training datasets. However, to the best of our knowledge, there is no publicly available ADD dialogue dataset in English (although non-English datasets exist). Driven by this, we introduce MDDial, the first differential diagnosis dialogue dataset in English, which can help build and evaluate end-to-end ADD dialogue systems. Additionally, earlier studies present the accuracy of diagnosis and symptoms either individually or as a combined weighted score. This method overlooks the connection between the symptoms and the diagnosis. We introduce a unified score for the ADD system that takes into account the interplay between symptoms and diagnosis. This score also indicates the system's reliability. To this end, we train two moderate-size language models on MDDial. Our experiments suggest that while these language models can perform well on many natural language understanding tasks, including dialogue tasks in the general domain, they struggle to relate relevant symptoms and disease and thus have poor performance on MDDial. MDDial will be released publicly to aid the study of ADD dialogue research.