AITopics | eskenazi

Collaborating Authors

eskenazi

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

BoK: Introducing Bag-of-Keywords Loss for Interpretable Dialogue Response Generation

Dey, Suvodip, Desarkar, Maunendra Sankar

arXiv.org Artificial IntelligenceJan-17-2025

The standard language modeling (LM) loss by itself has been shown to be inadequate for effective dialogue modeling. As a result, various training approaches, such as auxiliary loss functions and leveraging human feedback, are being adopted to enrich open-domain dialogue systems. One such auxiliary loss function is Bag-of-Words (BoW) loss, defined as the cross-entropy loss for predicting all the words/tokens of the next utterance. In this work, we propose a novel auxiliary loss named Bag-of-Keywords (BoK) loss to capture the central thought of the response through keyword prediction and leverage it to enhance the generation of meaningful and interpretable responses in open-domain dialogue systems. BoK loss upgrades the BoW loss by predicting only the keywords or critical words/tokens of the next utterance, intending to estimate the core idea rather than the entire response. We incorporate BoK loss in both encoder-decoder (T5) and decoder-only (DialoGPT) architecture and train the models to minimize the weighted sum of BoK and LM (BoK-LM) loss. We perform our experiments on two popular open-domain dialogue datasets, DailyDialog and Persona-Chat. We show that the inclusion of BoK loss improves the dialogue generation of backbone models while also enabling post-hoc interpretability. We also study the effectiveness of BoK-LM loss as a reference-free metric and observe comparable performance to the state-of-the-art metrics on various dialogue evaluation datasets.

computational linguistic, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

doi: 10.18653/v1/2024.sigdial-1.48

2501.10328

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
(13 more...)

Genre: Research Report (0.82)

Industry:

Leisure & Entertainment (0.93)
Media > Film (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.75)

Add feedback

Leveraging LLMs for Dialogue Quality Measurement

Jia, Jinghan, Komma, Abi, Leffel, Timothy, Peng, Xujun, Nagesh, Ajay, Soliman, Tamer, Galstyan, Aram, Kumar, Anoop

arXiv.org Artificial IntelligenceJun-25-2024

In task-oriented conversational AI evaluation, unsupervised methods poorly correlate with human judgments, and supervised approaches lack generalization. Recent advances in large language models (LLMs) show robust zeroshot and few-shot capabilities across NLP tasks. This paper explores using LLMs for automated dialogue quality evaluation, experimenting with various configurations on public and proprietary datasets. Manipulating factors such as model size, in-context examples, and selection techniques, we examine "chain-of-thought" (CoT) reasoning and label extraction procedures. Our results show that (1) larger models yield more accurate dialogue labels; (2) algorithmic selection of in-context examples outperforms random selection; (3) CoT reasoning where an LLM is asked to provide justifications before outputting final labels improves performance; and (4) fine-tuned LLMs outperform out-of-the-box ones. Our results indicate that LLMs that are suitably fine-tuned and have sufficient reasoning capabilities can be leveraged for automated dialogue evaluation.

dataset, dialogue evaluation, evaluation, (15 more...)

arXiv.org Artificial Intelligence

2406.17304

Country:

North America > United States > Michigan (0.04)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Simple LLM Prompting is State-of-the-Art for Robust and Multilingual Dialogue Evaluation

Mendonça, John, Pereira, Patrícia, Moniz, Helena, Carvalho, João Paulo, Lavie, Alon, Trancoso, Isabel

arXiv.org Artificial IntelligenceSep-8-2023

Despite significant research effort in the development of automatic dialogue evaluation metrics, little thought is given to evaluating dialogues other than in English. At the same time, ensuring metrics are invariant to semantically similar responses is also an overlooked topic. In order to achieve the desired properties of robustness and multilinguality for dialogue evaluation metrics, we propose a novel framework that takes advantage of the strengths of current evaluation models with the newly-established paradigm of prompting Large Language Models (LLMs). Empirical results show our framework achieves state of the art results in terms of mean Spearman correlation scores across several benchmarks and ranks first place on both the Robust and Multilingual tasks of the DSTC11 Track 4 "Automatic Evaluation Metrics for Open-Domain Dialogue Systems", proving the evaluation capabilities of prompted LLMs.

computational linguistic, evaluation, proceedings, (13 more...)

arXiv.org Artificial Intelligence

2308.16797

Country:

Europe > Portugal > Lisbon > Lisbon (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Czechia > Prague (0.04)
(10 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Approximating Online Human Evaluation of Social Chatbots with Prompting

Svikhnushina, Ekaterina, Pu, Pearl

arXiv.org Artificial IntelligenceAug-25-2023

As conversational models become increasingly available to the general public, users are engaging with this technology in social interactions. Such unprecedented interaction experiences may pose considerable social and psychological risks to the users unless the technology is properly controlled. This highlights the need for scalable and robust evaluation metrics for conversational chatbots. Existing evaluation metrics aim to automate offline user evaluation and approximate human judgment of pre-curated dialogs. However, they are limited in their ability to capture subjective perceptions of users who actually interact with the bots and might not generalize to real-world settings. To address this limitation, we propose an approach to approximate online human evaluation leveraging large language models (LLMs) from the GPT family. We introduce a new Dialog system Evaluation framework based on Prompting (DEP), which enables a fully automatic evaluation pipeline that replicates live user studies and achieves an impressive correlation with human judgment (up to Pearson r=0.95 on a system level). The DEP approach involves collecting synthetic chat logs of evaluated bots with an LLM in the other-play setting, where the LLM is carefully conditioned to follow a specific scenario. We further explore different prompting approaches to produce evaluation scores with the same LLM. The best performing prompts, which contain few-shot demonstrations and instructions, show outstanding performance on the tested dataset and demonstrate the ability to generalize to other dialog corpora.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2304.05253

Country:

Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Oceania > Australia (0.04)
North America > United States > Texas (0.04)
(6 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Three Ways of Using Large Language Models to Evaluate Chat

Plátek, Ondřej, Hudeček, Vojtěch, Schmidtová, Patricia, Lango, Mateusz, Dušek, Ondřej

arXiv.org Artificial IntelligenceAug-12-2023

This paper describes the systems submitted by team6 for ChatEval, the DSTC 11 Track 4 competition. We present three different approaches to predicting turn-level qualities of chatbot responses based on large language models (LLMs). We report improvement over the baseline using dynamic few-shot examples from a vector store for the prompts for ChatGPT. We also analyze the performance of the other two approaches and report needed improvements for future work. We developed the three systems over just two weeks, showing the potential of LLMs for this task. An ablation study conducted after the challenge deadline shows that the new Llama 2 models are closing the performance gap between ChatGPT and open-source LLMs. However, we find that the Llama 2 models do not benefit from few-shot examples in the same way as ChatGPT.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2308.06502

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
(3 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

AnyTOD: A Programmable Task-Oriented Dialog System

Zhao, Jeffrey, Cao, Yuan, Gupta, Raghav, Lee, Harrison, Rastogi, Abhinav, Wang, Mingqiu, Soltau, Hagen, Shafran, Izhak, Wu, Yonghui

arXiv.org Artificial IntelligenceFeb-13-2023

We propose AnyTOD, an end-to-end, zero-shot task-oriented dialog (TOD) system capable of handling unseen tasks without task-specific training. We view TOD as a program executed by a language model (LM), where program logic and ontology is provided by a designer as a schema. To enable generalization to unseen schemas and programs without prior training, AnyTOD adopts a neuro-symbolic approach. A neural LM keeps track of events occurring during a conversation and a symbolic program implementing the dialog policy is executed to recommend next actions AnyTOD should take. This approach drastically reduces data annotation and model training requirements, addressing the enduring challenge of rapidly adapting a TOD system to unseen tasks and domains. We demonstrate state-of-the-art results on STAR, ABCD and SGD benchmarks. We also demonstrate strong zero-shot transfer ability in low-resource settings, such as zero-shot on MultiWOZ. In addition, we release STARv2, an updated version of the STAR dataset with richer annotations, for benchmarking zero-shot end-to-end TOD models.

artificial intelligence, large language model, natural language, (19 more...)

arXiv.org Artificial Intelligence

2212.09939

Country:

North America > United States (0.14)
Asia > Middle East > UAE > Dubai Emirate > Dubai (0.04)
Asia > Singapore (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.64)

Add feedback

Open-Domain Dialog Evaluation using Follow-Ups Likelihood

De Bruyn, Maxime, Lotfi, Ehsan, Buhmann, Jeska, Daelemans, Walter

arXiv.org Artificial IntelligenceSep-12-2022

Automatic evaluation of open-domain dialogs remains an unsolved problem. Moreover, existing methods do not correlate strongly with human annotations. This paper presents a new automated evaluation method using follow-ups: we measure the probability that a language model will continue the conversation with a fixed set of follow-ups (e.g., not really relevant here, what are you trying to say). When compared against twelve existing methods, our new evaluation achieves the highest correlation with human evaluations.

artificial intelligence, chatbot, natural language, (18 more...)

arXiv.org Artificial Intelligence

2209.05185

Country:

North America > United States > Pennsylvania (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Europe > Belgium > Flanders > Antwerp Province > Antwerp (0.04)
Asia > Singapore (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.48)

Add feedback

Interactive Evaluation of Dialog Track at DSTC9

Mehri, Shikib, Feng, Yulan, Gordon, Carla, Alavi, Seyed Hossein, Traum, David, Eskenazi, Maxine

arXiv.org Artificial IntelligenceJul-28-2022

The ultimate goal of dialog research is to develop systems that can be effectively used in interactive settings by real users. To this end, we introduced the Interactive Evaluation of Dialog Track at the 9th Dialog System Technology Challenge. This track consisted of two sub-tasks. The first sub-task involved building knowledge-grounded response generation models. The second sub-task aimed to extend dialog models beyond static datasets by assessing them in an interactive setting with real users. Our track challenges participants to develop strong response generation models and explore strategies that extend them to back-and-forth interactions with real users. The progression from static corpora to interactive evaluation introduces unique challenges and facilitates a more thorough assessment of open-domain dialog systems. This paper provides an overview of the track, including the methodology and results. Furthermore, it provides insights into how to best evaluate open-domain dialog models

dialog, evaluation, human evaluation, (13 more...)

arXiv.org Artificial Intelligence

2207.14403

Country:

North America > United States > California (0.14)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.88)

Add feedback

Eskenazi Named International Speech Communication Association Fellow

CMU School of Computer ScienceApr-13-2021, 15:56:27 GMT

Speech processing research is at a high right now, with virtual assistants like Alexa, Siri, Google and others always listening and willing to help. But without a keen eye -- or ear -- for who this technology aims to assist, interest could wane, said Maxine Eskenazi, a Carnegie Mellon University researcher in the School of Computer Science who has worked on speech processing and spoken dialogue systems for decades. "We need to stop focusing on the agent and start focusing on the user," Eskenazi said. "It's only a dialogue if there are two individuals participating. If we make systems that are just fun for us to make but do not serve the user and do not help the user, then they'll stop using Alexa or Google."

eskenazi, international speech communication association fellow, isca, (3 more...)

CMU School of Computer Science

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.40)
Europe > Czechia > South Moravian Region > Brno (0.06)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.63)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.58)

Add feedback

Structured Attention for Unsupervised Dialogue Structure Induction

Qiu, Liang, Zhao, Yizhou, Shi, Weiyan, Liang, Yuan, Shi, Feng, Yuan, Tao, Yu, Zhou, Zhu, Song-Chun

arXiv.org Artificial IntelligenceSep-17-2020

Inducing a meaningful structural representation from one or a set of dialogues is a crucial but challenging task in computational linguistics. Advancement made in this area is critical for dialogue system design and discourse analysis. It can also be extended to solve grammatical inference. In this work, we propose to incorporate structured attention layers into a Variational Recurrent Neural Network (VRNN) model with discrete latent states to learn dialogue structure in an unsupervised fashion. Compared to a vanilla VRNN, structured attention enables a model to focus on different parts of the source sentence embeddings while enforcing a structural inductive bias. Experiments show that on two-party dialogue datasets, VRNN with structured attention learns semantic structures that are similar to templates used to generate this dialogue corpus. While on multi-party dialogue datasets, our model learns an interactive structure demonstrating its capability of distinguishing speakers or addresses, automatically disentangling dialogues without explicit human annotation.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2009.08552

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Oceania > Australia > Victoria > Melbourne (0.04)
(8 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.89)
(2 more...)

Add feedback