
Collaborating Authors: Akhtar, Zuhaib


Psychological Metrics for Dialog System Evaluation

arXiv.org Artificial Intelligence

We present metrics for evaluating dialog systems through a psychologically-grounded "human" lens in which conversational agents express a diversity of both states (e.g., emotion) and traits (e.g., personality), just as people do. We introduce five interpretable metrics from established psychology that are fundamental to human communication and relationships: emotional entropy, linguistic style matching, emotion matching, agreeableness, and empathy. These metrics can be applied (1) across dialogs and (2) on turns within dialogs. The psychological metrics are compared against seven state-of-the-art traditional metrics (e.g., BARTScore and BLEURT) on seven standard dialog system data sets. We also introduce a novel data set, the Three Bot Dialog Evaluation Corpus, which consists of annotated conversations from ChatGPT, GPT-3, and BlenderBot. We demonstrate that our proposed metrics offer novel information; they are uncorrelated with traditional metrics, can be used to meaningfully compare dialog systems, and lead to increased accuracy (beyond existing traditional metrics) in predicting crowd-sourced dialog judgments. The interpretability and unique signal of our psychological metrics make them a valuable tool for evaluating and improving dialog systems.
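The abstract does not spell out how these metrics are computed, but the idea behind emotional entropy can be illustrated concretely: treat the emotion labels an agent expresses across its turns as a distribution and take its Shannon entropy. The sketch below is a minimal illustration under that assumption; the per-turn emotion classifier that produces the labels is taken as given rather than implemented, and this is not the paper's exact formulation.

```python
import math
from collections import Counter

# Illustrative sketch of "emotional entropy": the Shannon entropy of the
# distribution of emotion labels an agent expresses across its dialog turns.
# The per-turn emotion labels are assumed to come from an upstream classifier.

def emotional_entropy(turn_emotions: list[str]) -> float:
    """Shannon entropy (bits) of the emotion-label distribution over turns."""
    counts = Counter(turn_emotions)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# An agent that always expresses one emotion has zero entropy;
# a more diverse emotional range yields higher entropy.
flat_agent = ["neutral"] * 8
varied_agent = ["joy", "neutral", "surprise", "joy", "sadness", "neutral", "anger", "joy"]
print(emotional_entropy(flat_agent))    # zero entropy
print(emotional_entropy(varied_agent))  # ~2.16 bits
```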


KEYword based Sampling (KEYS) for Large Language Models

arXiv.org Artificial Intelligence

Question answering (Q/A) can be formulated as a generative task (Mitra, 2017) in which the goal is to generate an answer given the question and the passage (knowledge, if available). Recent advances in QA have focused largely on language model improvements and much less on other areas such as sampling (Krishna et al., 2021; Nakano et al., 2021). Keywords play a very important role for humans in language generation: humans formulate keywords and use grammar to connect them. Very little work in the research community examines how humans generate answers to a question and how this behavior can be incorporated into a language model. In this paper, we explore these two areas together, i.e., how sampling can be used to generate answers that are close to human behavior and factually correct. We argue that the decoding algorithm used for Q/A tasks should therefore also depend on keywords, which can be obtained from the question, the passage, or internet results. We use knowledge distillation techniques to extract keywords and sample using these extracted keywords, on top of vanilla decoding algorithms, when formulating the answer, in order to generate a human-like response. We show that our decoding method outperforms the most commonly used decoding methods for the Q/A task.
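The paper's exact decoding procedure is not reproduced here, but the core idea of sampling "on top of vanilla decoding algorithms" using extracted keywords can be sketched as a logit boost applied to keyword tokens before sampling. The keyword set, the boost magnitude, and the toy vocabulary below are illustrative assumptions, not the authors' implementation.

```python
import math
import random

# Sketch of keyword-biased sampling: before sampling the next token, add a
# constant boost to the logits of tokens belonging to keywords extracted from
# the question/passage. Keyword extraction itself is assumed to happen upstream.

def keyword_biased_sample(logits: dict[str, float],
                          keyword_tokens: set[str],
                          boost: float = 2.0,
                          temperature: float = 1.0) -> str:
    """Sample one token after boosting the logits of keyword tokens."""
    adjusted = {tok: lg + boost if tok in keyword_tokens else lg
                for tok, lg in logits.items()}
    # Softmax over the adjusted logits (shifted by the max for stability).
    z = max(adjusted.values())
    weights = {tok: math.exp((lg - z) / temperature) for tok, lg in adjusted.items()}
    total = sum(weights.values())
    toks, probs = zip(*((t, w / total) for t, w in weights.items()))
    return random.choices(toks, weights=probs)[0]

# Toy example: keywords drawn from the question make on-topic tokens likelier.
logits = {"paris": 1.0, "london": 1.2, "the": 2.0, "capital": 0.5}
print(keyword_biased_sample(logits, keyword_tokens={"paris", "capital"}))
```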


How to Choose How to Choose Your Chatbot: A Massively Multi-System MultiReference Data Set for Dialog Metric Evaluation

arXiv.org Artificial Intelligence

We release MMSMR, a Massively Multi-System MultiReference dataset, to enable future work on metrics and evaluation for dialog. Automatic metrics for dialog evaluation should be robust proxies for human judgments; however, the verification of robustness is currently far from satisfactory. To quantify robustness correlations and understand what is necessary in a test set, we create and release an 8-reference dialog dataset by extending single-reference evaluation sets, and introduce this new language-learning conversation dataset. We then train 1,750 systems and evaluate them on our novel test set and the DailyDialog dataset. We release the novel test set, along with model hyperparameters, inference outputs, and metric scores for each system on a variety of datasets.
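One reason multiple references matter: a single-reference metric penalizes valid responses that happen to differ from the one reference. A common way to use a multi-reference set like this one is to score a hypothesis against each reference and keep the best match. The sketch below uses a simple token-level F1 as a hypothetical stand-in for whichever dialog metric is under evaluation; it is a generic illustration, not the paper's evaluation protocol.

```python
# Sketch of multi-reference scoring: take the max of a per-reference metric.
# Token-level F1 is a stand-in; in practice the same max-over-references
# scheme is applied to real dialog metrics (BLEU, BERTScore, etc.).

def token_f1(hyp: str, ref: str) -> float:
    hyp_toks, ref_toks = set(hyp.lower().split()), set(ref.lower().split())
    overlap = len(hyp_toks & ref_toks)
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_toks)
    recall = overlap / len(ref_toks)
    return 2 * precision * recall / (precision + recall)

def multi_ref_score(hyp: str, references: list[str]) -> float:
    """Score against every reference and keep the best match."""
    return max(token_f1(hyp, ref) for ref in references)

refs = ["I love hiking on weekends.",
        "Hiking is my favorite weekend activity.",
        "Mostly I go hiking."]
print(multi_ref_score("I really enjoy hiking on weekends", refs))
```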


Small-footprint slimmable networks for keyword spotting

arXiv.org Artificial Intelligence

In this work, we present Slimmable Neural Networks applied to the problem of small-footprint keyword spotting. We show that slimmable neural networks allow us to create super-nets from Convolutional Neural Networks and Transformers, from which sub-networks of different sizes can be extracted. We demonstrate the usefulness of these models on in-house voice assistant data and Google Speech Commands, and focus our efforts on models for the on-device use case, limiting ourselves to less than 250k parameters. We show that slimmable ...

Dynamic neural networks are another paradigm in which the network dynamically adapts its computation graph and parameters to different inputs and permits a tradeoff between accuracy and inference efficiency [3]. Another notable work, the Once-for-All (OFA) network, was proposed in [4], which allows one to train a super-network once and derive multiple sub-networks with different resource constraint requirements. OFA also mitigates the large computational cost of conventional neural architecture search (NAS) by decoupling network training and search.
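The mechanics of "extracting sub-networks of different sizes" from one super-net can be illustrated with a width-slimmable layer: a single weight matrix from which narrower sub-networks are obtained by slicing the leading output channels. The PyTorch sketch below is a generic illustration of the slimmable-network idea, not the paper's keyword-spotting architecture; switchable batch normalization and the training recipe are omitted.

```python
import torch
import torch.nn as nn

# Sketch of a slimmable layer: one super-net weight matrix, from which
# sub-networks of different widths are taken by slicing the first k
# output channels. The same parameters serve every width.

class SlimmableLinear(nn.Module):
    def __init__(self, in_features: int, max_out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(max_out_features))

    def forward(self, x: torch.Tensor, width_mult: float = 1.0) -> torch.Tensor:
        # Keep only the leading fraction of output channels.
        k = max(1, int(self.weight.shape[0] * width_mult))
        return nn.functional.linear(x, self.weight[:k], self.bias[:k])

layer = SlimmableLinear(in_features=64, max_out_features=128)
x = torch.randn(4, 64)
for width in (0.25, 0.5, 1.0):  # same parameters, three sub-network sizes
    print(width, layer(x, width_mult=width).shape)
```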