Goto

Collaborating Authors

 Parthasarathi, Prasanna


Resona: Improving Context Copying in Linear Recurrence Models with Retrieval

arXiv.org Artificial Intelligence

Recent shifts in the space of large language model (LLM) research have shown an increasing focus on novel architectures to compete with prototypical Transformer-based models that have long dominated this space. Linear recurrent models have proven to be a viable competitor due to their computational efficiency. However, such models still demonstrate a sizable gap compared to Transformers in terms of in-context learning among other tasks that require recalling information from a context. In this work, we introduce __Resona__, a simple and scalable framework for augmenting linear recurrent models with retrieval. __Resona__~augments models with the ability to integrate retrieved information from the provided input context, enabling tailored behavior to diverse task requirements. Experiments on a variety of linear recurrent models demonstrate that __Resona__-augmented models observe significant performance gains on a variety of synthetic as well as real-world natural language tasks, highlighting its ability to act as a general purpose method to improve the in-context learning and language modeling abilities of linear recurrent LLMs.


Do Robot Snakes Dream like Electric Sheep? Investigating the Effects of Architectural Inductive Biases on Hallucination

arXiv.org Artificial Intelligence

The growth in prominence of large language models (LLMs) in everyday life can be largely attributed to their generative abilities, yet some of this is also owed to the risks and costs associated with their use. On one front is their tendency to \textit{hallucinate} false or misleading information, limiting their reliability. On another is the increasing focus on the computational limitations associated with traditional self-attention based LLMs, which has brought about new alternatives, in particular recurrent models, meant to overcome them. Yet it remains uncommon to consider these two concerns simultaneously. Do changes in architecture exacerbate/alleviate existing concerns about hallucinations? Do they affect how and where they occur? Through an extensive evaluation, we study how these architecture-based inductive biases affect the propensity to hallucinate. While hallucination remains a general phenomenon not limited to specific architectures, the situations in which they occur and the ease with which specific types of hallucinations can be induced can significantly differ based on the model architecture. These findings highlight the need for better understanding both these problems in conjunction with each other, as well as consider how to design more universal techniques for handling hallucinations.


CHARP: Conversation History AwaReness Probing for Knowledge-grounded Dialogue Systems

arXiv.org Artificial Intelligence

In this work, we dive deep into one of the popular knowledge-grounded dialogue benchmarks that focus on faithfulness, FaithDial. We show that a significant portion of the FaithDial data contains annotation artifacts, which may bias models towards completely ignoring the conversation history. We therefore introduce CHARP, a diagnostic test set, designed for an improved evaluation of hallucinations in conversational model. CHARP not only measures hallucination but also the compliance of the models to the conversation task. Our extensive analysis reveals that models primarily exhibit poor performance on CHARP due to their inability to effectively attend to and reason over the conversation history. Furthermore, the evaluation methods of FaithDial fail to capture these shortcomings, neglecting the conversational history. Our findings indicate that there is substantial room for contribution in both dataset creation and hallucination evaluation for knowledge-grounded dialogue, and that CHARP can serve as a tool for monitoring the progress in this particular research area. CHARP is publicly available at https://huggingface.co/datasets/huawei-noah/CHARP


Towards Practical Tool Usage for Continually Learning LLMs

arXiv.org Artificial Intelligence

Large language models (LLMs) show an innate skill for solving language based tasks. But insights have suggested an inability to adjust for information or task-solving skills becoming outdated, as their knowledge, stored directly within their parameters, remains static in time. Tool use helps by offloading work to systems that the LLM can access through an interface, but LLMs that use them still must adapt to nonstationary environments for prolonged use, as new tools can emerge and existing tools can change. Nevertheless, tools require less specialized knowledge, therefore we hypothesize they are better suited for continual learning (CL) as they rely less on parametric memory for solving tasks and instead focus on learning when to apply pre-defined tools. To verify this, we develop a synthetic benchmark and follow this by aggregating existing NLP tasks to form a more realistic testing scenario. While we demonstrate scaling model size is not a solution, regardless of tool usage, continual learning techniques can enable tool LLMs to both adapt faster while forgetting less, highlighting their potential as continual learners.


Measuring the Knowledge Acquisition-Utilization Gap in Pretrained Language Models

arXiv.org Artificial Intelligence

While pre-trained language models (PLMs) have shown evidence of acquiring vast amounts of knowledge, it remains unclear how much of this parametric knowledge is actually usable in performing downstream tasks. We propose a systematic framework to measure parametric knowledge utilization in PLMs. Our framework first extracts knowledge from a PLM's parameters and subsequently constructs a downstream task around this extracted knowledge. Performance on this task thus depends exclusively on utilizing the model's possessed knowledge, avoiding confounding factors like insufficient signal. As an instantiation, we study factual knowledge of PLMs and measure utilization across 125M to 13B parameter PLMs. We observe that: (1) PLMs exhibit two gaps - in acquired vs. utilized knowledge, (2) they show limited robustness in utilizing knowledge under distribution shifts, and (3) larger models close the acquired knowledge gap but the utilized knowledge gap remains. Overall, our study provides insights into PLMs' capabilities beyond their acquired knowledge.


Deep Learning on a Healthy Data Diet: Finding Important Examples for Fairness

arXiv.org Artificial Intelligence

Data-driven predictive solutions predominant in commercial applications tend to suffer from biases and stereotypes, which raises equity concerns. Prediction models may discover, use, or amplify spurious correlations based on gender or other protected personal characteristics, thus discriminating against marginalized groups. Mitigating gender bias has become an important research focus in natural language processing (NLP) and is an area where annotated corpora are available. Data augmentation reduces gender bias by adding counterfactual examples to the training dataset. In this work, we show that some of the examples in the augmented dataset can be not important or even harmful for fairness. We hence propose a general method for pruning both the factual and counterfactual examples to maximize the model's fairness as measured by the demographic parity, equality of opportunity, and equality of odds. The fairness achieved by our method surpasses that of data augmentation on three text classification datasets, using no more than half of the examples in the augmented dataset. Our experiments are conducted using models of varying sizes and pre-training settings.


Demystifying Neural Language Models' Insensitivity to Word-Order

arXiv.org Artificial Intelligence

Recent research analyzing the sensitivity of natural language understanding models to word-order perturbations have shown that the state-of-the-art models in several language tasks may have a unique way to understand the text that could seldom be explained with conventional syntax and semantics. In this paper, we investigate the insensitivity of natural language models to word-order by quantifying perturbations and analysing their effect on neural models' performance on language understanding tasks in GLUE benchmark. Towards that end, we propose two metrics - the Direct Neighbour Displacement (DND) and the Index Displacement Count (IDC) - that score the local and global ordering of tokens in the perturbed texts and observe that perturbation functions found in prior literature affect only the global ordering while the local ordering remains relatively unperturbed. We propose perturbations at the granularity of sub-words and characters to study the correlation between DND, IDC and the performance of neural language models on natural language tasks. We find that neural language models - pretrained and non-pretrained Transformers, LSTMs, and Convolutional architectures - require local ordering more so than the global ordering of tokens. The proposed metrics and the suite of perturbations allow a systematic way to study the (in)sensitivity of neural language understanding models to varying degree of perturbations.


On Task-Level Dialogue Composition of Generative Transformer Model

arXiv.org Artificial Intelligence

Task-oriented dialogue systems help users accomplish tasks such as booking a movie ticket and ordering food via conversation. Generative models parameterized by a deep neural network are widely used for next turn response generation in such systems. It is natural for users of the system to want to accomplish multiple tasks within the same conversation, but the ability of generative models to compose multiple tasks is not well studied. In this work, we begin by studying the effect of training human-human task-oriented dialogues towards improving the ability to compose multiple tasks on Transformer generative models. To that end, we propose and explore two solutions: (1) creating synthetic multiple task dialogue data for training from human-human single task dialogue and (2) forcing the encoder representation to be invariant to single and multiple task dialogues using an auxiliary loss. The results from our experiments highlight the difficulty of even the sophisticated variant of transformer model in learning to compose multiple tasks from single task dialogues.


How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for Token-level Evaluation Metrics

arXiv.org Artificial Intelligence

Though generative dialogue modeling is widely seen as a language modeling task, the task demands an agent to have a complex natural language understanding of its input text to carry a meaningful interaction with an user. The automatic metrics used evaluate the quality of the generated text as a proxy to the holistic interaction of the agent. Such metrics were earlier shown to not correlate with the human judgement. In this work, we observe that human evaluation of dialogue agents can be inconclusive due to the lack of sufficient information for appropriate evaluation. The automatic metrics are deterministic yet shallow and human evaluation can be relevant yet inconclusive. To bridge this gap in evaluation, we propose designing a set of probing tasks to evaluate dialogue models. The hand-crafted tasks are aimed at quantitatively evaluating a generative dialogue model's understanding beyond the token-level evaluation on the generated text. The probing tasks are deterministic like automatic metrics and requires human judgement in their designing; benefiting from the best of both worlds. With experiments on probe tasks we observe that, unlike RNN based architectures, transformer model may not be learning to comprehend the input text despite its generated text having higher overlap with the target text.


Extending Neural Generative Conversational Model using External Knowledge Sources

arXiv.org Artificial Intelligence

The use of connectionist approaches in conversational agents has been progressing rapidly due to the availability of large corpora. However current generative dialogue models often lack coherence and are content poor. This work proposes an architecture to incorporate unstructured knowledge sources to enhance the next utterance prediction in chit-chat type of generative dialogue models. We focus on Sequence-to-Sequence (Seq2Seq) conversational agents trained with the Reddit News dataset, and consider incorporating external knowledge from Wikipedia summaries as well as from the NELL knowledge base. Our experiments show faster training time and improved perplexity when leveraging external knowledge.