Arunkumar, Anjana
LINGO: Visually Debiasing Natural Language Instructions to Support Task Diversity
Arunkumar, Anjana, Sharma, Shubham, Agrawal, Rakhi, Chandrasekaran, Sriram, Bryan, Chris
Cross-task generalization is a significant outcome that defines mastery in natural language understanding. Humans show a remarkable aptitude for this, and can solve many different types of tasks, given definitions in the form of textual instructions and a small set of examples. Recent work with pre-trained language models mimics this learning style: users can define and exemplify a task for the model to attempt as a series of natural language prompts or instructions. While prompting approaches have led to higher cross-task generalization compared to traditional supervised learning, analyzing 'bias' in the task instructions given to the model is a difficult problem, and has thus been relatively unexplored. For instance, are we truly modeling a task, or are we modeling a user's instructions? To help investigate this, we develop LINGO, a novel visual analytics interface that supports an effective, task-driven workflow to (1) help identify bias in natural language task instructions, (2) alter (or create) task instructions to reduce bias, and (3) evaluate pre-trained model performance on debiased task instructions. To robustly evaluate LINGO, we conduct a user study with both novice and expert instruction creators, over a dataset of 1,616 linguistic tasks and their natural language instructions, spanning 55 different languages. For both user groups, LINGO promotes the creation of tasks that are more difficult for pre-trained models and that contain higher linguistic diversity and lower instruction bias. We additionally discuss how the insights learned in developing and evaluating LINGO can aid in the design of future dashboards that aim to minimize the effort involved in prompt creation across multiple domains.
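The abstract reports that LINGO-created tasks show higher linguistic diversity but does not give a formula. Purely as a minimal sketch (not LINGO's actual measure, and with hypothetical function and variable names), the distinct-n statistic below illustrates one way diversity over a set of task instructions could be quantified: the fraction of n-grams that are unique across all instructions.

```python
# Minimal sketch, not LINGO's metric: distinct-n as a rough proxy for the
# linguistic diversity of a set of natural language task instructions.
from collections import Counter

def distinct_n(instructions, n=2):
    """Fraction of n-grams that are unique across a list of instruction strings."""
    ngrams = Counter()
    for text in instructions:
        tokens = text.lower().split()
        ngrams.update(zip(*(tokens[i:] for i in range(n))))
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

# Heavily templated instructions score lower than varied ones (toy example).
templated = ["Answer the question about the passage.", "Answer the question about the review."]
varied = ["Summarize the passage in one sentence.", "Translate the review into French."]
print(distinct_n(templated), distinct_n(varied))
```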
Real-Time Visual Feedback to Guide Benchmark Creation: A Human-and-Metric-in-the-Loop Workflow
Arunkumar, Anjana, Mishra, Swaroop, Sachdeva, Bhavdeep, Baral, Chitta, Bryan, Chris
Recent research has shown that language models exploit 'artifacts' in benchmarks to solve tasks, rather than truly learning them, leading to inflated model performance. In pursuit of creating better benchmarks, we propose VAIDA, a novel benchmark creation paradigm for NLP that focuses on guiding crowdworkers, an under-explored facet of addressing benchmark idiosyncrasies. VAIDA facilitates sample correction by providing real-time visual feedback and recommendations to improve sample quality. Our approach is domain, model, task, and metric agnostic, and constitutes a paradigm shift for robust, validated, and dynamic benchmark creation via human-and-metric-in-the-loop workflows. We evaluate VAIDA via expert review and a user study with NASA TLX. We find that VAIDA decreases the effort, frustration, and mental and temporal demands of crowdworkers and analysts, while simultaneously increasing the performance of both user groups, with a 45.8% decrease in the level of artifacts in created samples. As a by-product of our user study, we observe that created samples are adversarial across models, leading to performance decreases of 31.3% (BERT), 22.5% (RoBERTa), and 14.98% (GPT-3 few-shot).
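The 45.8% figure refers to the level of artifacts in created samples; the abstract does not specify how that level is computed. As a hedged illustration (not VAIDA's pipeline), a common proxy in the literature is a partial-input baseline: if a simple classifier that sees only part of each sample can beat the majority-class baseline, the samples leak label information. The function and variable names below are hypothetical.

```python
# Illustrative sketch, not VAIDA's artifact measure: a partial-input baseline as
# a rough proxy for how much label information leaks from created samples.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def artifact_level(partial_inputs, labels):
    """Cross-validated accuracy of a bag-of-words classifier on partial inputs,
    minus the majority-class baseline; a larger gap suggests more artifacts."""
    majority = Counter(labels).most_common(1)[0][1] / len(labels)
    features = TfidfVectorizer().fit_transform(partial_inputs)
    clf_acc = cross_val_score(LogisticRegression(max_iter=1000), features, labels, cv=5).mean()
    return clf_acc - majority
```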
Investigating the Failure Modes of the AUC metric and Exploring Alternatives for Evaluating Systems in Safety Critical Applications
Mishra, Swaroop, Arunkumar, Anjana, Baral, Chitta
With the increasing importance of safety requirements associated with the use of black-box models, evaluating the selective answering capability of models has become critical. Area under the curve (AUC) is used as a metric for this purpose. We find limitations in AUC; e.g., a model with a higher AUC is not always better at selective answering. We propose three alternate metrics that fix the identified limitations. In experiments with ten models, our results using the new metrics show that newer and larger pre-trained models do not necessarily show better performance in selective answering. We hope our insights will help develop better models tailored for safety-critical applications.
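For context, here is a minimal sketch of the quantity the abstract critiques: in selective answering, a model abstains on its least-confident inputs, and the area under the risk-coverage curve summarizes how well confidence ranks correct answers above wrong ones (lower is better). The function below uses assumed variable names and is not the paper's implementation.

```python
# Minimal sketch: area under the risk-coverage curve for selective answering.
import numpy as np

def risk_coverage_auc(confidences, correct):
    """confidences: per-example model confidence; correct: 1 if the prediction is right."""
    order = np.argsort(-np.asarray(confidences))      # answer most-confident examples first
    errors = 1.0 - np.asarray(correct, dtype=float)[order]
    answered = np.arange(1, len(errors) + 1)
    coverage = answered / len(errors)                 # fraction of examples answered
    risk = np.cumsum(errors) / answered               # error rate among answered examples
    return np.trapz(risk, coverage)                   # lower is better

# Two models with identical accuracy can differ sharply here, depending on whether
# their confidence estimates actually separate correct from incorrect answers.
```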
How Robust are Model Rankings: A Leaderboard Customization Approach for Equitable Evaluation
Mishra, Swaroop, Arunkumar, Anjana
Models that top leaderboards often perform unsatisfactorily when deployed in real-world applications; this has necessitated rigorous and expensive pre-deployment model testing. A hitherto unexplored facet of model performance is: are our leaderboards doing equitable evaluation? In this paper, we introduce a task-agnostic method to probe leaderboards by weighting samples based on their 'difficulty' level. We find that leaderboards can be adversarially attacked, and that top-performing models may not always be the best models. We subsequently propose alternate evaluation metrics. Our experiments on 10 models show changes in model ranking and an overall reduction in previously reported performance -- thus rectifying the overestimation of AI systems' capabilities. Inspired by behavioral testing principles, we further develop a prototype of a visual analytics tool that enables leaderboard revamping through customization, based on an end user's focus area. This helps users analyze models' strengths and weaknesses, and guides them in the selection of a model best suited for their application scenario. In a user study, members of various commercial product development teams, covering 5 focus areas, find that our prototype reduces pre-deployment development and testing effort by 41% on average.
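As a hedged sketch of the sample-weighting idea (not the paper's exact scheme), the snippet below approximates a sample's difficulty by how many reference models get it wrong and re-scores each model with a difficulty-weighted accuracy; all names are illustrative.

```python
# Illustrative sketch: difficulty-weighted leaderboard scoring.
import numpy as np

def weighted_ranking(correct_matrix):
    """correct_matrix: shape (n_models, n_samples), entry 1 if the model answers correctly."""
    P = np.asarray(correct_matrix, dtype=float)
    difficulty = 1.0 - P.mean(axis=0)                    # harder samples are solved by fewer models
    weights = difficulty / max(difficulty.sum(), 1e-9)   # normalize (guard against all-easy sets)
    scores = P @ weights                                 # difficulty-weighted accuracy per model
    return np.argsort(-scores), scores                   # ranking (best first) and scores

# Rankings can shift relative to plain accuracy when a top model's lead comes
# mostly from easy samples.
```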
Our Evaluation Metric Needs an Update to Encourage Generalization
Mishra, Swaroop, Arunkumar, Anjana, Bryan, Chris, Baral, Chitta
Models that surpass human performance on several popular benchmarks display significant degradation in performance on exposure to Out of Distribution (OOD) data. Recent research has shown that models overfit to spurious biases and 'hack' datasets, in lieu of learning generalizable features like humans. In order to stop the inflation in model performance -- and thus the overestimation of AI systems' capabilities -- we propose a simple and novel evaluation metric, WOOD Score, that encourages generalization during evaluation.
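The abstract names the WOOD Score but does not state its formula here. Purely as an illustration of the underlying idea, and not the paper's definition, the sketch below rewards out-of-distribution (OOD) accuracy more heavily than in-distribution (IID) accuracy, so that models which 'hack' the training distribution are penalized; the weighting is an assumption.

```python
# Illustrative sketch only -- not the WOOD Score's actual formula.
def generalization_score(iid_accuracy, ood_accuracy, ood_weight=2.0):
    """Weighted combination that up-weights OOD accuracy (weights assumed for illustration)."""
    return (iid_accuracy + ood_weight * ood_accuracy) / (1.0 + ood_weight)

# A model at 95% IID / 60% OOD scores below one at 88% IID / 75% OOD, even though
# it would top a purely in-distribution leaderboard.
print(generalization_score(0.95, 0.60), generalization_score(0.88, 0.75))
```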