Arunkumar, Anjana
LINGO: Visually Debiasing Natural Language Instructions to Support Task Diversity
Arunkumar, Anjana, Sharma, Shubham, Agrawal, Rakhi, Chandrasekaran, Sriram, Bryan, Chris
Cross-task generalization is a significant outcome that defines mastery in natural language understanding. Humans show a remarkable aptitude for this, and can solve many different types of tasks, given definitions in the form of textual instructions and a small set of examples. Recent work with pre-trained language models mimics this learning style: users can define and exemplify a task for the model to attempt as a series of natural language prompts or instructions. While prompting approaches have led to higher cross-task generalization compared to traditional supervised learning, analyzing 'bias' in the task instructions given to the model is a difficult problem, and has thus been relatively unexplored. For instance, are we truly modeling a task, or are we modeling a user's instructions? To help investigate this, we develop LINGO, a novel visual analytics interface that supports an effective, task-driven workflow to (1) help identify bias in natural language task instructions, (2) alter (or create) task instructions to reduce bias, and (3) evaluate pre-trained model performance on debiased task instructions. To robustly evaluate LINGO, we conduct a user study with both novice and expert instruction creators, over a dataset of 1,616 linguistic tasks and their natural language instructions, spanning 55 different languages. For both user groups, LINGO promotes the creation of tasks that are more difficult for pre-trained models and that contain higher linguistic diversity and lower instruction bias. We additionally discuss how the insights learned in developing and evaluating LINGO can aid in the design of future dashboards that aim to minimize the effort involved in prompt creation across multiple domains.
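The abstract reports that LINGO-created tasks show higher linguistic diversity but does not give a formula. Purely as a minimal sketch (not LINGO's actual measure, and with hypothetical function and variable names), the distinct-n statistic below illustrates one way diversity over a set of task instructions could be quantified: the fraction of n-grams that are unique across all instructions.

```python
# Minimal sketch, not LINGO's metric: distinct-n as a rough proxy for the
# linguistic diversity of a set of natural language task instructions.
from collections import Counter

def distinct_n(instructions, n=2):
    """Fraction of n-grams that are unique across a list of instruction strings."""
    ngrams = Counter()
    for text in instructions:
        tokens = text.lower().split()
        ngrams.update(zip(*(tokens[i:] for i in range(n))))
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

# Heavily templated instructions score lower than varied ones (toy example).
templated = ["Answer the question about the passage.", "Answer the question about the review."]
varied = ["Summarize the passage in one sentence.", "Translate the review into French."]
print(distinct_n(templated), distinct_n(varied))
```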
Real-Time Visual Feedback to Guide Benchmark Creation: A Human-and-Metric-in-the-Loop Workflow
Arunkumar, Anjana, Mishra, Swaroop, Sachdeva, Bhavdeep, Baral, Chitta, Bryan, Chris
Recent research has shown that language models exploit 'artifacts' in benchmarks to solve tasks, rather than truly learning them, leading to inflated model performance. In pursuit of creating better benchmarks, we propose VAIDA, a novel benchmark creation paradigm for NLP that focuses on guiding crowdworkers, an under-explored facet of addressing benchmark idiosyncrasies. VAIDA facilitates sample correction by providing real-time visual feedback and recommendations to improve sample quality. Our approach is domain, model, task, and metric agnostic, and constitutes a paradigm shift for robust, validated, and dynamic benchmark creation via human-and-metric-in-the-loop workflows. We evaluate VAIDA via expert review and a user study with NASA TLX. We find that VAIDA decreases the effort, frustration, and mental and temporal demands of crowdworkers and analysts, while simultaneously increasing the performance of both user groups, with a 45.8% decrease in the level of artifacts in created samples. As a by-product of our user study, we observe that created samples are adversarial across models, leading to performance decreases of 31.3% (BERT), 22.5% (RoBERTa), and 14.98% (GPT-3 few-shot).
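The 45.8% figure refers to the level of artifacts in created samples; the abstract does not specify how that level is computed. As a hedged illustration (not VAIDA's pipeline), a common proxy in the literature is a partial-input baseline: if a simple classifier that sees only part of each sample can beat the majority-class baseline, the samples leak label information. The function and variable names below are hypothetical.

```python
# Illustrative sketch, not VAIDA's artifact measure: a partial-input baseline as
# a rough proxy for how much label information leaks from created samples.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def artifact_level(partial_inputs, labels):
    """Cross-validated accuracy of a bag-of-words classifier on partial inputs,
    minus the majority-class baseline; a larger gap suggests more artifacts."""
    majority = Counter(labels).most_common(1)[0][1] / len(labels)
    features = TfidfVectorizer().fit_transform(partial_inputs)
    clf_acc = cross_val_score(LogisticRegression(max_iter=1000), features, labels, cv=5).mean()
    return clf_acc - majority
```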
Investigating the Failure Modes of the AUC metric and Exploring Alternatives for Evaluating Systems in Safety Critical Applications
Mishra, Swaroop, Arunkumar, Anjana, Baral, Chitta
With the increasing importance of safety requirements associated with the use of black-box models, evaluating the selective answering capability of models has become critical. Area under the curve (AUC) is used as a metric for this purpose. We find limitations in AUC; e.g., a model with a higher AUC is not always better at selective answering. We propose three alternate metrics that fix the identified limitations. In experiments with ten models, our results using the new metrics show that newer and larger pre-trained models do not necessarily show better performance in selective answering. We hope our insights will help develop better models tailored for safety-critical applications.
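For context, here is a minimal sketch of the quantity the abstract critiques: in selective answering, a model abstains on its least-confident inputs, and the area under the risk-coverage curve summarizes how well confidence ranks correct answers above wrong ones (lower is better). The function below uses assumed variable names and is not the paper's implementation.

```python
# Minimal sketch: area under the risk-coverage curve for selective answering.
import numpy as np

def risk_coverage_auc(confidences, correct):
    """confidences: per-example model confidence; correct: 1 if the prediction is right."""
    order = np.argsort(-np.asarray(confidences))      # answer most-confident examples first
    errors = 1.0 - np.asarray(correct, dtype=float)[order]
    answered = np.arange(1, len(errors) + 1)
    coverage = answered / len(errors)                 # fraction of examples answered
    risk = np.cumsum(errors) / answered               # error rate among answered examples
    return np.trapz(risk, coverage)                   # lower is better

# Two models with identical accuracy can differ sharply here, depending on whether
# their confidence estimates actually separate correct from incorrect answers.
```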
How Robust are Model Rankings: A Leaderboard Customization Approach for Equitable Evaluation
Mishra, Swaroop, Arunkumar, Anjana
Models that top leaderboards often perform unsatisfactorily when deployed in real-world applications; this has necessitated rigorous and expensive pre-deployment model testing. A hitherto unexplored facet of model performance is: are our leaderboards doing equitable evaluation? In this paper, we introduce a task-agnostic method to probe leaderboards by weighting samples based on their 'difficulty' level. We find that leaderboards can be adversarially attacked, and that top-performing models may not always be the best models. We subsequently propose alternate evaluation metrics. Our experiments on 10 models show changes in model ranking and an overall reduction in previously reported performance -- thus rectifying the overestimation of AI systems' capabilities. Inspired by behavioral testing principles, we further develop a prototype of a visual analytics tool that enables leaderboard revamping through customization, based on an end user's focus area. This helps users analyze models' strengths and weaknesses, and guides them in the selection of a model best suited for their application scenario. In a user study, members of various commercial product development teams, covering 5 focus areas, find that our prototype reduces pre-deployment development and testing effort by 41% on average.
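As a hedged sketch of the sample-weighting idea (not the paper's exact scheme), the snippet below approximates a sample's difficulty by how many reference models get it wrong and re-scores each model with a difficulty-weighted accuracy; all names are illustrative.

```python
# Illustrative sketch: difficulty-weighted leaderboard scoring.
import numpy as np

def weighted_ranking(correct_matrix):
    """correct_matrix: shape (n_models, n_samples), entry 1 if the model answers correctly."""
    P = np.asarray(correct_matrix, dtype=float)
    difficulty = 1.0 - P.mean(axis=0)                    # harder samples are solved by fewer models
    weights = difficulty / max(difficulty.sum(), 1e-9)   # normalize (guard against all-easy sets)
    scores = P @ weights                                 # difficulty-weighted accuracy per model
    return np.argsort(-scores), scores                   # ranking (best first) and scores

# Rankings can shift relative to plain accuracy when a top model's lead comes
# mostly from easy samples.
```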
Our Evaluation Metric Needs an Update to Encourage Generalization
Mishra, Swaroop, Arunkumar, Anjana, Bryan, Chris, Baral, Chitta
Models that surpass human performance on several popular benchmarks display significant degradation in performance on exposure to Out of Distribution (OOD) data. Recent research has shown that models overfit to spurious biases and 'hack' datasets, in lieu of learning generalizable features like humans. In order to stop the inflation in model performance -- and thus the overestimation of AI systems' capabilities -- we propose a simple and novel evaluation metric, WOOD Score, that encourages generalization during evaluation.
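The abstract names the WOOD Score but does not state its formula here. Purely as an illustration of the underlying idea, and not the paper's definition, the sketch below rewards out-of-distribution (OOD) accuracy more heavily than in-distribution (IID) accuracy, so that models which 'hack' the training distribution are penalized; the weighting is an assumption.

```python
# Illustrative sketch only -- not the WOOD Score's actual formula.
def generalization_score(iid_accuracy, ood_accuracy, ood_weight=2.0):
    """Weighted combination that up-weights OOD accuracy (weights assumed for illustration)."""
    return (iid_accuracy + ood_weight * ood_accuracy) / (1.0 + ood_weight)

# A model at 95% IID / 60% OOD scores below one at 88% IID / 75% OOD, even though
# it would top a purely in-distribution leaderboard.
print(generalization_score(0.95, 0.60), generalization_score(0.88, 0.75))
```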