Not enough data to create a plot.
Try a different view from the menu above.
Mitra, Arindam
Orca: Progressive Learning from Complex Explanation Traces of GPT-4
Mukherjee, Subhabrata, Mitra, Arindam, Jawahar, Ganesh, Agarwal, Sahaj, Palangi, Hamid, Awadallah, Ahmed
Recent research has focused on enhancing the capability of smaller models through imitation learning, drawing on the outputs generated by large foundation models (LFMs). A number of issues impact the quality of these models, ranging from limited imitation signals from shallow LFM outputs; small scale homogeneous training data; and most notably a lack of rigorous evaluation resulting in overestimating the small model's capability as they tend to learn to imitate the style, but not the reasoning process of LFMs. To address these challenges, we develop Orca (We are working with our legal team to publicly release a diff of the model weights in accordance with LLaMA's release policy to be published at https://aka.ms/orca-lm), a 13-billion parameter model that learns to imitate the reasoning process of LFMs. Orca learns from rich signals from GPT-4 including explanation traces; step-by-step thought processes; and other complex instructions, guided by teacher assistance from ChatGPT. To promote this progressive learning, we tap into large-scale and diverse imitation data with judicious sampling and selection. Orca surpasses conventional state-of-the-art instruction-tuned models such as Vicuna-13B by more than 100% in complex zero-shot reasoning benchmarks like Big-Bench Hard (BBH) and 42% on AGIEval. Moreover, Orca reaches parity with ChatGPT on the BBH benchmark and shows competitive performance (4 pts gap with optimized system message) in professional and academic examinations like the SAT, LSAT, GRE, and GMAT, both in zero-shot settings without CoT; while trailing behind GPT-4. Our research indicates that learning from step-by-step explanations, whether these are generated by humans or more advanced AI models, is a promising direction to improve model capabilities and skills.
Instruction Tuned Models are Quick Learners
Gupta, Himanshu, Sawant, Saurabh Arjun, Mishra, Swaroop, Nakamura, Mutsumi, Mitra, Arindam, Mashetty, Santosh, Baral, Chitta
Instruction tuning of language models has demonstrated the ability to enhance model generalization to unseen tasks via in-context learning using a few examples. However, typical supervised learning still requires a plethora of downstream training data for finetuning. Often in real-world situations, there is a scarcity of data available for finetuning, falling somewhere between few shot inference and fully supervised finetuning. In this work, we demonstrate the sample efficiency of instruction tuned models over various tasks by estimating the minimal downstream training data required by them to perform transfer learning and match the performance of state-of-the-art (SOTA) supervised models. We conduct experiments on 119 tasks from Super Natural Instructions (SuperNI) in both the single task learning (STL) and multi task learning (MTL) settings. Our findings reveal that, in the STL setting, instruction tuned models equipped with 25% of the downstream train data surpass the SOTA performance on the downstream tasks. In the MTL setting, an instruction tuned model trained on only 6% of downstream training data achieve SOTA, while using 100% of the training data results in a 3.69% points improvement (ROUGE-L 74.68) over the previous SOTA. We conduct an analysis on T5 vs Tk-Instruct by developing several baselines to demonstrate that instruction tuning aids in increasing both sample efficiency and transfer learning. Additionally, we observe a consistent ~4% performance increase in both settings when pre-finetuning is performed with instructions. Finally, we conduct a categorical study and find that contrary to previous results, tasks in the question rewriting and title generation categories suffer from instruction tuning.
Can Transformers Reason About Effects of Actions?
Banerjee, Pratyay, Baral, Chitta, Luo, Man, Mitra, Arindam, Pal, Kuntal, Son, Tran C., Varshney, Neeraj
A recent work has shown that transformers are able to "reason" with facts and rules in a limited setting where the rules are natural language expressions of conjunctions of conditions implying a conclusion. Since this suggests that transformers may be used for reasoning with knowledge given in natural language, we do a rigorous evaluation of this with respect to a common form of knowledge and its corresponding reasoning -- the reasoning about effects of actions. Reasoning about action and change has been a top focus in the knowledge representation subfield of AI from the early days of AI and more recently it has been a highlight aspect in common sense question answering. We consider four action domains (Blocks World, Logistics, Dock-Worker-Robots and a Generic Domain) in natural language and create QA datasets that involve reasoning about the effects of actions in these domains. We investigate the ability of transformers to (a) learn to reason in these domains and (b) transfer that learning from the generic domains to the other domains.
Towards Question Format Independent Numerical Reasoning: A Set of Prerequisite Tasks
Mishra, Swaroop, Mitra, Arindam, Varshney, Neeraj, Sachdeva, Bhavdeep, Baral, Chitta
Numerical reasoning is often important to accurately understand the world. Recently, several format-specific datasets have been proposed, such as numerical reasoning in the settings of Natural Language Inference (NLI), Reading Comprehension (RC), and Question Answering (QA). Several format-specific models and architectures in response to those datasets have also been proposed. However, there exists a strong need for a benchmark which can evaluate the abilities of models, in performing question format independent numerical reasoning, as (i) the numerical reasoning capabilities we want to teach are not controlled by question formats, (ii) for numerical reasoning technology to have the best possible application, it must be able to process language and reason in a way that is not exclusive to a single format, task, dataset or domain. In pursuit of this goal, we introduce NUMBERGAME, a multifaceted benchmark to evaluate model performance across numerical reasoning tasks of eight diverse formats. We add four existing question types in our compilation. Two of the new types we add are about questions that require external numerical knowledge, commonsense knowledge and domain knowledge. For building a more practical numerical reasoning system, NUMBERGAME demands four capabilities beyond numerical reasoning: (i) detecting question format directly from data (ii) finding intermediate common format to which every format can be converted (iii) incorporating commonsense knowledge (iv) handling data imbalance across formats. We build several baselines, including a new model based on knowledge hunting using a cheatsheet. However, all baselines perform poorly in contrast to the human baselines, indicating the hardness of our benchmark. Our work takes forward the recent progress in generic system development, demonstrating the scope of these under-explored tasks.
A Generate-Validate Approach to Answering Questions about Qualitative Relationships
Mitra, Arindam, Baral, Chitta, Bhattacharjee, Aurgho, Shrivastava, Ishan
Qualitative relationships describe how increasing or decreasing one property (e.g. altitude) affects another (e.g. temperature). They are an important aspect of natural language question answering and are crucial for building chatbots or voice agents where one may enquire about qualitative relationships. Recently a dataset about question answering involving qualitative relationships has been proposed, and a few approaches to answer such questions have been explored, in the heart of which lies a semantic parser that converts the natural language input to a suitable logical form. A problem with existing semantic parsers is that they try to directly convert the input sentences to a logical form. Since the output language varies with each application, it forces the semantic parser to learn almost everything from scratch. In this paper, we show that instead of using a semantic parser to produce the logical form, if we apply the generate-validate framework i.e. generate a natural language description of the logical form and validate if the natural language description is followed from the input text, we get a better scope for transfer learning and our method outperforms the state-of-the-art by a large margin of 7.93%.
Careful Selection of Knowledge to solve Open Book Question Answering
Banerjee, Pratyay, Pal, Kuntal Kumar, Mitra, Arindam, Baral, Chitta
Open book question answering is a type of natural language based QA (NLQA) where questions are expected to be answered with respect to a given set of open book facts, and common knowledge about a topic. Recently a challenge involving such QA, OpenBookQA, has been proposed. Unlike most other NLQA tasks that focus on linguistic understanding, Open-BookQA requires deeper reasoning involving linguistic understanding as well as reasoning with common knowledge. In this paper we address QA with respect to the OpenBookQA dataset and combine state of the art language models with abductive information retrieval (IR), information gain based re-ranking, passage selection and weighted scoring to achieve 72.0% accuracy, an 11.6% improvement over the current state of the art.
Declarative Question Answering over Knowledge Bases containing Natural Language Text with Answer Set Programming
Mitra, Arindam, Clark, Peter, Tafjord, Oyvind, Baral, Chitta
While in recent years machine learning (ML) based approaches have been the popular approach in developing end-to-end question answering systems, such systems often struggle when additional knowledge is needed to correctly answer the questions. Proposed alternatives involve translating the question and the natural language text to a logical representation and then use logical reasoning. However, this alternative falters when the size of the text gets bigger. To address this we propose an approach that does logical reasoning over premises written in natural language text. The proposed method uses recent features of Answer Set Programming (ASP) to call external NLP modules (which may be based on ML) which perform simple textual entailment. To test our approach we develop a corpus based on the life cycle questions and showed that Our system achieves up to $18\%$ performance gain when compared to standard MCQ solvers.
Incremental and Iterative Learning of Answer Set Programs from Mutually Distinct Examples
Mitra, Arindam, Baral, Chitta
Over these years the Artificial Intelligence (AI) community has produced several datasets which have given the machine learning algorithms the opportunity to learn various skills across various domains. However, a subclass of these machine learning algorithms that aimed at learning logic programs, namely the Inductive Logic Programming algorithms, have often failed at the task due to the vastness of these datasets. This has impacted the usability of knowledge representation and reasoning techniques in the development of AI systems. In this research, we try to address this scalability issue for the algorithms that learn Answer Set Programs. We present a sound and complete algorithm which takes the input in a slightly different manner and perform an efficient and more user controlled search for a solution. We show via experiments that our algorithm can learn from two popular datasets from machine learning community, namely bAbl (a question answering dataset) and MNIST (a dataset for handwritten digit recognition), which to the best of our knowledge was not previously possible. The system is publicly available at https://goo.gl/KdWAcV.
Addressing a Question Answering Challenge by Combining Statistical Methods with Inductive Rule Learning and Reasoning
Mitra, Arindam (Arizona State University) | Baral, Chitta (Arizona State University)
A group of researchers from Facebook has recently proposed a set of 20 question-answering tasks (Facebook's bAbl dataset) as a challenge for the natural language understanding ability of an intelligent agent. These tasks are designed to measure various skills of an agent, such as: fact based question-answering, simple induction, the ability to find paths, co-reference resolution and many more. Their goal is to aid in the development of systems that can learn to solve such tasks and to allow a proper evaluation of such systems. They show existing systems cannot fully solve many of those toy tasks. In this work, we present a system that excels at all the tasks except one. The proposed model of the agent uses the Answer Set Programming (ASP) language as the primary knowledge representation and reasoning language along with the standard statistical Natural Language Processing (NLP) models. Given a training dataset containing a set of narrations, questions and their answers, the agent jointly uses a translation system, an Inductive Logic Programming algorithm and Statistical NLP methods to learn the knowledge needed to answer similar questions. Our results demonstrate that the introduction of a reasoning module significantly improves the performance of an intelligent agent.