Collaborating Authors

Student Performance

Grading on a curve? Why AI systems test brilliantly but stumble in real life -


The headline in early 2018 was a shocker: "Robots are better at reading than humans." Two artificial intelligence systems, one from Microsoft and the other from Alibaba, had scored slightly higher than humans on Stanford's widely used test of reading comprehension. The test scores were real, but the conclusion was wrong. As Robin Jia and Percy Liang of Stanford showed a few months later, the "robots" were only better than humans at taking that specific test. Because they had trained themselves on readings that were similar to those on the test.

Challenge Closed-book Science Exam: A Meta-learning Based Question Answering System Artificial Intelligence

Prior work in standardized science exams requires support from large text corpus, such as targeted science corpus from Wikipedia or SimpleWikipedia. However, retrieving knowledge from the large corpus is time-consuming and questions embedded in complex semantic representation may interfere with retrieval. Inspired by the dual process theory in cognitive science, we propose a MetaQA framework, where system 1 is an intuitive meta-classifier and system 2 is a reasoning module. Specifically, our method based on meta-learning method and large language model BERT, which can efficiently solve science problems by learning from related example questions without relying on external knowledge bases. We evaluate our method on AI2 Reasoning Challenge (ARC), and the experimental results show that meta-classifier yields considerable classification performance on emerging question types. The information provided by meta-classifier significantly improves the accuracy of reasoning module from 46.6% to 64.2%, which has a competitive advantage over retrieval-based QA methods.

From Machine Reading Comprehension to Dialogue State Tracking: Bridging the Gap Artificial Intelligence

Dialogue state tracking (DST) is at the heart of task-oriented dialogue systems. However, the scarcity of labeled data is an obstacle to building accurate and robust state tracking systems that work across a variety of domains. Existing approaches generally require some dialogue data with state information and their ability to generalize to unknown domains is limited. In this paper, we propose using machine reading comprehension (RC) in state tracking from two perspectives: model architectures and datasets. We divide the slot types in dialogue state into categorical or extractive to borrow the advantages from both multiple-choice and span-based reading comprehension models. Our method achieves near the current state-of-the-art in joint goal accuracy on MultiWOZ 2.1 given full training data. More importantly, by leveraging machine reading comprehension datasets, our method outperforms the existing approaches by many a large margin in few-shot scenarios when the availability of in-domain data is limited. Lastly, even without any state tracking data, i.e., zero-shot scenario, our proposed approach achieves greater than 90% average slot accuracy in 12 out of 30 slots in MultiWOZ 2.1.

Variational Question-Answer Pair Generation for Machine Reading Comprehension Artificial Intelligence

We present a deep generative model of question-answer (QA) pairs for machine reading comprehension. We introduce two independent latent random variables into our model in order to diversify answers and questions separately. We also study the effect of explicitly controlling the KL term in the variational lower bound in order to avoid the "posterior collapse" issue, where the model ignores latent variables and generates QA pairs that are almost the same. Our experiments on SQuAD v1.1 showed that variational methods can aid QA pair modeling capacity, and that the controlled KL term can significantly improve diversity while generating high-quality questions and answers comparable to those of the existing systems.

R3: A Reading Comprehension Benchmark Requiring Reasoning Processes Artificial Intelligence

Existing question answering systems can only predict answers without explicit reasoning processes, which hinder their explainability and make us overestimate their ability of understanding and reasoning over natural language. In this work, we propose a novel task of reading comprehension, in which a model is required to provide final answers and reasoning processes. To this end, we introduce a formalism for reasoning over unstructured text, namely Text Reasoning Meaning Representation (TRMR). TRMR consists of three phrases, which is expressive enough to characterize the reasoning process to answer reading comprehension questions. We develop an annotation platform to facilitate TRMR's annotation, and release the R3 dataset, a \textbf{R}eading comprehension benchmark \textbf{R}equiring \textbf{R}easoning processes. R3 contains over 60K pairs of question-answer pairs and their TRMRs. Our dataset is available at: \url{http://anonymous}.

GenNet : Reading Comprehension with Multiple Choice Questions using Generation and Selection model Artificial Intelligence

Multiple-choice machine reading comprehension is difficult task as its required machines to select the correct option from a set of candidate or possible options using the given passage and question.Reading Comprehension with Multiple Choice Questions task,required a human (or machine) to read a given passage, question pair and select the best one option from n given options. There are two different ways to select the correct answer from the given passage. Either by selecting the best match answer to by eliminating the worst match answer. Here we proposed GenNet model, a neural network-based model. In this model first we will generate the answer of the question from the passage and then will matched the generated answer with given answer, the best matched option will be our answer. For answer generation we used S-net (Tan et al., 2017) model trained on SQuAD and to evaluate our model we used Large-scale RACE (ReAding Comprehension Dataset From Examinations) (Lai et al.,2017).

Undersensitivity in Neural Reading Comprehension Machine Learning

Current reading comprehension models generalise well to in-distribution test sets, yet perform poorly on adversarially selected inputs. Most prior work on adversarial inputs studies oversensitivity: semantically invariant text perturbations that cause a model's prediction to change when it should not. In this work we focus on the complementary problem: excessive prediction undersensitivity, where input text is meaningfully changed but the model's prediction does not, even though it should. We formulate a noisy adversarial attack which searches among semantic variations of the question for which a model erroneously predicts the same answer, and with even higher probability. Despite comprising unanswerable questions, both SQuAD2.0 and NewsQA models are vulnerable to this attack. This indicates that although accurate, models tend to rely on spurious patterns and do not fully consider the information specified in a question. We experiment with data augmentation and adversarial training as defences, and find that both substantially decrease vulnerability to attacks on held out data, as well as held out attack spaces. Addressing undersensitivity also improves results on AddSent and AddOneSent, and models furthermore generalise better when facing train/evaluation distribution mismatch: they are less prone to overly rely on predictive cues present only in the training set, and outperform a conventional model by as much as 10.9% F1.

Densely Connected Attention Propagation for Reading Comprehension

Neural Information Processing Systems

We propose DecaProp (Densely Connected Attention Propagation), a new densely connected neural architecture for reading comprehension (RC). There are two distinct characteristics of our model. Firstly, our model densely connects all pairwise layers of the network, modeling relationships between passage and query across all hierarchical levels. Secondly, the dense connectors in our network are learned via attention instead of standard residual skip-connectors. To this end, we propose novel Bidirectional Attention Connectors (BAC) for efficiently forging connections throughout the network.

ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning Artificial Intelligence

Recent powerful pre-trained language models have achieved remarkable performance on most of the popular datasets for reading comprehension. It is time to introduce more challenging datasets to push the development of this field towards more comprehensive reasoning of text. In this paper, we introduce a new Reading Comprehension dataset requiring logical reasoning (ReClor) extracted from standardized graduate admission examinations. As earlier studies suggest, human-annotated datasets usually contain biases, which are often exploited by models to achieve high accuracy without truly understanding the text. In order to comprehensively evaluate the logical reasoning ability of models on ReClor, we propose to identify biased data points and separate them into EASY set while the rest as HARD set. Empirical results show that state-of-the-art models have an outstanding ability to capture biases contained in the dataset with high accuracy on EASY set. However, they struggle on HARD set with poor performance near that of random guess, indicating more research is needed to essentially enhance the logical reasoning ability of current models. 1

AI in education: Using ed tech to save teachers time and reduce workloads


For much of the previous decade, advocates of education technology imagined a classroom where computer algorithms would differentiate instruction for each student, delivering just the right lessons at the right time, like a personal tutor. The evidence that students learn better this way has not been strong and, instead, we're reading reports that technology use at school sometimes hurts student achievement. So it was interesting to see McKinsey & Co., an elite consulting firm, reframe the argument for buying education technology away from computerized instruction to something more pedestrian: saving teachers time. A January 2020 report by the firm estimated that between 20 and 40 percent of the 50 hours that a typical teacher currently works a week could be saved through existing automation technology, often enabled by artificial intelligence (AI). That adds up to 13 saved hours a week, hours of freedom that could help relieve teacher burnout.