Data-driven predictive models are increasingly used in education to support students, instructors, and administrators. However, there are concerns about the fairness of the predictions and uses of these algorithmic systems. In this introduction to algorithmic fairness in education, we draw parallels to prior literature on educational access, bias, and discrimination, and we examine core components of algorithmic systems (measurement, model learning, and action) to identify sources of bias and discrimination in the process of developing and deploying these systems. Statistical, similarity-based, and causal notions of fairness are reviewed and contrasted in the way they apply in educational contexts. Recommendations for policy makers and developers of educational technology offer guidance for how to promote algorithmic fairness in education.
This paper presents a novel method to generate answers for non-extraction machine reading comprehension (MRC) tasks whose answers cannot be simply extracted as one span from the given passages. Using a pointer network-style extractive decoder for such type of MRC may result in unsatisfactory performance when the ground-truth answers are given by human annotators or highly re-paraphrased from parts of the passages. On the other hand, using generative decoder cannot well guarantee the resulted answers with well-formed syntax and semantics when encountering long sentences. Therefore, to alleviate the obvious drawbacks of both sides, we propose an answer making-up method from extracted multi-spans that are learned by our model as highly confident $n$-gram candidates in the given passage. That is, the returned answers are composed of discontinuous multi-spans but not just one consecutive span in the given passages anymore. The proposed method is simple but effective: empirical experiments on MS MARCO show that the proposed method has a better performance on accurately generating long answers, and substantially outperforms two competitive typical one-span and Seq2Seq baseline decoders.
Machine Reading Comprehension (MRC) aims to extract answers to questions given a passage. It has been widely studied recently, especially in open domains. However, few efforts have been made on closed-domain MRC, mainly due to the lack of large-scale training data. In this paper, we introduce a multi-target MRC task for the medical domain, whose goal is to predict answers to medical questions and the corresponding support sentences from medical information sources simultaneously, in order to ensure the high reliability of medical knowledge serving. A high-quality dataset is manually constructed for the purpose, named Multi-task Chinese Medical MRC dataset (CMedMRC), with detailed analysis conducted. We further propose the Chinese medical BERT model for the task (CMedBERT), which fuses medical knowledge into pre-trained language models by the dynamic fusion mechanism of heterogeneous features and the multi-task learning strategy. Experiments show that CMedBERT consistently outperforms strong baselines by fusing context-aware and knowledge-aware token representations.
The Managed Care system within Medicaid (US Healthcare) uses Request For Proposals (RFP) to award contracts for various healthcare and related services. RFP responses are very detailed documents (hundreds of pages) submitted by competing organisations to win contracts. Subject matter expertise and domain knowledge play an important role in preparing RFP responses along with analysis of historical submissions. Automated analysis of these responses through Natural Language Processing (NLP) systems can reduce time and effort needed to explore historical responses, and assisting in writing better responses. Our work draws parallels between scoring RFPs and essay scoring models, while highlighting new challenges and the need for interpretability. Typical scoring models focus on word level impacts to grade essays and other short write-ups. We propose a novel Bi-LSTM based regression model, and provide deeper insight into phrases which latently impact scoring of responses. We contend the merits of our proposed methodology using extensive quantitative experiments. We also qualitatively asses the impact of important phrases using human evaluators. Finally, we introduce a novel problem statement that can be used to further improve the state of the art in NLP based automatic scoring systems.
A significant progress has been made in deep-learning based Automatic Essay Scoring (AES) systems in the past two decades. The performance commonly measured by the standard performance metrics like Quadratic Weighted Kappa (QWK), and accuracy points to the same. However, testing on common-sense adversarial examples of these AES systems reveal their lack of natural language understanding capability. Inspired by common student behaviour during examinations, we propose a task agnostic adversarial evaluation scheme for AES systems to test their natural language understanding capabilities and overall robustness.
Machine Reading Comprehension (MRC) is a challenging NLP research field with wide real world applications. The great progress of this field in recent years is mainly due to the emergence of large-scale datasets and deep learning. At present, a lot of MRC models have already surpassed the human performance on many datasets despite the obvious giant gap between existing MRC models and genuine human-level reading comprehension. This shows the need of improving existing datasets, evaluation metrics and models to move the MRC models toward 'real' understanding. To address this lack of comprehensive survey of existing MRC tasks, evaluation metrics and datasets, herein, (1) we analyzed 57 MRC tasks and datasets; proposed a more precise classification method of MRC tasks with 4 different attributes (2) we summarized 9 evaluation metrics of MRC tasks and (3) 7 attributes and 10 characteristics of MRC datasets; (4) We also discussed some open issues in MRC research and highlight some future research directions. In addition, to help the community, we have collected, organized, and published our data on a companion website(https://mrc-datasets.github.io/) where MRC researchers could directly access each MRC dataset, papers, baseline projects and browse the leaderboard.
Numerical reasoning is often important to accurately understand the world. Recently, several format-specific datasets have been proposed, such as numerical reasoning in the settings of Natural Language Inference (NLI), Reading Comprehension (RC), and Question Answering (QA). Several format-specific models and architectures in response to those datasets have also been proposed. However, there exists a strong need for a benchmark which can evaluate the abilities of models, in performing question format independent numerical reasoning, as (i) the numerical reasoning capabilities we want to teach are not controlled by question formats, (ii) for numerical reasoning technology to have the best possible application, it must be able to process language and reason in a way that is not exclusive to a single format, task, dataset or domain. In pursuit of this goal, we introduce NUMBERGAME, a multifaceted benchmark to evaluate model performance across numerical reasoning tasks of eight diverse formats. We add four existing question types in our compilation. Two of the new types we add are about questions that require external numerical knowledge, commonsense knowledge and domain knowledge. For building a more practical numerical reasoning system, NUMBERGAME demands four capabilities beyond numerical reasoning: (i) detecting question format directly from data (ii) finding intermediate common format to which every format can be converted (iii) incorporating commonsense knowledge (iv) handling data imbalance across formats. We build several baselines, including a new model based on knowledge hunting using a cheatsheet. However, all baselines perform poorly in contrast to the human baselines, indicating the hardness of our benchmark. Our work takes forward the recent progress in generic system development, demonstrating the scope of these under-explored tasks.
Machine reading comprehension (MRC) aims to teach machines to read and comprehend human languages, which is a long-standing goal of natural language processing (NLP). With the burst of deep neural networks and the evolution of contextualized language models (CLMs), the research of MRC has experienced two significant breakthroughs. MRC and CLM, as a phenomenon, have a great impact on the NLP community. In this survey, we provide a comprehensive and comparative review on MRC covering overall research topics about 1) the origin and development of MRC and CLM, with a particular focus on the role of CLMs; 2) the impact of MRC and CLM to the NLP community; 3) the definition, datasets, and evaluation of MRC; 4) general MRC architecture and technical methods in the view of two-stage Encoder-Decoder solving architecture from the insights of the cognitive process of humans; 5) previous highlights, emerging topics, and our empirical analysis, among which we especially focus on what works in different periods of MRC researches. We propose a full-view categorization and new taxonomies on these topics. The primary views we have arrived at are that 1) MRC boosts the progress from language processing to understanding; 2) the rapid improvement of MRC systems greatly benefits from the development of CLMs; 3) the theme of MRC is gradually moving from shallow text matching to cognitive reasoning.
Human tackle reading comprehension not only based on the given context itself but often rely on the commonsense beyond. To empower the machine with commonsense reasoning, in this paper, we propose a Commonsense Evidence Generation and Injection framework in reading comprehension, named CEGI. The framework injects two kinds of auxiliary commonsense evidence into comprehensive reading to equip the machine with the ability of rational thinking. Specifically, we build two evidence generators: the first generator aims to generate textual evidence via a language model; the other generator aims to extract factual evidence (automatically aligned text-triples) from a commonsense knowledge graph after graph completion. Those evidences incorporate contextual commonsense and serve as the additional inputs to the model. Thereafter, we propose a deep contextual encoder to extract semantic relationships among the paragraph, question, option, and evidence. Finally, we employ a capsule network to extract different linguistic units (word and phrase) from the relations, and dynamically predict the optimal option based on the extracted units. Experiments on the CosmosQA dataset demonstrate that the proposed CEGI model outperforms the current state-of-the-art approaches and achieves the accuracy (83.6%) on the leaderboard.
Multilingual pre-trained models could leverage the training data from a rich source language (such as English) to improve performance on low resource languages. However, the transfer quality for multilingual Machine Reading Comprehension (MRC) is significantly worse than sentence classification tasks mainly due to the requirement of MRC to detect the word level answer boundary. In this paper, we propose two auxiliary tasks in the fine-tuning stage to create additional phrase boundary supervision: (1) A mixed MRC task, which translates the question or passage to other languages and builds cross-lingual question-passage pairs; (2) A language-agnostic knowledge masking task by leveraging knowledge phrases mined from web. Besides, extensive experiments on two cross-lingual MRC datasets show the effectiveness of our proposed approach.