One of the most crucial preprocessing steps in any machine learning project is feature encoding. It is the process of turning categorical data in a dataset into numerical data. It is essential that we perform feature encoding because most machine learning models can only interpret numerical data and not data in text form. As usual, I will demonstrate these concepts through a practical case study using the students' performance in exams dataset on Kaggle. You can find the complete notebook up on my GitHub here.
Khashabi, Daniel, Cohan, Arman, Shakeri, Siamak, Hosseini, Pedram, Pezeshkpour, Pouya, Alikhani, Malihe, Aminnaseri, Moin, Bitaab, Marzieh, Brahman, Faeze, Ghazarian, Sarik, Gheini, Mozhdeh, Kabiri, Arman, Mahabadi, Rabeeh Karimi, Memarrast, Omid, Mosallanezhad, Ahmadreza, Noury, Erfan, Raji, Shahab, Rasooli, Mohammad Sadegh, Sadeghi, Sepideh, Azer, Erfan Sadeqi, Samghabadi, Niloofar Safi, Shafaei, Mahsa, Sheybani, Saber, Tazarv, Ali, Yaghoobzadeh, Yadollah
Despite the progress made in recent years in addressing natural language understanding (NLU) challenges, the majority of this progress remains to be concentrated on resource-rich languages like English. This work focuses on Persian language, one of the widely spoken languages in the world, and yet there are few NLU datasets available for this rich language. The availability of high-quality evaluation datasets is a necessity for reliable assessment of the progress on different NLU tasks and domains. We introduce ParsiNLU, the first benchmark in Persian language that includes a range of high-level tasks -- Reading Comprehension, Textual Entailment, etc. These datasets are collected in a multitude of ways, often involving manual annotations by native speakers. This results in over 14.5$k$ new instances across 6 distinct NLU tasks. Besides, we present the first results on state-of-the-art monolingual and multi-lingual pre-trained language-models on this benchmark and compare them with human performance, which provides valuable insights into our ability to tackle natural language understanding challenges in Persian. We hope ParsiNLU fosters further research and advances in Persian language understanding.
The AI and ML deployments are well underway, but for CXOs the biggest issue will be managing these initiatives, and figuring out where the data science team fits in and what algorithms to buy versus build. IBM said it is adding reading comprehension, the ability to extract information from FAQ documents, support for new languages and intent classification models to Watson. The new features outlined by IBM are designed to advance natural language processing and make it a bigger part of automation workflows. IBM has been looking to distinguish Watson with its natural language processing research efforts. For instance, IBM recently said it is commercializing its Project Debater technology that enables artificial intelligence to debate humans and handle complex topics.
Multi-choice Machine Reading Comprehension (MRC) is a major and challenging form of MRC tasks that requires model to select the most appropriate answer from a set of candidates given passage and question. Most of the existing researches focus on the modeling of the task datasets without explicitly referring to external fine-grained commonsense sources, which is a well-known challenge in multi-choice tasks. Thus we propose a novel reference-based knowledge enhancement model based on span extraction called Reference Knowledgeable Network (RekNet), which simulates human reading strategy to refine critical information from the passage and quote external knowledge in necessity. In detail, RekNet refines fine-grained critical information and defines it as Reference Span, then quotes external knowledge quadruples by the co-occurrence information of Reference Span and answer options. Our proposed method is evaluated on two multi-choice MRC benchmarks: RACE and DREAM, which shows remarkable performance improvement with observable statistical significance level over strong baselines.
Multiple-choice questions (MCQs) offer the most promising avenue for skill evaluation in the era of virtual education and job recruiting, where traditional performance-based alternatives such as projects and essays have become less viable, and grading resources are constrained. The automated generation of MCQs would allow assessment creation at scale. Recent advances in natural language processing have given rise to many complex question generation methods. However, the few methods that produce deployable results in specific domains require a large amount of domain-specific training data that can be very costly to acquire. Our work provides an initial foray into MCQ generation under high data-acquisition cost scenarios by strategically emphasizing paraphrasing the question context (compared to the task). In addition to maintaining semantic similarity between the question-answer pairs, our pipeline, which we call AGenT Zero, consists of only pre-trained models and requires no fine-tuning, minimizing data acquisition costs for question generation. AGenT Zero successfully outperforms other pre-trained methods in fluency and semantic similarity. Additionally, with some small changes, our assessment pipeline can be generalized to a broader question and answer space, including short answer or fill in the blank questions.
Answer validation in machine reading comprehension (MRC) consists of verifying an extracted answer against an input context and question pair. Previous work has looked at re-assessing the "answerability" of the question given the extracted answer. Here we address a different problem: the tendency of existing MRC systems to produce partially correct answers when presented with answerable questions. We explore the nature of such errors and propose a post-processing correction method that yields statistically significant performance improvements over state-of-the-art MRC systems in both monolingual and multilingual evaluation.
Explicit and implicit bias clouds human judgement, leading to discriminatory treatment of minority groups. A fundamental goal of algorithmic fairness is to avoid the pitfalls in human judgement by learning policies that improve the overall outcomes while providing fair treatment to protected classes. In this paper, we propose a causal framework that learns optimal intervention policies from data subject to fairness constraints. We define two measures of treatment bias and infer best treatment assignment that minimizes the bias while optimizing overall outcome. We demonstrate that there is a dilemma of balancing fairness and overall benefit; however, allowing preferential treatment to protected classes in certain circumstances (affirmative action) can dramatically improve the overall benefit while also preserving fairness. We apply our framework to data containing student outcomes on standardized tests and show how it can be used to design real-world policies that fairly improve student test scores. Our framework provides a principled way to learn fair treatment policies in real-world settings.
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale annotated datasets in low-source languages, such as Arabic, Hindi, and Vietnamese. Many previous approaches use translation data by translating from a rich-source language, such as English, to low-source languages as auxiliary supervision. However, how to effectively leverage translation data and reduce the impact of noise introduced by translation remains onerous. In this paper, we tackle this challenge and enhance the cross-lingual transferring performance by a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC). A language branch is a group of passages in one single language paired with questions in all target languages. We train multiple machine reading comprehension (MRC) models proficient in individual language based on LBMRC. Then, we devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages. Combining the LBMRC and multilingual distillation can be more robust to the data noises, therefore, improving the model's cross-lingual ability. Meanwhile, the produced single multilingual model is applicable to all target languages, which saves the cost of training, inference, and maintenance for multiple models. Extensive experiments on two CLMRC benchmarks clearly show the effectiveness of our proposed method.
It may be difficult to predict how well a student will perform academically, but a new innovation can do so just by looking at their tweets - and with more than 93 percent accuracy. A computer model trained on thousands of test scores and one million social media posts to distinguishing between high academic achievers and lower ones based on textual features shared in posts. The technology, powered by artificial intelligence, determined that students who discuss scientific and cultural topics, along with writing lengthy posts and words are likely to perform well. However, those who use an abundance of emojis, words or entire phrases written in in capital letters and vocabulary related to horoscopes, driving and military service tend to receive lower grades in school. The team notes that by'predict' they do not mean the system creates a future forecast, but rather a correlation between posts and real test scores students earned.
We propose a simple method to generate large amounts of multilingual question and answer pairs by a single generative model. These synthetic samples are then applied to augment the available gold multilingual ones to improve the performance of multilingual QA models on target languages. Our approach only requires existence of automatically translated samples from English to the target domain, thus removing the need for human annotations in the target languages. Experimental results show our proposed approach achieves significant gains in a number of multilingual datasets.