Computers are built to process data, but there's a particular form of information so rich and dense in meaning that it's beyond the full comprehension of even the most advanced AI. It's also one that you and I process intuitively and deal in every day: language. Understanding the written and spoken word is a big and important challenge for computer scientists. This month, a small milestone was passed when a pair of teams from Microsoft and Alibaba independently created AI programs that can outperform humans on a reading comprehension test. As you might expect, this news resulted in a flurry of coverage.
Artificial intelligence (AI) from Alibaba and Microsoft beat the human score in a Stanford reading comprehension test, the companies announced separately on Monday. The Stanford Question Answering Dataset (SQuAD) uses a set of questions and answers about Wikipedia articles, according to our sister site CNET. Microsoft scored 82.65 and Alibaba scored 82.44, both good for first place and both just ahead of the human score of 82.304. The results, however slim the margin, suggest AI may be able to match or outperform humans in certain tasks. As the field develops, this margin will most likely increase, potentially allowing AI to become capable enough to take over certain jobs--possibly even high-level ones--and let humans focus on others.
When computer models designed by tech giants Alibaba and Microsoft this month surpassed humans for the first time in a reading-comprehension test, both companies celebrated the success as a historic milestone. Luo Si, the chief scientist for natural-language processing at Alibaba's AI research unit, struck a poetic note, saying, "Objective questions such as 'what causes rain' can now be answered with high accuracy by machines." Teaching a computer to read has for decades been one of artificial intelligence's holiest grails, and the feat seemed to signal a coming future in which AI could understand words and process meaning with the same fluidity humans take for granted every day. But computers aren't there yet -- and aren't even really that close, said AI experts who reviewed the test results. Instead, the accomplishment highlights not just how far the technology has progressed, but also how far it still has to go.
Alibaba has developed an artificial intelligence model that scored better than humans in a Stanford University reading and comprehension test. Alibaba Group Holding (BABA) put its deep neural network model through its paces last week, asking the AI to provide exact answers to more than 100,000 questions comprising a quiz that's considered one of the world's most authoritative machine-reading gauges. The model developed by Alibaba's Institute of Data Science and Technologies scored 82.44, edging past the 82.304 that rival humans achieved. Alibaba said it's the first time a machine has outdone a real person in such a contest. Microsoft achieved a similar feat, scoring 82.650 on the same test, but those results were finalized a day after Alibaba's, the company said.
Chinese retail giant Alibaba has developed an artificial intelligence model that's managed to outdo human participants in a reading and comprehension test designed by Stanford University. The model scored 82.44, whereas humans recorded a score of 82.304. The Stanford Question Answering Dataset is a set of more than 100,000 questions pertaining to some 500 Wikipedia articles. The answer to each question is a particular span of text from the corresponding piece of writing. Alibaba claims that its accomplishment is the first time that humans have been outmatched on this particular test, according to a report from Bloomberg.
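The span-based answer format described above can be sketched in a few lines. The field names below mirror the public SQuAD JSON schema, but the passage, question, and character offset are invented here purely for illustration:

```python
# Sketch of SQuAD's answer-as-span format. Field names mirror the public
# SQuAD JSON schema; the passage, question, and offset are invented here.
example = {
    "context": (
        "Rain is liquid water in the form of droplets that have "
        "condensed from atmospheric water vapor."
    ),
    "question": "What does rain condense from?",
    "answers": [{"text": "atmospheric water vapor", "answer_start": 70}],
}

# The answer is not free text: it is literally a character span of the passage.
ans = example["answers"][0]
start = ans["answer_start"]
span = example["context"][start:start + len(ans["text"])]
print(span)
```

Because every answer must be a verbatim span of the source passage, a model never has to generate language; it only has to point at the right stretch of text, which is part of why these tests understate full reading comprehension.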
Chinese artificial intelligence is now capable of outperforming humans in reading comprehension. A neural network model created by Chinese e-commerce giant Alibaba beat its flesh-and-blood competition on a 100,000-question Stanford University test that's considered the world's top measure of machine reading. The model, developed by Alibaba's Institute of Data Science and Technologies, scored 82.44, while humans scored 82.304. Microsoft's artificial intelligence also beat humans, scoring 82.65 on the exam. But its results came in a day after Alibaba's, meaning China holds the title as first country to create automation that outranks humans in written language comprehension.
First, it was the AlphaGo AI from Google's DeepMind subsidiary that made history by beating the world's best Go players at their own game. Then, an AI named Libratus, developed at Carnegie Mellon University, outclassed poker pros in a tournament, turning the world's attention to the rapid pace at which AI is progressing. In the latest such example of an AI outsmarting human beings, a deep neural network model developed by Alibaba fared better than humans in a reading comprehension test. The AI model developed by Alibaba's Institute of Data Science and Technologies blazed past the SQuAD (Stanford Question Answering Dataset) test -- one of the most reliable reading comprehension tests for evaluating a machine's language skills -- in a contest which pitted it against human rivals. Alibaba's AI scored a cumulative 82.44 Exact Match (EM) points, outscoring its human competitors, who managed to put up 82.304 points on the scoreboard.
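The Exact Match (EM) metric behind these scores can be sketched briefly. The normalization below follows the approach of the public SQuAD evaluation script (lowercasing, stripping punctuation and the articles a/an/the, collapsing whitespace), though this is a simplified reimplementation, not the official scorer:

```python
import re
import string

def normalize(text: str) -> str:
    """Normalize an answer roughly the way the SQuAD evaluation script
    does: lowercase, strip punctuation, drop the articles a/an/the,
    and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """A prediction earns Exact Match credit if, after normalization,
    it equals any of the reference answers."""
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

print(exact_match("The water vapor.", ["water vapor"]))  # True
```

A leaderboard EM score such as 82.44 is then 100 times the fraction of questions on which the system's predicted span matches a reference answer under this comparison.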
In this article, we describe a deployed educational technology application: the Criterion Online Essay Evaluation Service, a web-based system that provides automated scoring and evaluation of student essays. Criterion has two complementary applications: (1) Critique Writing Analysis Tools, a suite of programs that detect errors in grammar, usage, and mechanics, that identify discourse elements in the essay, and that recognize potentially undesirable elements of style, and (2) e-rater version 2.0, an automated essay scoring system. Critique and e-rater provide students with feedback that is specific to their writing in order to help them improve their writing skills; both are intended to be used under the instruction of a classroom teacher. Both applications employ natural language processing and machine learning techniques. All of these capabilities outperform baseline algorithms, and some of the tools agree with human judges in their evaluations as often as two judges agree with each other. Students improve their writing through frequent practice and individualized feedback, but providing that feedback puts an enormous load on the classroom teacher, who is faced with reading and responding to perhaps 30 essays or more every time a topic is assigned. As a result, teachers are not able to give writing assignments as often as they would wish. With this in mind, researchers have sought to develop applications that automate essay scoring and evaluation. Work in automated essay scoring began in the early 1960s and has been extremely productive (Page 1966; Burstein et al. 1998; Foltz, Kintsch, and Landauer 1998; Larkey 1998; Rudner 2002; Elliott 2003). Detailed descriptions of most of these systems appear in Shermis and Burstein (2003). Pioneering work in the related area of automated feedback was initiated in the 1980s with the Writer's Workbench (MacDonald et al. 1982). The Criterion Online Essay Evaluation Service combines automated essay scoring and diagnostic feedback.
The feedback is specific to the student's essay and is based on the kinds of evaluations that teachers typically provide when grading a student's writing. Criterion is intended to be an aid, not a replacement, for classroom instruction. Its purpose is to ease the instructor's load, thereby enabling the instructor to give students more practice writing essays. Criterion contains two complementary applications that are based on natural language processing (NLP) methods. Critique is an application comprising a suite of programs that evaluate and provide feedback for errors in grammar, usage, and mechanics, that identify the essay's discourse structure, and that recognize potentially undesirable stylistic features. The companion scoring application, e-rater version 2.0, extracts linguistically based features from an essay and uses a statistical model of how these features are related to overall writing quality to assign a holistic score to the essay. Figure 1 shows Criterion's interface for submitting an essay.
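The feature-then-model pipeline described above can be illustrated with a toy sketch. To be clear, this is not e-rater's actual feature set or model; the features and weights below are invented for illustration, whereas the real system uses far richer linguistic features and a statistical model fit to human-scored essays:

```python
# Toy sketch of feature-based essay scoring in the spirit of systems like
# e-rater: extract a few shallow linguistic features, then map them to a
# holistic score with a linear model. Features and weights are invented.

def extract_features(essay: str) -> list:
    """Return a small vector of shallow essay features."""
    words = essay.split()
    sentences = [s for s in essay.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    avg_sentence_len = len(words) / max(len(sentences), 1)
    avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
    vocab_ratio = len({w.lower() for w in words}) / max(len(words), 1)
    return [len(words), avg_sentence_len, avg_word_len, vocab_ratio]

def holistic_score(features, weights, bias=1.0):
    """Linear model: weighted sum of features, clipped to a 1-6 scale."""
    raw = bias + sum(w * f for w, f in zip(weights, features))
    return max(1.0, min(6.0, raw))

weights = [0.005, 0.05, 0.3, 1.0]  # invented weights, not fit to any data
essay = ("Writing improves with practice. "
         "Feedback from teachers helps students revise and grow.")
print(round(holistic_score(extract_features(essay), weights), 2))
```

In a real system of this kind, the weights would be estimated by regressing features extracted from a training corpus against scores assigned by human raters, which is what lets the model approximate a teacher's holistic judgment.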
In October, American teachers prevailed in a lawsuit against their school district over a computer program that assessed their performance. The system rated teachers in Houston by comparing their students' test scores against state averages. Those with high ratings won praise and even bonuses. Those who fared poorly faced the sack. The program did not please everyone.
From next year, NAPLAN persuasive writing tasks will be marked by an automated essay scoring system. Dr Perelman, a former director of writing at MIT, has published widely on writing assessment and was commissioned by the NSW Teachers Federation to review a 2015 paper by the Australian Curriculum, Assessment and Reporting Authority (ACARA) that concluded automated essay scoring was as effective, if not more so, than human markers. "ACARA's extensive research indicates automated marking is as reliable and valid as human marking," Dr Rabinowitz said on behalf of ACARA. NSW Teachers Federation acting president Gary Zadkovich called on ACARA to suspend its plan to introduce automated essay scoring.