ON EVALUATING AI SYSTEMS FOR MEDICAL DIAGNOSIS
Among the difficulties in evaluating AI-type medical diagnosis systems are: the intermediate conclusions of the AI system need to be looked at in addition to the "final" answer; the "superhuman human" fallacy must be resisted; both pro- and anti-computer biases during evaluation must be guarded against; and methods for estimating how the approach will scale upwards to larger domains are needed. We propose a type of Turing test for the evaluation problem, designed to provide some protection against the problems listed above. We propose to measure both the accuracy of diagnosis and the structure of reasoning, the latter with a view to gauging how well the system will scale up.

A staple of many of the evaluations of AI systems that have so far been conducted (Colby, Hilf, Weber, & Kraemer, 1972; Yu et al., 1979) is a central idea from a well-known proposal to evaluate AI systems: the Turing Test (Turing, 1963). The meat of the idea is to see if a neutral observer, given a set of performances on a task, some by a machine and others by humans, but unlabelled as to authorship, could identify, better than chance, which were machine-produced and which were human-produced. Note that this really attempts to answer the question, "Do we know how to design a machine to perform a task which until now required human intelligence?" [...] The latter question subsumes the former in a sense, because the machine not performing well in comparison to a human would presumably increase the cost significantly. In this paper I follow tradition and consider the evaluation of AI systems for medical diagnosis from the viewpoint of the first question above. The proposed procedure is also a variant of Turing's Test.
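The "better than chance" criterion in the Turing-style procedure can be made concrete with a small statistical check. The sketch below is illustrative only and is not taken from the paper: it assumes the observer makes one machine-or-human guess per unlabelled performance, that blind guessing would be correct with probability 0.5, and that a one-sided exact binomial test is an acceptable way to judge the result. The function names, the 0.5 chance level, and the 0.05 threshold are all hypothetical choices.

```python
from math import comb


def p_value_better_than_chance(correct: int, total: int, chance: float = 0.5) -> float:
    """One-sided exact binomial p-value.

    Returns P(X >= correct) when each of `total` guesses is right
    independently with probability `chance` (assumed 0.5 here).
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


def observer_can_tell(true_authors, guessed_authors, alpha: float = 0.05):
    """Compare an observer's guesses with the true authorship labels.

    Returns (number correct, p-value, whether the observer beat chance
    at the assumed significance level `alpha`).
    """
    if len(true_authors) != len(guessed_authors):
        raise ValueError("label lists must have the same length")
    correct = sum(t == g for t, g in zip(true_authors, guessed_authors))
    p = p_value_better_than_chance(correct, len(true_authors))
    return correct, p, p < alpha


if __name__ == "__main__":
    # Hypothetical data: ten diagnostic write-ups, five from each source.
    truth = ["machine", "human"] * 5
    guesses = ["machine", "human", "human", "human", "machine",
               "human", "machine", "machine", "machine", "human"]
    print(observer_can_tell(truth, guesses))
```

Under these assumptions, a large p-value means the observer's guesses are indistinguishable from guessing, which is the outcome a system "passing" the test would produce. A small number of cases gives the test little power, so a failure to beat chance on a short case list is weak evidence either way.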
Jan-4-2018, 14:50:20 GMT