How should we evaluate progress in AI?
The evaluation question is inseparable from questions about what sort of thing AI is--and both are inseparable from questions about how best to do it.

Most intellectual disciplines have standard, unquestioned criteria for what counts as progress. Artificial intelligence is an exception. This has always caused trouble. The diverse evaluation criteria are incommensurable. They suggest divergent directions for research. They produce sharp disagreements about what methods to apply, which results are important, and how well the field is progressing.

Can't AI make up its mind about what it is trying to do? Can't it just decide to be something respectable--science or engineering--and use a coherent set of evaluation criteria drawn from one of those disciplines?

That doesn't seem to be possible. AI is unavoidably a wolpertinger, stitched together from bits of other disciplines. It's rarely possible to evaluate specific AI projects according to the criteria of a single one of them.

This post offers a framework for thinking about what makes the AI wolpertinger fly. The framework is, so to speak, parameterized: it accommodates differing perspectives on the relative value of criteria from the six disciplines, and on their role in AI research. How they are best combined is a judgement call, differing according to the observer and the project observed. Nevertheless, one can make cogent arguments in favor of weighting particular criteria more or less heavily.1

Choices about how to evaluate AI lead to choices about what problems to address, what approaches to take, and what methods to apply. I will advocate improving AI practice through greater use of scientific experimentation; particular pursuit of philosophically interesting questions; better understanding of design practice; and greater care in creating spectacular demos. Follow-on posts will explain these points in more detail.

This framework is meant mainly for participants in AI research.