We Must Stop Comparing Deep Learning's Real Accuracy To Nonexistent Human Perfection


As deep learning has become ubiquitous, evaluations of its accuracy typically compare its performance against an idealized baseline of flawless human performance that bears no resemblance to the actual human workflows those algorithms are designed to replace. For example, the accuracy of real-time algorithmic speech recognition is frequently compared against human captioning produced offline, in multi-coder reconciled environments, and subjected to multiple rounds of review, yielding polished output that looks nothing like actual real-time human transcription. If we really wish to understand the usability of AI today, we should compare it against the human workflows it is designed to replace, not an impossible vision of nonexistent human perfection.

While the press is filled with the latest superhuman exploits of bleeding-edge research AI systems besting humans at yet another task, the reality of production AI systems is far more mundane: most commercial applications of deep learning achieve higher accuracy than their human counterparts at some tasks and worse performance at others.
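To make the comparison concrete, transcription accuracy is conventionally scored with word error rate (WER): the word-level edit distance between a reference transcript and a hypothesis, divided by the reference length. The sketch below is a minimal, self-contained WER implementation; the sample transcripts are invented for illustration only, and the point is that the same metric can score a live human captioner just as easily as an ASR system, so the two can be compared on equal footing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical transcripts: a careful offline reference, a live captioner's
# output, and an ASR system's output for the same utterance.
reference = "the quick brown fox jumps over the lazy dog"
live_captioner = "the quick brown fox jumps over lazy dog"   # dropped a word
asr_system = "the quick brown box jumps over the lazy dog"   # one substitution

print(word_error_rate(reference, live_captioner))  # 1/9, about 0.111
print(word_error_rate(reference, asr_system))      # 1/9, about 0.111
```

In this toy example the live captioner and the ASR system score identically, even though the reconciled offline reference would score a perfect 0.0 against itself, which is precisely the mismatch the article describes when production systems are graded against multi-pass human-perfected transcripts rather than real-time human work.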