Advances in artificial intelligence depend on continual testing against massive amounts of data. This benchmark testing allows researchers to determine how "intelligent" an AI system is, spot weaknesses and then develop stronger, smarter models. The process, however, is time-consuming. When an AI system tackles a series of computer-generated tasks and eventually reaches peak performance, researchers must go back to the drawing board and design newer, more complex tasks to push its performance further. Facebook announced this week it has found a better tool for the job: people.
Benchmarking is a crucial step in developing ever more sophisticated artificial intelligence. It provides a helpful abstraction of an AI's capabilities and gives researchers a firm sense of how well the system is performing on specific tasks. But benchmarks are not without their drawbacks. Once an algorithm masters the static dataset from a given benchmark, researchers have to undertake the time-consuming process of developing a new one to further improve the AI. As AIs have improved over time, researchers have had to build new benchmarks with increasing frequency.
Benchmarks can be very misleading, says Douwe Kiela at Facebook AI Research, who led the team behind the tool. Focusing too much on benchmarks can mean losing sight of wider goals. The test can become the task. "You end up with a system that is better at the test than humans are but not better at the overall task," he says. "It's very deceiving, because it makes it look like we're much further than we actually are."
We've built and are now sharing Dynabench, a first-of-its-kind platform for dynamic data collection and benchmarking in artificial intelligence. It uses both humans and models together "in the loop" to create challenging new data sets that will lead to better, more flexible AI. Dynabench radically rethinks AI benchmarking, using a novel procedure called dynamic adversarial data collection to evaluate AI models. It measures how easily AI systems are fooled by humans, which is a better indicator of a model's quality than current static benchmarks provide. Ultimately, this metric will better reflect the performance of AI models in the circumstances that matter most: when interacting with people, who behave and react in complex, changing ways that can't be reflected in a fixed set of data points.
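To make the idea concrete, here is a minimal Python sketch of one round of dynamic adversarial data collection. It is an illustration under stated assumptions, not Dynabench's actual code or API: `toy_model`, `Example`, and `collect_round` are hypothetical names, and the "model" is a naive keyword rule standing in for a real classifier. People write examples meant to trip the model up; the examples that fool it are kept as new data, and the share of attempts that succeed gives a rough measure of how easily the model is fooled.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Example:
    text: str
    label: str  # the label the human author says is correct


def toy_model(text: str) -> str:
    """Hypothetical stand-in classifier: a naive keyword rule, not a real model."""
    negative_words = {"bad", "terrible", "awful"}
    words = text.lower().split()
    return "negative" if any(w in negative_words for w in words) else "positive"


def collect_round(
    model: Callable[[str], str], attempts: List[Example]
) -> Tuple[List[Example], float]:
    """Keep the human-written examples the model gets wrong.

    Returns the fooling examples and the fraction of attempts that fooled
    the model. A lower fraction suggests a model that is harder to fool.
    """
    fooled = [ex for ex in attempts if model(ex.text) != ex.label]
    rate = len(fooled) / len(attempts) if attempts else 0.0
    return fooled, rate


# Made-up human attempts for illustration only.
attempts = [
    Example("The film was terrible", "negative"),
    Example("Not bad at all, I loved it", "positive"),
    Example("I would rather watch paint dry", "negative"),
    Example("A wonderful evening", "positive"),
]

fooled, rate = collect_round(toy_model, attempts)
for ex in fooled:
    print(f"fooled by: {ex.text!r} (expected {ex.label})")
print(f"fooling rate: {rate:.2f}")
```

In this sketch the fooling examples would then be added to the training data for the next model, and the round repeated against that stronger model. The sketch trusts the human-supplied labels as given; any real system would need some way to check them.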
In the ever-expanding world of computer hardware and software, benchmarks provide a robust method for comparing quality and performance across different system architectures. From MNIST to ImageNet to GLUE, benchmarks have also come to play a hugely important role in driving and measuring progress in AI research. When introducing any new benchmark, it's generally best not to make it so easy that it will quickly become outdated, or so hard that everyone will simply fail. When new models saturate benchmarks, which is happening faster and faster in AI these days, researchers must engage in the time-consuming work of making new ones. Facebook believes that the increasing benchmark saturation in recent years, especially in natural language processing (NLP), means it's time to "radically rethink the way AI researchers do benchmarking and to break free of the limitations of static benchmarks." Its solution is a new research platform for dynamic data collection and benchmarking called Dynabench, which the company says will offer a more accurate and sustainable way of evaluating progress in AI.