We've built and are now sharing Dynabench, a first-of-its-kind platform for dynamic data collection and benchmarking in artificial intelligence. It uses both humans and models together "in the loop" to create challenging new data sets that will lead to better, more flexible AI. Dynabench radically rethinks AI benchmarking, using a novel procedure called dynamic adversarial data collection to evaluate AI models. It measures how easily AI systems are fooled by humans, which is a better indicator of a model's quality than current static benchmarks provide. Ultimately, this metric will better reflect the performance of AI models in the circumstances that matter most: when interacting with people, who behave and react in complex, changing ways that can't be reflected in a fixed set of data points.
"Unless you have confidence in the ruler's reliability, if you use a ruler to measure a table, you may also be using the table to measure the ruler." Do machine learning researchers solve something huge every time they hit the benchmark? If not, then why do we have these benchmarks? But, if the benchmark is breached every couple of months then research objectives might become more about chasing benchmarks than solving bigger problems. In order to address these challenges, researchers at Facebook AI have introduced Dynabench, a new platform for dynamic data collection and benchmarking.
Benchmarking is a crucial step in developing ever more sophisticated artificial intelligence. It provides a helpful abstraction of the AI's capabilities and allows researchers a firm sense of how well the system is performing on specific tasks. But they are not without their drawbacks. Once an algorithm masters the static dataset from a given benchmark, researchers have to undertake the time-consuming process of developing a new one to further improve the AI. As AIs have improved over time, researchers have had to build new benchmarks with increasing frequency.
Advances in artificial intelligence depend on continual testing of massive amounts of data. This benchmark testing allows researchers to determine how "intelligent" AI is, spot weaknesses and then develop stronger, smarter models. The process, however, is time-consuming. When an AI system tackles a series of computer-generated tasks and eventually reaches peak performance, researchers must go back to the drawing board and design newer, more complex projects to further bolster AI's performance. Facebook announced this week it has found a better tool to undertake this task--people.
Benchmarks can be very misleading, says Douwe Kiela at Facebook AI Research, who led the team behind the tool. Focusing too much on benchmarks can mean losing sight of wider goals. The test can become the task. "You end up with a system that is better at the test than humans are but not better at the overall task," he says. "It's very deceiving, because it makes it look like we're much further than we actually are."