Comparison with human performance is an essential requirement for a benchmark to serve as a reliable measure of model capabilities. Nevertheless, current methods of model comparison may have a fundamental flaw: the arithmetic mean of separate metrics is applied uniformly to tasks of differing complexity and of differing training- and test-set sizes. In this paper, we examine the overall scoring methods of popular NLP benchmarks and re-rank the models by the geometric and harmonic means (which are appropriate for averaging rates), using their reported results. We analyze several popular benchmarks, including GLUE, SuperGLUE, XGLUE, and XTREME. The analysis shows that e.g.
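The re-ranking effect described above can be illustrated with a small sketch. The per-task scores below are invented for illustration, not taken from any actual leaderboard:

```python
from statistics import mean, geometric_mean, harmonic_mean

# Hypothetical per-task scores for two models (not real leaderboard numbers).
scores = {
    "model_A": [0.95, 0.95, 0.40],  # strong on two tasks, weak on one
    "model_B": [0.75, 0.75, 0.78],  # consistent across all three tasks
}

for name, s in scores.items():
    print(f"{name}: arithmetic={mean(s):.3f} "
          f"geometric={geometric_mean(s):.3f} "
          f"harmonic={harmonic_mean(s):.3f}")
```

Here the arithmetic mean ranks model_A first, while the geometric and harmonic means, which penalize a single weak task more heavily, rank model_B first: the choice of averaging method alone changes the leaderboard order.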
As a gentle rain fell, Woods hit a low hook around the trees on the 14th hole and onto the green. He walked toward the fairway to see how the shot landed, when a Georgia Bureau of Investigation officer ran down a small slope to help control the spectators behind Woods. The officer lost his footing on the rain-slickened grass, sliding into Woods' right foot, making him stumble.
Models that top leaderboards often perform unsatisfactorily when deployed in real-world applications; this has necessitated rigorous and expensive pre-deployment model testing. A hitherto unexplored facet of model performance is: Are our leaderboards doing equitable evaluation? In this paper, we introduce a task-agnostic method to probe leaderboards by weighting samples based on their `difficulty' level. We find that leaderboards can be adversarially attacked and that top-performing models may not always be the best models. We subsequently propose alternate evaluation metrics. Our experiments on 10 models show changes in model ranking and an overall reduction in previously reported performance -- thus rectifying the overestimation of AI systems' capabilities. Inspired by behavioral testing principles, we further develop a prototype of a visual analytics tool that enables leaderboard revamping through customization based on an end user's focus area. This helps users analyze models' strengths and weaknesses, and guides them in selecting the model best suited for their application scenario. In a user study, members of various commercial product development teams, covering 5 focus areas, find that our prototype reduces pre-deployment development and testing effort by 41% on average.
The total dataset size has increased, from 550,230 to 850,736 polygons, and the total area covered has grown from 19,804 to 45,361 square kilometers. The dataset was announced at IEEE CVPR 2019 (the most up-to-date metrics, however, are those on the website above). The dataset creation was led by the Defense Innovation Unit with the technical expertise of Carnegie Mellon's Software Engineering Institute (CMU SEI), CrowdAI, and the Joint Artificial Intelligence Center, with data provided by MAXAR's Open Data Program. Our leaderboard has also been launched on our Challenge page: you need to be logged in and click on the "Leaderboard" tab to see results, and you can make submissions as well. You can find the baseline and the metrics code on GitHub; there is also a Docker link for the baseline here.