a new benchmark for evaluating general-purpose NLU systems, which is necessary given the saturation of the GLUE

Neural Information Processing Systems 

We thank all the reviewers for their time and comments. Our work builds directly on GLUE and maintains the same general structure. Our benchmark does have a less uniform API than GLUE, but we view this as both a pro and a con. WSC is a coreference task, but it is designed to require commonsense reasoning to solve. COPA explicitly tests systems' causal reasoning ability, which is somewhat related to commonsense reasoning.
