Neural Information Processing Systems
In early experiments exploring potential benchmark tasks, we found that BERT generally performs well on existing datasets for this type of task, likely because the pretraining procedure explicitly accounts for this format (e.g., via the segment IDs). However, we suspect that there is much more research to be done on how best to realize this transfer, particularly with only a small number of samples from which to adapt. We observe that the GLUE tasks where humans still substantially outperform models are the tasks with the least amounts of data.

In terms of measuring specific linguistic capabilities of systems, we provide a diagnostic dataset (AX) aimed to give users a focused analysis of their systems' language understanding abilities. Each example in the diagnostic has expert labels of what types of natural language phenomena are present. Therefore, skill overlap across tasks is useful because we want to test whether systems can perform these high-level abilities despite surface variation between tasks.
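For concreteness, the two-sentence input format that BERT's pretraining accounts for can be sketched as below. This is a minimal illustration, not the paper's or BERT's actual code: `build_bert_inputs` is a hypothetical helper showing how segment IDs distinguish the two sentences in a `[CLS] A [SEP] B [SEP]` sequence.

```python
# Minimal sketch (hypothetical helper, not BERT's implementation) of how
# segment IDs mark the two-sentence format used during BERT pretraining.
def build_bert_inputs(tokens_a, tokens_b):
    """Assemble [CLS] A [SEP] B [SEP] with a per-token segment ID."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and its [SEP]; segment 1 covers
    # sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segments = build_bert_inputs(["the", "cat", "sat"], ["it", "purred"])
```

Because the model sees this segmentation throughout pretraining, sentence-pair benchmark tasks transfer to it with relatively little adaptation.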