ERASER: A Benchmark to Evaluate Rationalized NLP Models


Many NLP applications today deploy state-of-the-art deep neural networks that are essentially black boxes. One goal of Explainable AI (XAI) is to have models reveal why and how they arrive at their predictions so that a human can interpret them. But work in this direction has been conducted on different datasets with correspondingly different aims, and the inherent subjectivity in defining what constitutes 'interpretability' — which can mean different things depending on the task and context — has left no standard way to evaluate performance. The Evaluating Rationales And Simple English Reasoning (ERASER) benchmark is the first effort to unify and standardize NLP tasks around the goal of interpretability.