On the Importance of Application-Grounded Experimental Design for Evaluating Explainable ML Methods
Amarasinghe, Kasun, Rodolfa, Kit T., Jesus, Sérgio, Chen, Valerie, Balayan, Vladimir, Saleiro, Pedro, Bizarro, Pedro, Talwalkar, Ameet, Ghani, Rayid
–arXiv.org Artificial Intelligence
Evaluation studies frequently rely on simplified experimental settings with non-expert users (e.g., workers on Amazon Mechanical Turk), use proxy tasks (e.g., forward simulation), or use subjective, user-reported measures as metrics of explanation quality [9, 16, 18, 19, 25, 26, 31]. Such settings are not equipped to evaluate the real-world utility of explainable ML methods: proxy-task performance does not reflect real-task performance [3], users' perception of explanation usefulness does not reflect its utility in a task [3, 17], and proxy users do not reflect how expert users would use explanations [1]. A few studies evaluate explainable ML methods in their intended deployment settings, with domain-expert users performing the intended task [10, 20] (dubbed application-grounded evaluation studies in [6]). However, even in those, we argue that experimental design flaws (e.g., not isolating the incremental impact of explanations in [20]) and seemingly trivial design choices that cause the experimental setting to deviate from the deployment context (e.g., using metrics that do not reflect the task objectives in [10]) limit the applicability of the conclusions drawn. We elaborate on these limitations in Section 2. In this work, we seek to bridge this critical gap by conducting a study that evaluates explainable ML methods in a setting consistent with the intended deployment context. Our study builds on the e-commerce fraud detection setting used in a previous evaluation study [10], in which professional fraud analysts review e-commerce transactions to detect fraud when the ML model is uncertain about the outcome. We identify several simplifying assumptions made by the previous study that deviated from the deployment context and modify the setup to relax them (summarized in Table 1 and discussed in detail in Section 3.2). These modifications make the experimental setup faithful to the deployment setting and equipped to evaluate the utility of the explainable ML methods considered. Our setup leads to dramatically different conclusions about the relative utility of ML model scores and explanations compared to the earlier work [10].
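To make the deployment-style setup concrete, the sketch below is our own illustration (not code from the paper or from [10]) of one way such an evaluation could route uncertain transactions to analyst review and then score the resulting decisions with a cost-weighted metric aligned with the task objective instead of a proxy measure such as plain accuracy. The uncertainty band, cost values, and the simulated analyst decisions are all assumptions.

```python
# Hypothetical sketch: model-uncertain transactions go to human review, and
# the combined human+model decisions are scored with a task-aligned,
# cost-weighted metric. All thresholds, costs, and data are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Simulated transactions: model fraud scores in [0, 1] and true labels.
scores = rng.uniform(size=1000)
labels = (rng.uniform(size=1000) < scores * 0.5).astype(int)

# Deployment-style policy: auto-decide confident cases, send the uncertain
# band to analyst review (where analysts would see the transaction, the model
# score, and possibly an explanation, depending on the experimental condition).
LOW, HIGH = 0.3, 0.7
auto_approve = scores < LOW
auto_decline = scores > HIGH
to_review = ~(auto_approve | auto_decline)

# Placeholder for analyst decisions on the reviewed subset; in an actual
# application-grounded study these come from professional fraud analysts.
analyst_decline = rng.uniform(size=to_review.sum()) < scores[to_review]

decisions = np.zeros_like(labels)            # 1 = decline (flag as fraud)
decisions[auto_decline] = 1
decisions[to_review] = analyst_decline.astype(int)

# Task-aligned metric: asymmetric costs for missed fraud vs. wrongly declined
# legitimate transactions (illustrative values).
COST_MISSED_FRAUD, COST_FALSE_DECLINE = 10.0, 1.0
missed_fraud = ((decisions == 0) & (labels == 1)).sum()
false_decline = ((decisions == 1) & (labels == 0)).sum()
total_cost = COST_MISSED_FRAUD * missed_fraud + COST_FALSE_DECLINE * false_decline
print(f"reviewed: {to_review.sum()}, total cost: {total_cost:.0f}")
```

Comparing this cost under different review conditions (score only, score plus explanation, etc.) is one way a study could measure the incremental impact of explanations against the deployment objective rather than a proxy metric.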
Feb-21-2023