Review for NeurIPS paper: Benchmarking Deep Learning Interpretability in Time Series Predictions

Neural Information Processing Systems 

This work introduces a set of benchmarks, with accompanying metrics, for evaluating saliency methods on time series. The authors run a number of empirical evaluations, draw some conclusions about why certain approaches don't work, and propose a new saliency method based on those conclusions. There are a number of things that I like about this work, and the reviewers pointed them out as well: there is a definite lack of datasets with ground-truth saliency, so coming up with such a dataset (and associated metrics) is a worthy contribution by itself (though perhaps not rising to the bar of acceptance at NeurIPS). In general, everyone agreed that this part of the paper is good. What was more controversial is whether the subsequent analysis is interesting and novel enough.