Further comparison with related interpretable methods (R1, R2, R3): In our paper, we discussed the main difference

Neural Information Processing Systems 

CNN (Zheng et al., ICCV 2017) uses aggregated conv-feature maps as "part attentions."