A Appendix Overview

We add position embeddings and three kinds of token type embeddings (i.e., word token, context patch token, and region patch token) to the input tokens. We then apply three transformer blocks to jointly encode the input sequence, and feed the output [CLS] token into two separate heads that predict the Shapley interaction estimate and its corresponding uncertainty.
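
For concreteness, the following PyTorch sketch shows one way such an encoder could be assembled. The class and head names (InteractionEstimator, interaction_head, uncertainty_head) and all hyperparameters (hidden size, head count, maximum sequence length) are illustrative assumptions, not values taken from our implementation.

```python
import torch
import torch.nn as nn

class InteractionEstimator(nn.Module):
    """Sketch: token type + position embeddings, three transformer layers,
    and two heads on the [CLS] output (hyperparameters are hypothetical)."""

    NUM_TOKEN_TYPES = 3  # 0: word token, 1: context patch token, 2: region patch token

    def __init__(self, d_model=768, n_heads=12, n_layers=3, max_len=512):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))          # learnable [CLS] token
        self.pos_embed = nn.Embedding(max_len, d_model)              # position embeddings
        self.type_embed = nn.Embedding(self.NUM_TOKEN_TYPES, d_model)  # token type embeddings
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.interaction_head = nn.Linear(d_model, 1)  # Shapley interaction estimate
        self.uncertainty_head = nn.Linear(d_model, 1)  # corresponding uncertainty

    def forward(self, tokens, token_types):
        # tokens: (B, L, d_model) pre-computed embeddings of the input sequence
        # token_types: (B, L) integer ids in {0, 1, 2}
        B, L, _ = tokens.shape
        x = tokens + self.type_embed(token_types)                    # add token type embeddings
        x = torch.cat([self.cls.expand(B, -1, -1), x], dim=1)        # prepend [CLS]
        positions = torch.arange(L + 1, device=tokens.device)
        x = x + self.pos_embed(positions)                            # add position embeddings
        h = self.encoder(x)                                          # joint encoding
        cls_out = h[:, 0]                                            # take the [CLS] output
        return self.interaction_head(cls_out), self.uncertainty_head(cls_out)
```

In this sketch the two quantities are predicted by separate linear heads applied to the same [CLS] representation, mirroring the description above; in practice the uncertainty head may instead output a log-variance or another parameterization.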