Reviews: Semantic-Guided Multi-Attention Localization for Zero-Shot Learning

Neural Information Processing Systems 

The problem is relevant, and the method is based on an interesting attention-based idea: looking at different regions of the image for the task of zero-shot learning (ZSL). The losses used focus on (i) making each attention map peaky while keeping different maps diverse, (ii) an embedding-based softmax for better prediction, and (iii) a class-center triplet loss, which pulls features closer to their respective class centers relative to the other class centers. Line 190 mentions that the image and parts are sent to "separate backbone networks", which implies that the network parameters are not shared. If that is the case, then the method will have 3x the parameters of competing methods, i.e., a significantly higher-capacity network overall. What happens when the CNN parameters are shared? And what happens when the image-only baseline has a higher-capacity backbone network (which is then also fine-tuned end-to-end)?
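
To make the third loss concrete, here is a minimal NumPy sketch of a class-center triplet loss as I understand it from the description above. It is not taken from the paper: the function name `class_center_triplet_loss`, the squared Euclidean distance, the `margin` value, and the choice of the nearest wrong center as the negative are all my assumptions for illustration.

```python
import numpy as np

def class_center_triplet_loss(features, labels, centers, margin=1.0):
    """Hinge loss pulling each feature toward its own class center.

    features: (N, D) array, labels: (N,) int array, centers: (K, D) array.
    Assumptions (not from the paper): squared Euclidean distance, a fixed
    margin, and the nearest wrong-class center used as the negative.
    """
    features = np.asarray(features, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)

    # Squared distance from every feature to every center, shape (N, K).
    diffs = features[:, None, :] - centers[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)

    idx = np.arange(len(labels))
    pos = dists[idx, labels]  # distance to the feature's own class center

    # Mask out the true class, then take the closest remaining center.
    masked = dists.copy()
    masked[idx, labels] = np.inf
    neg = masked.min(axis=1)

    # Penalize features that are not closer to their own center by `margin`.
    return float(np.mean(np.maximum(0.0, margin + pos - neg)))
```

Under these assumptions the loss is zero once every feature is closer to its own center than to any other center by at least the margin, which matches the "closer relative to the other class centers" behavior described in point (iii).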