Reviews: Learning to Predict Layout-to-image Conditional Convolutions for Semantic Image Synthesis

Neural Information Processing Systems 

This paper proposes a strongly conditional network for generating images from semantic maps. How robust is this network to small changes in the input map? For example, given three sequential frames of a video (as segmentation maps), is the model consistent in assigning colors and structures, or do small changes in the geometry of the semantic objects have a large impact on the output? This is mostly curiosity, but a model with inherent smoothness would have strong potential for video applications.

Some qualitative comparisons to other models are shown, but visualizing the important regions of the input conditioning, and the influence of input perturbations on the model output, could yield further insight. A method such as Grad-CAM (or a related attribution technique) may be applicable for checking the importance of input features.
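The perturbation test suggested above can be sketched as follows. This is a minimal illustration, not the paper's method: `generate` is a placeholder standing in for the trained layout-to-image generator (here just a fixed per-label color lookup so the sketch runs), and the shift-and-realign comparison isolates non-geometric changes in the output.

```python
import numpy as np


def generate(label_map: np.ndarray) -> np.ndarray:
    """Placeholder for the trained layout-to-image generator.

    Maps each semantic label to a fixed color so the sketch is runnable;
    in practice this would be the conditional network under review.
    """
    palette = np.array(
        [[0, 0, 0], [255, 0, 0], [0, 255, 0], [0, 0, 255]], dtype=np.float32
    )
    return palette[label_map]


def perturbation_sensitivity(label_map: np.ndarray, shift: int = 1) -> float:
    """Shift the semantic map by `shift` pixels, generate, then shift the
    output back so that a perfectly equivariant model scores 0.0.

    The residual measures how much the model's color/structure assignments
    change under a small geometric perturbation of the conditioning input.
    """
    base = generate(label_map)
    shifted_input = np.roll(label_map, shift, axis=1)
    shifted_output = generate(shifted_input)
    realigned = np.roll(shifted_output, -shift, axis=1)
    return float(np.mean(np.abs(base - realigned)))
```

For the pointwise placeholder the score is exactly zero; a real generator with global conditioning would typically produce a nonzero residual, and tracking this score across sequential video frames would quantify the temporal consistency asked about above.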