Supplementary Material of A Unified Conditional Framework for Diffusion-based Image Restoration

Neural Information Processing Systems 

For all tasks, we adopt a UNet architecture similar to the one described in DvSR [4]. The input is first expanded to a 64-channel feature map. The encoder and the decoder each have five stages, and each stage contains two diffusion model blocks. Between consecutive encoder stages, a convolution layer with stride 2 halves the spatial resolution and the number of channels is doubled. Each decoder stage reverses these changes: nearest-neighbor upsampling restores the feature map resolution, and a convolution layer reduces the number of channels. During training, we use a linear noise schedule with a total of T = 2000 steps.
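To make these settings concrete, the following is a minimal sketch, not our released implementation, of the encoder channel and resolution progression and of a linear noise schedule with T = 2000. The function names, the 256-pixel input resolution, and the schedule endpoints `beta_start` and `beta_end` are illustrative assumptions; the text above fixes only the base width of 64, the five stages, the factor-of-2 changes between stages, and the number of steps.

```python
def encoder_stage_shapes(base_channels=64, num_stages=5, resolution=256):
    """Return (channels, resolution) for each encoder stage.

    The resolution is halved and the channel count doubled between
    consecutive stages, so there are num_stages - 1 downsampling steps.
    The default resolution of 256 is an assumption for illustration.
    """
    shapes = []
    channels, res = base_channels, resolution
    for stage in range(num_stages):
        shapes.append((channels, res))
        if stage < num_stages - 1:
            channels *= 2  # channels expanded by a factor of 2
            res //= 2      # stride-2 convolution halves the resolution
    return shapes


def linear_beta_schedule(num_steps=2000, beta_start=1e-6, beta_end=1e-2):
    """Linearly spaced noise variances beta_1, ..., beta_T.

    beta_start and beta_end are hypothetical placeholders; the endpoint
    values are not specified in the text.
    """
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


shapes = encoder_stage_shapes()
betas = linear_beta_schedule()

for index, (channels, res) in enumerate(shapes, start=1):
    print(f"encoder stage {index}: {channels} channels at {res}x{res}")
print(f"{len(betas)} noise levels from {betas[0]:.1e} to {betas[-1]:.1e}")
```

Under these assumptions, the decoder would traverse the same list of shapes in reverse order, with two blocks operating at each entry.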