MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts

Neural Information Processing Systems 

Text-to-image diffusion has attracted wide attention due to its impressive image-generation capabilities. However, in human-centric text-to-image generation, particularly for faces and hands, the results often look unnatural due to insufficient training priors. In this work, we alleviate the issue from two perspectives.