Text-to-Image Generation Grounded by Fine-Grained User Attention