Aligning where to see and what to tell: image caption with region-based attention and scene factorization
