Aligning where to see and what to tell: image caption with region-based attention and scene factorization