Adaptively Aligned Image Captioning via Adaptive Attention Time
Lun Huang, Wenmin Wang, Yaxian Xia, Jie Chen
AAT allows the framework to learn how many attention steps to take to output a caption word at each decoding step. With AAT, an image region can be mapped to an arbitrary number of caption words while a caption word can also attend to an arbitrary number of image regions. AAT is deterministic and differentiable, and doesn't introduce any noise to the parameter gradients.
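To make the idea of a learned, differentiable number of attention steps concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each attention step emits a confidence value in [0, 1] and that the per-step states are blended with weights derived from those confidences, so the "number of steps" becomes a smooth quantity rather than a sampled one. The names `halting_weights`, `mix_states`, and `expected_steps`, and the weighting scheme itself, are illustrative assumptions that do not come from the text above.

```python
from typing import List


def halting_weights(confidences: List[float]) -> List[float]:
    """Turn per-step confidences in [0, 1] into weights that sum to 1.

    weight[n] = confidences[n] * prod(1 - confidences[k] for k < n);
    the final step absorbs whatever mass is left over (assumed scheme).
    """
    weights = []
    remaining = 1.0  # mass not yet assigned to an earlier step
    for n, p in enumerate(confidences):
        is_last = n == len(confidences) - 1
        w = remaining if is_last else remaining * p
        weights.append(w)
        remaining -= w
    return weights


def mix_states(states: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted average of the state vectors produced at each attention step."""
    dim = len(states[0])
    return [sum(w * s[d] for w, s in zip(weights, states)) for d in range(dim)]


def expected_steps(weights: List[float]) -> float:
    """Soft (1-indexed) count of attention steps implied by the weights."""
    return sum((n + 1) * w for n, w in enumerate(weights))


if __name__ == "__main__":
    w = halting_weights([0.5, 0.5, 0.5])
    print(w)
    print(mix_states([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], w))
    print(expected_steps(w))
```

Because every operation here is a product or a sum, the blended state and the soft step count are deterministic functions of the confidences, with no sampling involved. That is the property the abstract attributes to AAT, though the actual formulation in the paper may differ from this sketch.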