Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning
Guo, Dandan, Lu, Ruiying, Chen, Bo, Zeng, Zequn, Zhou, Mingyuan
Describing visual content in a natural-language utterance is an emerging interdisciplinary problem at the intersection of computer vision (CV) and natural language processing (NLP) ((1)). Because a sentence-level short image caption ((2, 3, 4)) has limited descriptive capacity, (5) introduce a paragraph-level captioning method that generates a detailed and coherent paragraph to describe an image in a finer manner. Recent advances in image paragraph generation focus on building different types of hierarchical recurrent neural networks (HRNNs), e.g., based on LSTMs ((6)), to generate the visual paragraphs. In an HRNN, the high-level RNN recursively produces a sequence of sentence-level topic vectors given the image features as input, while the low-level RNN subsequently decodes each topic vector into an output sentence. By modeling each sentence and coupling the sentences into one paragraph, these hierarchical architectures often outperform flat models ((5)). To improve performance and generate more diverse paragraphs, (9) and (10) propose advanced methods that extend the HRNN with a generative adversarial network (GAN) ((7)) or variational auto-encoders (VAEs) ((8)).
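The two-level decoding scheme described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' model: all layer sizes, weight initializations, and the greedy word decoder are assumptions chosen for brevity, and plain tanh RNN cells stand in for the LSTMs used in the literature.

```python
import numpy as np

rng = np.random.default_rng(0)

def init(shape):
    # Small random weights; purely illustrative, no training is performed.
    return rng.standard_normal(shape) * 0.1

class HierarchicalCaptioner:
    """Sketch of an HRNN paragraph captioner: a high-level (sentence) RNN
    maps an image feature to a sequence of topic vectors, and a low-level
    (word) RNN decodes each topic vector into a token sequence.
    Dimensions and cell types are hypothetical stand-ins."""

    def __init__(self, d_img=16, d_topic=8, d_word=8, vocab=20):
        self.Wt = init((d_topic, d_img + d_topic))  # sentence-level recurrence
        self.Ww = init((d_word, d_topic + d_word))  # word-level recurrence
        self.Wo = init((vocab, d_word))             # word-logit projection
        self.d_topic, self.d_word = d_topic, d_word

    def generate(self, img_feat, n_sents=3, sent_len=5):
        paragraph = []
        h = np.zeros(self.d_topic)
        for _ in range(n_sents):
            # High-level RNN: recursively produce one topic vector per
            # sentence, conditioned on the image features at every step.
            h = np.tanh(self.Wt @ np.concatenate([img_feat, h]))
            sent, g = [], np.zeros(self.d_word)
            for _ in range(sent_len):
                # Low-level RNN: decode the topic vector into words
                # (greedy argmax decoding for simplicity).
                g = np.tanh(self.Ww @ np.concatenate([h, g]))
                sent.append(int(np.argmax(self.Wo @ g)))
            paragraph.append(sent)
        return paragraph

model = HierarchicalCaptioner()
paragraph = model.generate(rng.standard_normal(16))
```

With untrained weights the output tokens are arbitrary; the point is the control flow, i.e., one topic vector per sentence and one word sequence per topic, which is the structure shared by the HRNN variants cited above.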
May-10-2021