AITopics | Dahua Lin

Contrastive Learning for Image Captioning

Bo Dai, Dahua Lin

Neural Information Processing SystemsMay-28-2025, 00:55:55 GMT

Image captioning, a popular topic in computer vision, has achieved substantial progress in recent years. However, the distinctiveness of natural descriptions is often overlooked in previous work. It is closely related to the quality of captions, as distinctive captions are more likely to describe images with their unique aspects. In this work, we propose a new learning method, Contrastive Learning (CL), for image captioning. Specifically, via two constraints formulated on top of a reference model, the proposed method can encourage distinctiveness, while maintaining the overall quality of the generated captions. We tested our method on two challenging datasets, where it improves the baseline model by significant margins. We also showed in our studies that the proposed method is generic and can be used for models with various structures.

artificial intelligence, caption, machine learning, (15 more...)

Neural Information Processing Systems

Country:

North America > United States (0.14)
Europe > Spain (0.14)

Genre: Research Report > New Finding (0.48)

Industry: Leisure & Entertainment > Sports > Tennis (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

A Neural Compositional Paradigm for Image Captioning

Bo Dai, Sanja Fidler, Dahua Lin

Neural Information Processing SystemsMay-26-2025, 13:16:26 GMT

Mainstream captioning models often follow a sequential structure to generate captions, leading to issues such as introduction of irrelevant semantics, lack of diversity in the generated captions, and inadequate generalization performance. In this paper, we present an alternative paradigm for image captioning, which factorizes the captioning procedure into two stages: (1) extracting an explicit semantic representation from the given image; and (2) constructing the caption based on a recursive compositional procedure in a bottom-up manner. Compared to conventional ones, our paradigm better preserves the semantic content through an explicit factorization of semantics and syntax. By using the compositional generation procedure, caption construction follows a recursive structure, which naturally fits the properties of human language. Moreover, the proposed compositional procedure requires less data to train, generalizes better, and yields more diverse captions.

caption, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country:

Asia > China > Hong Kong (0.40)
North America > Canada > Ontario > Toronto (0.14)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)

Add feedback

Trajectory Convolution for Action Recognition

Yue Zhao, Yuanjun Xiong, Dahua Lin

Neural Information Processing SystemsMay-26-2025, 07:58:41 GMT

How to leverage the temporal dimension is one major question in video analysis. Recent works [47, 36] suggest an efficient approach to video feature learning, i.e., factorizing 3D convolutions into separate components respectively for spatial and temporal convolutions. The temporal convolution, however, comes with an implicit assumption - the feature maps across time steps are well aligned so that the features at the same locations can be aggregated. This assumption can be overly strong in practical applications, especially in action recognition where the motion serves as a crucial cue. In this work, we propose a new CNN architecture TrajectoryNet, which incorporates trajectory convolution, a new operation for integrating features along the temporal dimension, to replace the existing temporal convolution. This operation explicitly takes into account the changes in contents caused by deformation or motion, allowing the visual features to be aggregated along the the motion paths, trajectories. On two large-scale action recognition datasets, Something-Something V1 and Kinetics, the proposed network architecture achieves notable improvement over strong baselines.

artificial intelligence, convolution, machine learning, (17 more...)

Neural Information Processing Systems

Country:

Asia (0.14)
North America > Canada (0.14)

Technology:

Information Technology > Artificial Intelligence > Vision > Image Understanding (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

A Neural Compositional Paradigm for Image Captioning

Bo Dai, Sanja Fidler, Dahua Lin

Neural Information Processing SystemsMar-26-2025, 12:49:15 GMT

Mainstream captioning models often follow a sequential structure to generate captions, leading to issues such as introduction of irrelevant semantics, lack of diversity in the generated captions, and inadequate generalization performance. In this paper, we present an alternative paradigm for image captioning, which factorizes the captioning procedure into two stages: (1) extracting an explicit semantic representation from the given image; and (2) constructing the caption based on a recursive compositional procedure in a bottom-up manner. Compared to conventional ones, our paradigm better preserves the semantic content through an explicit factorization of semantics and syntax. By using the compositional generation procedure, caption construction follows a recursive structure, which naturally fits the properties of human language. Moreover, the proposed compositional procedure requires less data to train, generalizes better, and yields more diverse captions.

caption, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country:

North America > Canada (0.46)
Asia > China > Hong Kong (0.40)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)

Add feedback

Trajectory Convolution for Action Recognition

Yue Zhao, Yuanjun Xiong, Dahua Lin

Neural Information Processing SystemsMar-26-2025, 11:31:30 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, convolution, machine learning, (19 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Policy Continuation with Hindsight Inverse Dynamics

Hao Sun, Zhizhong Li, Xiaotong Liu, Bolei Zhou, Dahua Lin

Neural Information Processing SystemsMar-23-2025, 03:59:37 GMT

Solving goal-oriented tasks is an important but challenging problem in reinforcement learning (RL). For such tasks, the rewards are often sparse, making it difficult to learn a policy effectively. To tackle this difficulty, we propose a new approach called Policy Continuation with Hindsight Inverse Dynamics (PCHID). This approach learns from Hindsight Inverse Dynamics based on Hindsight Experience Replay. Enabling the learning process in a self-imitated manner and thus can be trained with supervised learning. This work also extends it to multi-step settings with Policy Continuation. The proposed method is general, which can work in isolation or be combined with other on-policy and off-policy algorithms.

artificial intelligence, machine learning, reinforcement learning, (9 more...)

Neural Information Processing Systems

Country: Asia (0.28)

Industry: Leisure & Entertainment > Games > Computer Games (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Policy Continuation with Hindsight Inverse Dynamics

Hao Sun, Zhizhong Li, Xiaotong Liu, Bolei Zhou, Dahua Lin

Neural Information Processing SystemsJan-22-2025, 22:41:26 GMT

Solving goal-oriented tasks is an important but challenging problem in reinforcement learning (RL). For such tasks, the rewards are often sparse, making it difficult to learn a policy effectively. To tackle this difficulty, we propose a new approach called Policy Continuation with Hindsight Inverse Dynamics (PCHID). This approach learns from Hindsight Inverse Dynamics based on Hindsight Experience Replay. Enabling the learning process in a self-imitated manner and thus can be trained with supervised learning. This work also extends it to multi-step settings with Policy Continuation. The proposed method is general, which can work in isolation or be combined with other on-policy and off-policy algorithms.

artificial intelligence, machine learning, reinforcement learning, (11 more...)

Neural Information Processing Systems

Country: