
 Chuang Gan




Cross-channel Communication Networks

Neural Information Processing Systems

While much progress has been made by making networks deeper, the filters at each layer generate their responses independently given the input and do not communicate with each other. In this paper, we introduce a novel network unit called the Cross-channel Communication (C3) block, a simple yet effective module that encourages communication across filters within the same layer. The C3 block enables filters to exchange information through a micro neural network, which consists of a feature encoder, a message passer, and a feature decoder, before sending the information to the next layer. With the C3 block, each channel's response is modulated by accounting for the responses of other channels. Extensive experiments on multiple vision tasks show that our proposed block brings improvements to different CNN architectures and learns more diverse and complementary representations.
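
A minimal sketch of the encoder/message-passer/decoder structure described above, assuming a PyTorch-style residual module; the attention-based message passing, layer sizes, and names are illustrative assumptions rather than the paper's exact configuration.

    import torch
    import torch.nn as nn

    class C3Block(nn.Module):
        # Each channel's H*W response map is encoded to a vector, channels
        # exchange messages via pairwise attention, and a decoder produces a
        # per-channel residual that modulates the original responses.
        def __init__(self, spatial_size, hidden_dim=64):
            super().__init__()
            self.encoder = nn.Linear(spatial_size, hidden_dim)   # feature encoder
            self.decoder = nn.Linear(hidden_dim, spatial_size)   # feature decoder

        def forward(self, x):
            b, c, h, w = x.shape
            flat = x.view(b, c, h * w)                   # one vector per channel
            enc = torch.relu(self.encoder(flat))         # (b, c, hidden_dim)
            # Message passing: attention weights over all channel pairs.
            attn = torch.softmax(enc @ enc.transpose(1, 2) / enc.size(-1) ** 0.5, dim=-1)
            messages = attn @ enc                        # each channel aggregates the others
            dec = self.decoder(messages)                 # back to spatial size
            return x + dec.view(b, c, h, w)              # modulated channel responses

    # Usage: drop the block after a convolutional layer with known spatial size.
    features = torch.randn(2, 32, 16, 16)
    out = C3Block(spatial_size=16 * 16)(features)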


Visual Concept-Metaconcept Learning

Neural Information Processing Systems

Humans reason with concepts and metaconcepts: we recognize red and green from visual input; we also understand that they describe the same property of objects (i.e., color). In this paper, we propose the visual concept-metaconcept learner (VCML) for joint learning of concepts and metaconcepts from images and associated question-answer pairs. The key is to exploit the bidirectional connection between visual concepts and metaconcepts. Visual representations provide grounding cues for predicting relations between unseen pairs of concepts: knowing that red and green describe the same property of objects, we generalize to the fact that cube and sphere also describe the same property of objects, since both categorize the shape of objects. Meanwhile, knowledge about metaconcepts empowers visual concept learning from limited, noisy, and even biased data: from just a few examples of purple cubes, we can understand a new color, purple, which resembles the hue of the cubes rather than their shape.
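
A minimal sketch of how a shared concept embedding space could support both directions, assuming PyTorch; the scoring functions, module names, and the "same property" relation head are illustrative assumptions, not VCML's actual architecture.

    import torch
    import torch.nn as nn

    class ConceptMetaconceptSketch(nn.Module):
        # Concepts live in one embedding space that (a) grounds object features
        # and (b) feeds a relation head that scores metaconcept statements such
        # as "these two concepts describe the same property of objects".
        def __init__(self, num_concepts, dim=64):
            super().__init__()
            self.concept_emb = nn.Embedding(num_concepts, dim)
            self.relation_head = nn.Sequential(
                nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

        def visual_score(self, object_feat, concept_id):
            # Grounding cue: how well does this object feature match the concept?
            return (object_feat * self.concept_emb(concept_id)).sum(-1)

        def metaconcept_score(self, concept_a, concept_b):
            # Metaconcept cue: do the two concepts stand in the queried relation?
            pair = torch.cat([self.concept_emb(concept_a),
                              self.concept_emb(concept_b)], dim=-1)
            return self.relation_head(pair).squeeze(-1)

Training both heads against visual questions and metaconcept questions ties the two signals together through the shared concept embeddings.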


Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding

Neural Information Processing Systems

We marry two powerful ideas: deep representation learning for visual recognition and language understanding, and symbolic program execution for reasoning. Our neural-symbolic visual question answering (NS-VQA) system first recovers a structural scene representation from the image and a program trace from the question. It then executes the program on the scene representation to obtain an answer. Incorporating symbolic structure as prior knowledge offers three unique advantages. First, executing programs in a symbolic space is more robust to long program traces; our model can solve complex reasoning tasks better, achieving an accuracy of 99.8% on the CLEVR dataset. Second, the model is more data- and memory-efficient: it performs well after learning from a small amount of training data, and it can encode an image into a compact representation, requiring less storage than existing methods for offline question answering. Third, symbolic program execution offers full transparency into the reasoning process; we are thus able to interpret and diagnose each execution step.
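
A minimal sketch of the execution stage, assuming the scene has already been parsed into a table of objects and the question into a program trace; the attribute names and the tiny two-op program vocabulary here are illustrative, not the paper's actual DSL.

    # Structural scene representation recovered from the image (illustrative).
    scene = [
        {"shape": "cube",   "color": "red",  "size": "large"},
        {"shape": "sphere", "color": "blue", "size": "small"},
        {"shape": "cube",   "color": "blue", "size": "small"},
    ]

    def execute(program, objects):
        # Run a program trace step by step; every intermediate result is
        # inspectable, which is what makes the reasoning transparent.
        result = objects
        for op, *args in program:
            if op == "filter":
                attr, value = args
                result = [o for o in result if o[attr] == value]
            elif op == "count":
                result = len(result)
        return result

    # Program trace parsed from "How many blue cubes are there?"
    program = [("filter", "color", "blue"), ("filter", "shape", "cube"), ("count",)]
    print(execute(program, scene))  # -> 1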


Weakly Supervised Dense Event Captioning in Videos

Neural Information Processing Systems

Dense event captioning aims to detect and describe all events of interest contained in a video. Despite advances in this area, existing methods tackle the task by relying on dense temporal annotations, which are extremely costly to obtain. This paper formulates a new problem: weakly supervised dense event captioning, which does not require temporal segment annotations for model training. Our solution is based on a one-to-one correspondence assumption: each caption describes one temporal segment, and each temporal segment has one caption. This assumption holds in current benchmark datasets and in most real-world cases. We decompose the problem into a pair of dual problems, event captioning and sentence localization, and present a cycle system to train our model. Extensive experimental results demonstrate the ability of our model on both dense event captioning and sentence localization in videos.
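
A minimal sketch of the cycle idea, assuming PyTorch and two placeholder modules (a captioner mapping a video segment to caption features and a localizer mapping caption features to a segment); the interfaces and loss terms are illustrative assumptions, not the paper's exact training objective.

    import torch.nn as nn
    import torch.nn.functional as F

    class CaptionLocalizationCycle(nn.Module):
        # Without segment labels, the two dual models supervise each other:
        # localize the given caption, re-caption the predicted segment, and
        # require the regenerated caption to localize back to the same segment.
        def __init__(self, captioner, localizer):
            super().__init__()
            self.captioner = captioner   # assumed: (video_feats, segment) -> caption_feats
            self.localizer = localizer   # assumed: (video_feats, caption_feats) -> segment

        def forward(self, video_feats, caption_feats):
            segment = self.localizer(video_feats, caption_feats)
            regenerated = self.captioner(video_feats, segment)
            segment_back = self.localizer(video_feats, regenerated)
            # Cycle consistency on the segment plus reconstruction of the caption.
            return F.l1_loss(segment_back, segment) + F.l1_loss(regenerated, caption_feats)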

