Goto

Collaborating Authors

 Nanjing University of Science and Technology


Kill Two Birds With One Stone: Weakly-Supervised Neural Network for Image Annotation and Tag Refinement

AAAI Conferences

The number of social images has exploded by the wide adoption of social networks, and people like to share their comments about them. These comments can be a description of the image, or some objects, attributes, scenes in it, which are normally used as the user-provided tags. However, it is well-known that user-provided tags are incomplete and imprecise to some extent. Directly using them can damage the performance of related applications, such as the image annotation and retrieval. In this paper, we propose to learn an image annotation model and refine the user-provided tags simultaneously in a weakly-supervised manner. The deep neural network is utilized as the image feature learning and backbone annotation model, while visual consistency, semantic dependency, and user-error sparsity are introduced as the constraints at the batch level to alleviate the tag noise. Therefore, our model is highly flexible and stable to handle large-scale image sets. Experimental results on two benchmark datasets indicate that our proposed model achieves the best performance compared to the state-of-the-art methods.


Show, Reward and Tell: Automatic Generation of Narrative Paragraph From Photo Stream by Adversarial Training

AAAI Conferences

Impressive image captioning results (i.e., an objective description for an image) are achieved with plenty of training pairs. In this paper, we take one step further to investigate the creation of narrative paragraph for a photo stream. This task is even more challenging due to the difficulty in modeling an ordered photo sequence and in generating a relevant paragraph with expressive language style for storytelling. The difficulty can even be exacerbated by the limited training data, so that existing approaches almost focus on search-based solutions. To deal with these challenges, we propose a sequence-to-sequence modeling approach with reinforcement learning and adversarial training. First, to model the ordered photo stream, we propose a hierarchical recurrent neural network as story generator, which is optimized by reinforcement learning with rewards. Second, to generate relevant and story-style paragraphs, we design the rewards with two critic networks, including a multi-modal and a language-style discriminator. Third, we further consider the story generator and reward critics as adversaries. The generator aims to create indistinguishable paragraphs to human-level stories, whereas the critics aim at distinguishing them and further improving the generator by policy gradient. Experiments on three widely-used datasets show the effectiveness, against state-of-the-art methods with relative increase of 20.2% by METEOR. We also show the subjective preference for the proposed approach over the baselines through a user study with 30 human subjects.


Discovering and Distinguishing Multiple Visual Senses for Polysemous Words

AAAI Conferences

To reduce the dependence on labeled data, there have been increasing research efforts on learning visual classifiers by exploiting web images. One issue that limits their performance is the problem of polysemy. To solve this problem, in this work, we present a novel framework that solves the problem of polysemy by allowing sense-specific diversity in search results. Specifically, we first discover a list of possible semantic senses to retrieve sense-specific images. Then we merge visual similar semantic senses and prune noises by using the retrieved images. Finally, we train a visual classifier for each selected semantic sense and use the learned sense-specific classifiers to distinguish multiple visual senses. Extensive experiments on classifying images into sense-specific categories and re-ranking search results demonstrate the superiority of our proposed approach.


Label Distribution Learning by Exploiting Label Correlations

AAAI Conferences

Label distribution learning (LDL) is a newly arisen machine learning method that has been increasingly studied in recent years. In theory, LDL can be seen as a generalization of multi-label learning. Previous studies have shown that LDL is an effective approach to solve the label ambiguity problem. However, the dramatic increase in the number of possible label sets brings a challenge in performance to LDL. In this paper, we propose a novel label distribution learning algorithm to address the above issue. The key idea is to exploit correlations between different labels. We encode the label correlation into a distance to measure the similarity of any two labels. Moreover, we construct a distance-mapping function from the label set to the parameter matrix. Experimental results on eight real label distributed data sets demonstrate that the proposed algorithm performs remarkably better than both the state-of-the-art LDL methods and multi-label learning methods.


Spatio-Temporal Graph Convolution for Skeleton Based Action Recognition

AAAI Conferences

Variations of human body skeletons may be considered as dynamic graphs, which are generic data representation for numerous real-world applications. In this paper, we propose a spatio-temporal graph convolution (STGC) approach for assembling the successes of local convolutional filtering and sequence learning ability of autoregressive moving average. To encode dynamic graphs, the constructed multi-scale local graph convolution filters, consisting of matrices of local receptive fields and signal mappings, are recursively performed on structured graph data of temporal and spatial domain. The proposed model is generic and principled as it can be generalized into other dynamic models. We theoretically prove the stability of STGC and provide an upper-bound of the signal transformation to be learnt. Further, the proposed recursive model can be stacked into a multi-layer architecture. To evaluate our model, we conduct extensive experiments on four benchmark skeleton-based action datasets, including the large-scale challenging NTU RGB+D. The experimental results demonstrate the effectiveness of our proposed model and the improvement over the state-of-the-art.


Label Distribution Learning by Exploiting Sample Correlations Locally

AAAI Conferences

Label distribution learning (LDL) is a novel multi-label learning paradigm proposed in recent years for solving label ambiguity. Existing approaches typically exploit label correlations globally to improve the effectiveness of label distribution learning, by assuming that the label correlations are shared by all instances. However, different instances may share different label correlations, and few correlations are globally applicable in real-world applications. In this paper, we propose a new label distribution learning algorithm by exploiting sample correlations locally (LDL-SCL). To encode the influence of local samples, we design a local correlation vector for each instance based on the clustered local samples. Then we predict the label distribution for an unseen instance based on the original features and the local correlation vector simultaneously. Experimental results demonstrate that LDL-SCL can effectively deal with the label distribution problems and perform remarkably better than the state-of-the-art LDL methods.


Nonlinear Pairwise Layer and Its Training for Kernel Learning

AAAI Conferences

Kernel learning is a fundamental technique that has been intensively studied in the past decades. For the complicated practical tasks, the traditional "shallow" kernels (e.g., Gaussian kernel and sigmoid kernel) are not flexible enough to produce satisfactory performance. To address this shortcoming, this paper introduces a nonlinear layer in kernel learning to enhance the model flexibility. This layer is pairwise, which fully considers the coupling information among examples. So our model contains a fixed single mapping layer (i.e. a Gaussian kernel) as well as a nonlinear pairwise layer, thereby achieving better flexibility than the existing kernel structures. Moreover, the proposed structure can be seamlessly embedded to Support Vector Machines (SVM), of which the training process can be formulated as a joint optimization problem including nonlinear function learning and standard SVM optimization. We theoretically prove that the objective function is gradient-Lipschitz continuous, which further guides us how to accelerate the optimization process in a deep kernel architecture. Experimentally, we find that the proposed structure outperforms other state-ofthe-art kernel-based algorithms on various benchmark datasets, and thus the effectiveness of the incorporated pairwise layer with its training approach is demonstrated.


Compact Multi-Label Learning

AAAI Conferences

Embedding methods have shown promising performance in multi-label prediction, as they can discover the dependency of labels. Most embedding methods cannot well align the input and output, which leads to degradation in prediction performance. Besides, they suffer from expensive prediction computational costs when applied to large-scale datasets. To address the above issues, this paper proposes a Co-Hashing (CoH) method by formulating multi-label learning from the perspective of cross-view learning. CoH first regards the input and output as two views, and then aims to learn a common latent hamming space, where input and output pairs are compressed into compact binary embeddings. CoH enjoys two key benefits: 1) the input and output can be well aligned, and their correlations are explored; 2) the prediction is very efficient using fast cross-view kNN search in the hamming space. Moreover, we provide the generalization error bound for our method. Extensive experiments on eight real-world datasets demonstrate the superiority of the proposed CoH over the state-of-the-art methods in terms of both prediction accuracy and efficiency.


Exploring Commonality and Individuality for Multi-Modal Curriculum Learning

AAAI Conferences

Curriculum Learning (CL) mimics the cognitive process ofhumans and favors a learning algorithm to follow the logical learning sequence from simple examples to more difficult ones. Recent studies show that selecting the simplest curriculum examples from different modalities for graph-based label propagation can yield better performance than simply leveraging single modality. However, they forcibly requirethe curriculums generated by all modalities to be identical to a common curriculum, which discard the individuality ofevery modality and produce the inaccurate curriculum for the subsequent learning. Therefore, this paper proposes a novel multi-modal CL algorithm by comprehensively investigating both the individuality and commonality of different modalities. By considering the curriculums of multiple modalities altogether, their common preference on selecting the simplestexamples can be explored by a row-sparse matrix, and their distinct opinions are captured by a sparse noise matrix. As a consequence, a "soft" fusion of multiple curriculums from different modalities is achieved and the propagation quality can thus be improved. Comprehensive empirical studies reveal that our method can generate higher accuracy than the state-of-the-art multi-modal CL approach and label propagation algorithms on various image classification tasks.


Weakly-Supervised Deep Nonnegative Low-Rank Model for Social Image Tag Refinement and Assignment

AAAI Conferences

It has been well known that the user-provided tags of social images are imperfect, i.e., there exist noisy, irrelevant or incomplete tags. It heavily degrades the performance of many multimedia tasks. To alleviate this problem, we propose a Weakly-supervised Deep Nonnegative Low-rank model (WDNL) to improve the quality of tags by integrating the low-rank model with deep feature learning. A nonnegative low-rank model is introduced to uncover the intrinsic relationships between images and tags by simultaneously removing noisy or irrelevant tags and complementing missing tags. The deep architecture is leveraged to seamlessly connect the visual content and the semantic tag. That is, the proposed model can well handle the scalability by assigning tags to new images. Extensive experiments conducted on two real-world datasets demonstrate the effectiveness of the proposed method compared with some state-of-the-art methods.