Rui, Yong

A Distributed Approach towards Discriminative Distance Metric Learning

Machine Learning

Distance metric learning is successful in discovering intrinsic relations in data. However, most algorithms are computationally demanding when the problem size becomes large. In this paper, we propose a discriminative metric learning algorithm and develop a distributed scheme that learns metrics on moderate-sized subsets of data and aggregates the results into a global solution. The technique leverages the power of parallel computation. The resulting algorithm, aggregated distance metric learning (ADML), scales well with the data size and can be controlled by the partition. We theoretically analyse and provide bounds for the error induced by the distributed treatment. We have conducted experimental evaluation of ADML, both on specially designed tests and on practical image annotation tasks. Those tests have shown that ADML achieves state-of-the-art performance at only a fraction of the cost incurred by most existing methods.
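The partition-then-aggregate idea can be illustrated with a minimal sketch. The local learner below is a deliberately simple stand-in (a diagonal metric from inverse within-class variance), not the discriminative learner proposed in the paper, and plain averaging is assumed as the aggregation rule. All function names are hypothetical.

```python
import random

def local_diagonal_metric(points, labels, eps=1e-6):
    """Stand-in local learner (an assumption, not the paper's method):
    weight each dimension by the inverse of its within-class variance."""
    dim = len(points[0])
    by_class = {}
    for p, y in zip(points, labels):
        by_class.setdefault(y, []).append(p)
    within = [0.0] * dim
    for members in by_class.values():
        for j in range(dim):
            mean_j = sum(m[j] for m in members) / len(members)
            within[j] += sum((m[j] - mean_j) ** 2 for m in members)
    return [1.0 / (within[j] / len(points) + eps) for j in range(dim)]

def aggregated_metric(points, labels, n_parts, seed=0):
    """Partition the data, learn one metric per subset, average the results."""
    idx = list(range(len(points)))
    random.Random(seed).shuffle(idx)
    parts = [idx[k::n_parts] for k in range(n_parts)]
    # Each subset is independent, so this loop could run in parallel.
    metrics = [
        local_diagonal_metric([points[i] for i in part],
                              [labels[i] for i in part])
        for part in parts
    ]
    dim = len(points[0])
    return [sum(m[j] for m in metrics) / n_parts for j in range(dim)]

def metric_distance(weights, a, b):
    """Squared distance under a diagonal metric."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))
```

The number of partitions trades per-subset cost against how much data each local metric sees, which is one way to read the remark that the method is controlled by the partition.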

Sequence-to-Sequence Learning via Shared Latent Representation

AAAI Conferences

Sequence-to-sequence learning is a popular research area in deep learning, with applications such as video captioning and speech recognition. Existing methods model this learning as a mapping process by first encoding the input sequence to a fixed-sized vector and then decoding the target sequence from the vector. Although simple and intuitive, such a mapping model is task-specific and cannot be directly used for different tasks. In this paper, we propose a star-like framework for general and flexible sequence-to-sequence learning, where different types of media content (the peripheral nodes) can be encoded to and decoded from a shared latent representation (SLR) (the central node). This is inspired by the fact that the human brain can learn and express an abstract concept in different ways. The media-invariant property of SLR can be seen as a high-level regularization on the intermediate vector, forcing it to capture not only the latent representation within each individual medium, as auto-encoders do, but also the transitions between media, as mapping models do. Moreover, the SLR model is content-specific, which means it only needs to be trained once for a dataset and can then be used for different tasks. We show how to train an SLR model via dropout and use it for different sequence-to-sequence tasks. Our SLR model is validated on the Youtube2Text and MSR-VTT datasets, achieving superior performance on the video-to-sentence task and the first sentence-to-video results.

Offline Sketch Parsing via Shapeness Estimation

AAAI Conferences

In this work, we target the problem of offline sketch parsing, in which the temporal order of strokes is unavailable. It is more challenging than most existing work, which usually leverages the temporal information to reduce the search space. Unlike traditional approaches, in which thousands of candidate groups are selected for recognition, we propose the idea of shapeness estimation to greatly and quickly reduce this number. Based on the observation that most hand-drawn shapes with well-defined closed boundaries can be clearly differentiated from non-shapes if normalized to a very small size, we propose an efficient shapeness estimation method. A compact feature representation and an efficient method for extracting it are also proposed to speed up this process. Based on the proposed shapeness estimation, we present a three-stage cascade framework for offline sketch parsing. The shapeness estimation technique in this framework greatly reduces the number of false positives, resulting in a 96.2% detection rate with only 32 candidate group proposals, which is two orders of magnitude fewer than existing methods. Extensive experiments show the superiority of the proposed framework over state-of-the-art works on sketch parsing in both effectiveness and efficiency, even though they leveraged the temporal information of strokes.
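The "normalize to a very small size" step might look roughly like the sketch below, which maps the stroke points of a candidate group onto a tiny occupancy grid. The 8x8 grid size, the point-list input, and the function name are assumptions for illustration; the abstract does not specify the actual feature or classifier.

```python
def normalize_to_grid(points, size=8):
    """Scale a candidate group's stroke points into a size x size binary grid.

    `points` is a list of (x, y) pairs; the grid size is an assumed value.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, min_y = min(xs), min(ys)
    # Guard against zero extent (e.g. a perfectly vertical stroke).
    span_x = (max(xs) - min_x) or 1.0
    span_y = (max(ys) - min_y) or 1.0
    grid = [[0] * size for _ in range(size)]
    for x, y in points:
        col = min(int((x - min_x) / span_x * size), size - 1)
        row = min(int((y - min_y) / span_y * size), size - 1)
        grid[row][col] = 1
    return grid
```

For a closed outline such as a hand-drawn rectangle, the border cells of the grid fill in while the interior stays empty, which is the kind of pattern a fast shape/non-shape classifier could pick up on.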

Learning Word Representation Considering Proximity and Ambiguity

AAAI Conferences

Distributed representations of words (aka word embedding) have proven helpful in solving natural language processing (NLP) tasks. Training distributed representations of words with neural networks has lately been a major focus of researchers in the field. Recent work on word embedding, namely the Continuous Bag-of-Words (CBOW) model and the Continuous Skip-gram (Skip-gram) model, has produced particularly impressive results, significantly speeding up the training process to enable word representation learning from large-scale data. However, neither CBOW nor Skip-gram pays enough attention to word proximity in terms of model or to word ambiguity in terms of linguistics. In this paper, we propose Proximity-Ambiguity Sensitive (PAS) models (i.e. PAS CBOW and PAS Skip-gram) to produce high-quality distributed representations of words considering both word proximity and ambiguity. From the model perspective, we introduce proximity weights as parameters to be learned in PAS CBOW and used in PAS Skip-gram. By better modeling word proximity, we reveal the strength of pooling-structured neural networks in word representation learning. The proximity-sensitive pooling layer can also be applied to other neural network applications that employ pooling layers. From the linguistics perspective, we train multiple representation vectors per word. Each representation vector corresponds to a particular group of POS tags of the word. By using PAS models, we achieved a 16.9% increase in accuracy over state-of-the-art models.
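A rough sketch of the two ideas, under stated assumptions: the pooling function below replaces CBOW's uniform average of context vectors with one weight per context position (normalizing by the weight sum is an assumption, and in the paper the weights are learned rather than supplied), and the lookup function keys embeddings by a (word, POS group) pair. Function names, tag names, and groupings are hypothetical.

```python
def proximity_weighted_pool(context_vectors, proximity_weights):
    """Pool context word vectors using one weight per context position.

    With equal weights this reduces to the plain average used by CBOW.
    """
    if len(context_vectors) != len(proximity_weights):
        raise ValueError("need one weight per context position")
    total = sum(proximity_weights)
    dim = len(context_vectors[0])
    return [
        sum(w * vec[j] for w, vec in zip(proximity_weights, context_vectors)) / total
        for j in range(dim)
    ]

def lookup_vector(embeddings, pos_group_of, word, pos_tag):
    """Fetch the vector for a word under the POS group its tag belongs to."""
    return embeddings[(word, pos_group_of[pos_tag])]
```

For example, a word such as "book" would get one vector for its noun group and another for its verb group, so the two senses are not forced into a single point.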

Sketch Recognition with Natural Correction and Editing

AAAI Conferences

In this paper, we target the problem of sketch recognition. We systematically study how to incorporate users' correction and editing into isolated and full sketch recognition. This is a natural and necessary interaction in real systems such as Visio, where very similar shapes exist. First, a novel algorithm is proposed to mine the prior shape knowledge for three editing modes. Second, to differentiate visually similar shapes, a novel symbol recognition algorithm is introduced that leverages the learnt shape knowledge. Then, a novel editing detection algorithm is proposed to facilitate symbol recognition. Furthermore, both the symbol recognizer and the editing detector are systematically incorporated into the full sketch recognition. Finally, based on the proposed algorithms, a real-time sketch recognition system is built to recognize hand-drawn flowcharts and diagrams with flexible interactions. Extensive experiments show the effectiveness of the proposed algorithms.

Sparse Transfer Learning for Interactive Video Search Reranking

Machine Learning

Visual reranking is effective in improving the performance of text-based video search. However, existing reranking algorithms can only achieve limited improvement because of the well-known semantic gap between low-level visual features and high-level semantic concepts. In this paper, we adopt interactive video search reranking to bridge the semantic gap by introducing users' labeling effort. We propose a novel dimension reduction tool, termed sparse transfer learning (STL), to effectively and efficiently encode users' labeling information. STL is particularly designed for interactive video search reranking. Technically, it a) considers the pair-wise discriminative information to maximally separate labeled query-relevant samples from labeled query-irrelevant ones, b) achieves a sparse representation for the subspace to encode the user's intention by applying the elastic net penalty, and c) propagates users' labeling information from labeled samples to unlabeled samples by using the data distribution knowledge. We conducted extensive experiments on the TRECVID 2005, 2006 and 2007 benchmark datasets and compared STL with popular dimension reduction algorithms. We report superior performance by using the proposed STL-based interactive video search reranking.
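To make the sparsity mechanism in step b) concrete, here is a minimal sketch of the elastic net penalty in its common form, l1 * sum|w| + l2 * sum(w^2), together with its closed-form shrinkage step. This is the generic penalty only, not the STL objective; the function names and the exact scaling convention are assumptions.

```python
def elastic_net_penalty(weights, l1_strength, l2_strength):
    """Elastic net penalty: l1 * sum|w| + l2 * sum(w^2)."""
    l1_term = sum(abs(w) for w in weights)
    l2_term = sum(w * w for w in weights)
    return l1_strength * l1_term + l2_strength * l2_term

def elastic_net_shrink(values, l1_strength, l2_strength):
    """Minimize 0.5*(x - v)^2 + l1*|x| + l2*x^2 for each coordinate v.

    The L1 part soft-thresholds small entries to exactly zero (sparsity);
    the L2 part rescales what remains.
    """
    shrunk = []
    for v in values:
        magnitude = max(abs(v) - l1_strength, 0.0)
        sign = 1.0 if v >= 0 else -1.0
        shrunk.append(sign * magnitude / (1.0 + 2.0 * l2_strength))
    return shrunk
```

Entries whose magnitude falls below the L1 strength are set exactly to zero, which is what yields a sparse projection, while the L2 term keeps the remaining coefficients from growing too large.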