Nonlinear Monte Carlo Method for Imbalanced Data Learning
Shen, Xuli, Xu, Qing, Xue, Xiangyang
For basic machine learning problems, the expected error is used to evaluate model performance. Since the distribution of the data is usually unknown, we make the simple hypothesis that the data are sampled independently and identically distributed (i.i.d.), and by the Law of Large Numbers (LLN) the mean value of the loss function is used as the empirical risk. This is known as the Monte Carlo method. However, when the LLN is not applicable, as in imbalanced data problems, the empirical risk causes overfitting and may decrease robustness and generalization ability. Inspired by the framework of nonlinear expectation theory, we substitute the maximum of the subgroup mean losses for the mean value of the loss function; we call this the nonlinear Monte Carlo method. To make numerical optimization applicable, we linearize and smooth the functional of the maximum empirical risk and obtain the descent direction via quadratic programming. With the proposed method, we achieve better performance than SOTA backbone models with fewer training steps, and greater robustness, on basic regression and imbalanced classification tasks.
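A minimal sketch of the core objective described in the abstract: replace the plain empirical risk (global mean loss) with the maximum of per-subgroup mean losses. The toy data, the squared loss, and the subgroup labels below are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced regression data: 95 samples from one group, 5 from another.
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.1, size=100)
groups = np.array([0] * 95 + [1] * 5)   # hypothetical subgroup labels

def empirical_risk(w, x, y):
    """Classical Monte Carlo estimate: the global mean loss."""
    return ((w * x - y) ** 2).mean()

def subgroup_max_risk(w, x, y, groups):
    """Nonlinear Monte Carlo estimate: max over subgroups of the mean loss."""
    losses = (w * x - y) ** 2
    return max(losses[groups == g].mean() for g in np.unique(groups))

w = 1.5
print("empirical risk:     ", empirical_risk(w, x, y))
print("max subgroup risk:  ", subgroup_max_risk(w, x, y, groups))
```

The max-of-means risk cannot be driven down by fitting the majority group alone, which is the intuition behind its robustness on imbalanced data.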
Sketch-BERT: Learning Sketch Bidirectional Encoder Representation from Transformers by Self-supervised Learning of Sketch Gestalt
Lin, Hangyu, Fu, Yanwei, Jiang, Yu-Gang, Xue, Xiangyang
Previous research on sketches has often considered them in pixel format and leveraged CNN-based models for sketch understanding. Fundamentally, however, a sketch is stored as a sequence of data points, a vector-format representation, rather than a photo-realistic image of pixels. SketchRNN studied a generative neural representation for vector-format sketches using Long Short-Term Memory (LSTM) networks. Unfortunately, the representation learned by SketchRNN is primarily suited to generation tasks rather than to recognition and retrieval of sketches. To this end, and inspired by the recent BERT model, we present a model for learning Sketch Bidirectional Encoder Representations from Transformers (Sketch-BERT). We generalize BERT to the sketch domain with newly proposed components and pre-training algorithms, including newly designed sketch embedding networks and self-supervised learning of sketch gestalt. In particular, as the pre-training task, we present a novel Sketch Gestalt Model (SGM) to help train Sketch-BERT. Experimentally, we show that the learned representation of Sketch-BERT improves performance on the downstream tasks of sketch recognition, sketch retrieval, and sketch gestalt.
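A hedged sketch of the masking step behind BERT-style self-supervised pre-training on vector sketches: hide a fraction of the stroke points and treat them as reconstruction targets. The (dx, dy, pen_state) point encoding, mask ratio, and mask value are assumptions for illustration, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sketch: 20 points, each (dx, dy, pen_state).
sketch = np.concatenate(
    [rng.normal(size=(20, 2)), rng.integers(0, 2, size=(20, 1))], axis=1
)

def mask_sketch(points, mask_ratio=0.15, mask_value=0.0):
    """Return masked points and the boolean mask marking reconstruction targets."""
    mask = rng.random(len(points)) < mask_ratio
    masked = points.copy()
    masked[mask] = mask_value          # replace masked points with a mask token
    return masked, mask

masked, mask = mask_sketch(sketch)
print("masked positions:", np.flatnonzero(mask))
```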
Towards Instance-level Image-to-Image Translation
Shen, Zhiqiang, Huang, Mingyang, Shi, Jianping, Xue, Xiangyang, Huang, Thomas
Unpaired image-to-image translation is an emerging and challenging vision problem that aims to learn a mapping between unaligned image pairs in diverse domains. Recent advances in this field, such as MUNIT and DRIT, mainly focus on first disentangling content and style/attribute from a given image, then directly adopting the global style to guide the model in synthesizing new-domain images. However, this kind of approach runs into severe contradictions when the target-domain images are content-rich, with multiple discrepant objects. In this paper, we present a simple yet effective instance-aware image-to-image translation approach (INIT), which applies fine-grained local (instance) and global styles to the target image spatially. The proposed INIT exhibits three important advantages: (1) the instance-level objective loss helps learn a more accurate reconstruction and incorporate the diverse attributes of objects; (2) the styles used for local/global areas of the target domain come from the corresponding spatial regions in the source domain, which is intuitively a more reasonable mapping; (3) the joint training process benefits both fine and coarse granularity and incorporates instance information to improve the quality of the global translation. We also collect a large-scale benchmark for the new instance-level translation task. We observe that our synthetic images can even benefit real-world vision tasks such as generic object detection.
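A minimal sketch of applying local (instance) and global styles spatially, in the spirit of the approach described above. The AdaIN-style normalization, box format, and feature shapes are assumptions chosen for illustration; the paper's actual networks and losses are not reproduced here.

```python
import numpy as np

def adain(content, style_mean, style_std, eps=1e-5):
    """Shift/scale content features to the given style's channel statistics."""
    mean = content.mean(axis=(1, 2), keepdims=True)
    std = content.std(axis=(1, 2), keepdims=True)
    return style_std * (content - mean) / (std + eps) + style_mean

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 32, 32))            # (channels, H, W) content features

# Global style statistics, plus a different style for one instance box.
g_mean, g_std = rng.normal(size=(8, 1, 1)), rng.random((8, 1, 1)) + 0.5
i_mean, i_std = rng.normal(size=(8, 1, 1)), rng.random((8, 1, 1)) + 0.5

out = adain(feat, g_mean, g_std)               # global style everywhere
y0, y1, x0, x1 = 8, 24, 8, 24                  # hypothetical instance box
out[:, y0:y1, x0:x1] = adain(feat[:, y0:y1, x0:x1], i_mean, i_std)
print(out.shape)
```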
Spatial Mixture Models with Learnable Deep Priors for Perceptual Grouping
Yuan, Jinyang, Li, Bin, Xue, Xiangyang
Humans perceive the seemingly chaotic world in a structured and compositional way, a prerequisite of which is the ability to segregate conceptual entities from complex visual scenes. The mechanism of grouping basic visual elements of scenes into conceptual entities is termed perceptual grouping. In this work, we propose a new type of spatial mixture model with learnable priors for perceptual grouping. Unlike existing methods, the proposed method disentangles the representation of an object into "shape" and "appearance", which are modeled separately by the mixture weights and the conditional probability distributions. More specifically, each object in the visual scene is modeled by one mixture component, whose mixture weights and conditional-distribution parameters are generated by two neural networks, respectively. The mixture weights focus on modeling spatial dependencies (i.e., shape), while the conditional probability distributions deal with intra-object variations (i.e., appearance). In addition, the background is modeled separately as a special component complementary to the foreground objects. Extensive empirical tests on two perceptual grouping datasets demonstrate that the proposed method outperforms state-of-the-art methods under most experimental configurations. The learned conceptual entities generalize to novel visual scenes and are insensitive to the diversity of objects.
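A toy sketch of the spatial mixture formulation: each object is one component, mixture weights ("shape") and component parameters ("appearance") come from separate sources, and the per-pixel likelihood mixes the components. The shapes, the Gaussian likelihood, and the softmax weights below are assumptions standing in for the paper's two neural networks.

```python
import numpy as np

rng = np.random.default_rng(0)
K, H, W = 3, 16, 16                      # components (objects + background), image size

logits = rng.normal(size=(K, H, W))      # would come from the "shape" network
weights = np.exp(logits) / np.exp(logits).sum(axis=0)   # per-pixel mixture weights
means = rng.random((K, H, W))            # would come from the "appearance" network
sigma = 0.1

image = rng.random((H, W))
# Gaussian component densities at every pixel, mixed by the spatial weights.
comp = np.exp(-0.5 * ((image - means) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
likelihood = (weights * comp).sum(axis=0)
print("mean per-pixel log-likelihood:", np.log(likelihood).mean())
```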
A Multi-task Neural Approach for Emotion Attribution, Classification and Summarization
Tu, Guoyun, Fu, Yanwei, Li, Boyang, Gao, Jiarui, Jiang, Yu-Gang, Xue, Xiangyang
Emotional content is a crucial ingredient of user-generated videos. However, emotions are expressed only sparsely in such videos, which makes emotion analysis difficult. In this paper, we propose a new neural approach, the Bi-stream Emotion Attribution-Classification Network (BEAC-Net), to solve three related emotion analysis tasks, emotion recognition, emotion attribution, and emotion-oriented summarization, in an integrated framework. BEAC-Net has two major constituents: an attribution network and a classification network. The attribution network extracts the main emotional segment that classification should focus on, in order to mitigate the sparsity problem. The classification network utilizes both the extracted segment and the original video in a bi-stream architecture. We contribute a new dataset for the emotion attribution task with human-annotated ground-truth labels for emotional segments. Experiments on two video datasets demonstrate the superior performance of the proposed framework and the complementary nature of the dual classification streams.
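A hedged sketch of the bi-stream idea: one stream processes the segment selected by an attribution score, the other the whole video, and their outputs are fused. The frame features, window scoring, and additive fusion below are toy assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, C = 30, 128, 8                       # frames, feature dim, emotion classes
video = rng.normal(size=(T, D))            # per-frame features

attn = rng.normal(size=(T,))               # stand-in for the attribution network's scores
start = int(np.argmax(np.convolve(attn, np.ones(5), mode="valid")))  # best 5-frame window
segment = video[start:start + 5]

W = rng.normal(size=(D, C)) * 0.01         # stand-in linear classifier
logits = segment.mean(axis=0) @ W + video.mean(axis=0) @ W   # fuse the two streams
print("predicted class:", int(np.argmax(logits)))
```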
MEAL: Multi-Model Ensemble via Adversarial Learning
Shen, Zhiqiang, He, Zhankui, Xue, Xiangyang
Often the best-performing deep neural models are ensembles of multiple base-level networks. Unfortunately, the space required to store these networks, and the time required to execute them at test time, prohibit their use in applications where test sets are large (e.g., ImageNet). In this paper, we present a method for compressing large, complex trained ensembles into a single network, where knowledge from a variety of trained deep neural networks (DNNs) is distilled and transferred to a single DNN. In order to distill diverse knowledge from different trained (teacher) models, we propose an adversarial learning strategy in which a block-wise training loss guides and optimizes the predefined student network to recover the knowledge in the teacher models, while a discriminator network is simultaneously trained to distinguish teacher from student features. The proposed ensemble method (MEAL), which transfers distilled knowledge via adversarial learning, exhibits three important advantages: (1) the student network that learns the distilled knowledge with discriminators is optimized better than the original model; (2) fast inference is realized by a single forward pass, while the performance is even better than that of traditional ensembles of multiple original models; (3) the student network can learn the distilled knowledge from a teacher model with an arbitrary structure. Extensive experiments on the CIFAR-10/100, SVHN, and ImageNet datasets demonstrate the effectiveness of our MEAL method. On ImageNet, our ResNet-50-based MEAL achieves 21.79%/5.99% top-1/top-5 validation error, outperforming the original model by 2.06%/1.14%. Code and models are available at: https://github.com/AaronHeee/MEAL
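A minimal sketch of the adversarial distillation loop described above: a discriminator learns to tell teacher features from student features, while the student learns both to match the teacher (a block-wise L2 term) and to fool the discriminator. Network sizes, losses, and data are toy assumptions, not MEAL's actual configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
teacher = nn.Linear(16, 32)                       # frozen stand-in for one teacher block
student = nn.Linear(16, 32)                       # student block to be trained
disc = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 1))

opt_s = torch.optim.Adam(student.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(100):
    x = torch.randn(64, 16)
    with torch.no_grad():
        t_feat = teacher(x)
    s_feat = student(x)

    # Discriminator: teacher features -> 1, student features -> 0.
    d_loss = bce(disc(t_feat), torch.ones(64, 1)) + \
             bce(disc(s_feat.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Student: match the teacher (block-wise L2) and fool the discriminator.
    s_loss = (s_feat - t_feat).pow(2).mean() + bce(disc(s_feat), torch.ones(64, 1))
    opt_s.zero_grad(); s_loss.backward(); opt_s.step()
```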
Learning to Separate Domains in Generalized Zero-Shot and Open Set Learning: a probabilistic perspective
Dong, Hanze, Fu, Yanwei, Sigal, Leonid, Hwang, Sung Ju, Jiang, Yu-Gang, Xue, Xiangyang
This paper studies the domain division problem, which aims to segment instances drawn from different probabilistic distributions. Such a problem arises in many previous recognition tasks, such as Open Set Learning (OSL) and Generalized Zero-Shot Learning (G-ZSL), where the testing instances come from either seen or novel/unseen classes with different probabilistic distributions. Previous works only calibrate the confident predictions of classifiers for seen classes (W-SVM, Scheirer et al. (2014)) or treat unseen classes as outliers (Socher et al. (2013)). In contrast, this paper proposes a probabilistic way of directly estimating and fine-tuning the decision boundary between seen and novel/unseen classes. In particular, we propose a domain division algorithm that learns to split the testing instances into known, unknown, and uncertain domains, and then conducts recognition tasks in each domain. Two statistical tools, bootstrapping and the Kolmogorov-Smirnov (KS) test, are for the first time introduced to uncover and fine-tune the decision boundary of each domain. Critically, the uncertain domain is newly introduced in our framework to accommodate those instances whose domain labels cannot be predicted confidently. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on OSL and G-ZSL benchmarks.
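A small sketch of the two statistical tools named above: a two-sample Kolmogorov-Smirnov test checks whether two groups of confidence scores come from the same distribution, and bootstrapping estimates a decision threshold from seen-class scores. The synthetic score distributions and the 5th-percentile boundary are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
seen_scores = rng.normal(0.8, 0.1, size=500)     # confidences on seen classes
unseen_scores = rng.normal(0.4, 0.15, size=500)  # confidences on novel classes

# Small p-value -> the two groups come from different distributions (domains).
print(ks_2samp(seen_scores, unseen_scores))

# Bootstrap the 5th percentile of seen-class scores as a domain boundary.
boots = [np.percentile(rng.choice(seen_scores, size=len(seen_scores), replace=True), 5)
         for _ in range(1000)]
threshold = float(np.mean(boots))
print("bootstrapped seen/unseen threshold:", threshold)
```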
Dual Skipping Networks
Cheng, Changmao, Fu, Yanwei, Jiang, Yu-Gang, Liu, Wei, Lu, Wenlian, Feng, Jianfeng, Xue, Xiangyang
Inspired by recent neuroscience studies on the left-right asymmetry of the human brain in processing low and high spatial-frequency information, this paper introduces a dual skipping network that carries out coarse-to-fine object categorization. The network has two branches that simultaneously deal with coarse and fine-grained classification tasks. Specifically, we propose a layer-skipping mechanism that learns a gating network to predict which layers to skip at test time. This layer-skipping mechanism endows the network with good flexibility and capability in practice. Evaluations are conducted on several widely used coarse-to-fine object categorization benchmarks, and promising results are achieved by our proposed network model.
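A hedged sketch of a layer-skipping gate: a tiny gating network inspects the incoming features and decides whether a residual block runs or is bypassed at test time. The gate design, the per-batch decision, and the 0.5 threshold are toy assumptions, not the paper's exact mechanism.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
block = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32))
gate = nn.Linear(32, 1)                       # gating network: features -> skip score

def gated_forward(x, threshold=0.5):
    """Run the block only when the gate fires; otherwise pass features through."""
    g = torch.sigmoid(gate(x.mean(dim=0)))    # one decision per batch, for simplicity
    return x + block(x) if g.item() > threshold else x

x = torch.randn(8, 32)
print(gated_forward(x).shape)
```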
Pose-Normalized Image Generation for Person Re-identification
Qian, Xuelin, Fu, Yanwei, Wang, Wenxuan, Xiang, Tao, Wu, Yang, Jiang, Yu-Gang, Xue, Xiangyang
Person re-identification (re-id) faces two major challenges: the lack of cross-view paired training data, and learning discriminative, identity-sensitive, and view-invariant features in the presence of large pose variations. In this work, we address both problems by proposing a novel deep person-image generation model for synthesizing realistic person images conditioned on pose. The model is based on a generative adversarial network (GAN) and is used specifically for pose normalization in re-id; we thus term it the pose-normalization GAN (PN-GAN). With the synthesized images, we can learn a new type of deep re-id feature free of the influence of pose variations. We show that this feature is strong on its own and highly complementary to features learned from the original images. Importantly, we now have a model that generalizes to any new re-id dataset without the need to collect any training data for model fine-tuning, making a deep re-id model truly scalable. Extensive experiments on five benchmarks show that our model outperforms state-of-the-art models, often significantly. In particular, the features learned on Market-1501 achieve a Rank-1 accuracy of 68.67% on VIPeR without any model fine-tuning, beating almost all existing models fine-tuned on that dataset.
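A minimal sketch of pose conditioning in the spirit of the model described above: the generator input concatenates the person image with a pose representation, here a stack of joint heatmaps. The 18-channel heatmap encoding, the input shapes, and the single-convolution stand-in for the generator are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
image = torch.randn(1, 3, 128, 64)        # person image (N, C, H, W)
pose = torch.randn(1, 18, 128, 64)        # 18 joint heatmaps for the target pose

generator = nn.Conv2d(3 + 18, 3, kernel_size=3, padding=1)  # stand-in for the GAN generator
out = generator(torch.cat([image, pose], dim=1))            # pose-conditioned synthesis
print(out.shape)                                            # torch.Size([1, 3, 128, 64])
```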
Vocabulary-informed Extreme Value Learning
Fu, Yanwei, Dong, HanZe, Ma, Yu-feng, Zhang, Zhengjun, Xue, Xiangyang
Novel unseen classes can be formulated as the extreme values of known classes. This has inspired recent work on open-set recognition (Scheirer et al., 2013; Scheirer et al., 2014; the Extreme Value Machine, EVM), which, however, has no way of naming the novel unseen classes. To solve this problem, we propose the Extreme Value Learning (EVL) formulation to learn the mapping from visual features to the semantic space. To model the margin and coverage distributions of each class, Vocabulary-informed Learning (ViL) is adopted, using a vast open vocabulary in the semantic space. Essentially, by combining EVL and ViL, we for the first time propose a novel semantic embedding paradigm, Vocabulary-informed Extreme Value Learning (ViEVL), which embeds visual features into the semantic space in a probabilistic way. The learned embedding can be used directly to solve supervised learning, zero-shot recognition, and open-set recognition simultaneously. Experiments on two benchmark datasets demonstrate the effectiveness of the proposed framework.
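A short sketch of the extreme-value idea underlying this line of work: fit a Weibull distribution to the largest distances from a class centre, so that a new sample's tail probability indicates how "extreme" (likely unseen) it is. The synthetic distances, the tail size of 50, and the test distance of 4.5 are assumptions for illustration.

```python
import numpy as np
from scipy.stats import weibull_min

rng = np.random.default_rng(0)
dists = rng.gamma(2.0, 1.0, size=1000)            # distances to a class centre
tail = np.sort(dists)[-50:]                       # largest distances (the extreme tail)

shape, loc, scale = weibull_min.fit(tail)
p_extreme = weibull_min.cdf(4.5, shape, loc=loc, scale=scale)
print("probability mass of the tail below a distance of 4.5:", p_extreme)
```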