Collaborating Authors

 Zhu, Zhenyao


Fully Supervised Speaker Diarization

arXiv.org Machine Learning

In this paper, we propose a fully supervised speaker diarization approach, named unbounded interleaved-state recurrent neural networks (UIS-RNN). Given extracted speaker-discriminative embeddings (a.k.a. d-vectors) from input utterances, each individual speaker is modeled by a parameter-sharing RNN, while the RNN states for different speakers interleave in the time domain. This RNN is naturally integrated with a distance-dependent Chinese restaurant process (ddCRP) to accommodate an unknown number of speakers. Our system is fully supervised and is able to learn from examples where time-stamped speaker labels are annotated. We achieved a 7.6% diarization error rate on NIST SRE 2000 CALLHOME, which is better than the state-of-the-art method using spectral clustering. Moreover, our method decodes in an online fashion while most state-of-the-art systems rely on offline clustering.
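The abstract does not spell out how the ddCRP assigns speakers, so the following is only a minimal sketch of the underlying idea, using a plain (not distance-dependent) Chinese restaurant process: an existing speaker is chosen with probability proportional to how often it has already appeared, and a new speaker with probability proportional to a concentration parameter. The function name, the `counts` argument, and `alpha` are illustrative assumptions, not identifiers from the paper.

```python
def crp_next_speaker_probs(counts, alpha):
    """Plain Chinese restaurant process prior over the next speaker.

    counts: how many times each existing speaker has appeared so far
            (hypothetical bookkeeping, not the paper's exact statistic).
    alpha:  concentration parameter (> 0); larger values favor new speakers.

    Returns one probability per existing speaker, followed by the
    probability of opening a new speaker as the last entry.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    total = sum(counts) + alpha
    probs = [c / total for c in counts]
    probs.append(alpha / total)  # probability of a previously unseen speaker
    return probs


# Two known speakers seen 3 times and 1 time, with alpha = 1.0.
print(crp_next_speaker_probs([3, 1], 1.0))  # [0.6, 0.2, 0.2]
```

Because the last entry is never zero, the number of speakers does not have to be fixed in advance, which is the property the abstract relies on to handle an unknown number of speakers.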


Principled Hybrids of Generative and Discriminative Domain Adaptation

arXiv.org Artificial Intelligence

We propose a probabilistic framework for domain adaptation that blends generative and discriminative modeling in a principled way. Under this framework, generative and discriminative models correspond to specific choices of the prior over parameters. This provides a very general way to interpolate between the generative and discriminative extremes through different choices of priors. By maximizing both the marginal and the conditional log-likelihoods, models derived from this framework can use labeled instances from the source domain as well as unlabeled instances from both the source and target domains. Under this framework, we show that the popular reconstruction loss of autoencoders corresponds to an upper bound of the negative marginal log-likelihood of unlabeled instances, where the marginal distributions are given by proper kernel density estimations. This provides a way to interpret the empirical success of autoencoders in domain adaptation and semi-supervised learning. We instantiate our framework using neural networks and build a concrete model, DAuto. Empirically, we demonstrate the effectiveness of DAuto on text, image, and speech datasets, showing that it outperforms related competitors when domain adaptation is possible.
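As a rough illustration of combining the two likelihood terms, the sketch below adds a supervised cross-entropy term (standing in for the negative conditional log-likelihood on labeled source data) to a squared reconstruction error (standing in for the bound on the negative marginal log-likelihood of unlabeled data). The trade-off `weight`, the function names, and the data layout are assumptions made for this example; the abstract does not describe DAuto's actual objective in this detail.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability that the model assigns to the true label."""
    return -math.log(probs[label])


def reconstruction_error(x, x_hat):
    """Squared error between an input and its autoencoder reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def hybrid_loss(labeled, unlabeled, weight):
    """Supervised term plus a weighted unsupervised reconstruction term.

    labeled:   list of (predicted_probs, label) pairs from the source domain.
    unlabeled: list of (x, x_hat) pairs from the source and target domains.
    weight:    hypothetical trade-off between the two terms.
    """
    supervised = sum(cross_entropy(p, y) for p, y in labeled) / len(labeled)
    unsupervised = sum(
        reconstruction_error(x, x_hat) for x, x_hat in unlabeled
    ) / len(unlabeled)
    return supervised + weight * unsupervised


labeled = [([0.5, 0.5], 0)]             # one labeled source example
unlabeled = [([1.0, 0.0], [0.5, 0.0])]  # one unlabeled example and its reconstruction
print(hybrid_loss(labeled, unlabeled, weight=2.0))
```

Setting `weight` to zero leaves a purely discriminative objective, while larger values lean on the unlabeled data, which mirrors the interpolation between the two extremes described above.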


Face Model Compression by Distilling Knowledge from Neurons

AAAI Conferences

Recent advanced face recognition systems were built on large Deep Neural Networks (DNNs) or their ensembles, which have millions of parameters. However, the expensive computation of DNNs makes their deployment difficult on mobile and embedded devices. This work addresses model compression for face recognition, where the learned knowledge of a large teacher network or its ensemble is used as supervision to train a compact student network. Unlike previous works that represent the knowledge by the softened label probabilities, which are difficult to fit, we represent the knowledge by using the neurons at the higher hidden layer, which preserve as much information as the label probabilities but are more compact. By leveraging the essential characteristics (domain knowledge) of the learned face representation, a neuron selection method is proposed to choose the neurons that are most relevant to face recognition. Using the selected neurons as supervision to mimic the single networks of DeepID2+ and DeepID3, which are the state-of-the-art face recognition systems, a compact student with a simple network structure achieves better verification accuracy on LFW than each of its respective teachers. When an ensemble of DeepID2+ is used as the teacher, a mimicked student is able to outperform it and achieves a 51.6 times compression ratio and a 90 times speed-up in inference, making this cumbersome model applicable on portable devices.
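The idea of supervising a student with selected teacher neurons can be sketched as a regression loss. The example below assumes the selection step has already produced a list of neuron indices (the abstract does not give the selection criterion) and uses a mean squared error, which is an assumption of this sketch and not a loss confirmed by the source. All names are hypothetical.

```python
def mimic_loss(student_features, teacher_neurons, selected):
    """Mean squared error between student outputs and selected teacher neurons.

    student_features: student network outputs, one value per selected neuron.
    teacher_neurons:  activations of the teacher's higher hidden layer.
    selected:         indices of the teacher neurons kept as supervision
                      (assumed to come from a separate selection step).
    """
    targets = [teacher_neurons[i] for i in selected]
    if len(student_features) != len(targets):
        raise ValueError("student output size must match the number of selected neurons")
    squared = [(s - t) ** 2 for s, t in zip(student_features, targets)]
    return sum(squared) / len(targets)


teacher = [0.2, -1.0, 0.7, 0.0]  # teacher hidden-layer activations for one face
selected = [0, 2]                # keep neurons 0 and 2 as regression targets
student = [0.1, 0.9]             # student predictions for those two neurons
print(mimic_loss(student, teacher, selected))  # roughly 0.025
```

Regressing onto a small set of hidden activations gives the student a denser target than a single class label, which is in line with the abstract's argument that these neurons are more compact than label probabilities while preserving as much information.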


Multi-View Perceptron: a Deep Model for Learning Face Identity and View Representations

Neural Information Processing Systems

Various factors, such as identities, views (poses), and illuminations, are coupled in face images. Disentangling the identity and view representations is a major challenge in face recognition. Existing face recognition systems either use handcrafted features or learn features discriminatively to improve recognition accuracy. This differs from the behavior of the human brain. Intriguingly, even without access to 3D data, humans can not only recognize face identity but also imagine face images of a person under different viewpoints given a single 2D image, making face perception in the brain robust to view changes. In this sense, the human brain has learned and encoded 3D face models from 2D images. To take this ability into account, this paper proposes a novel deep neural net, named multi-view perceptron (MVP), which can untangle the identity and view features and, at the same time, infer a full spectrum of multi-view images given a single 2D face image. The identity features of MVP achieve superior performance on the MultiPIE dataset. MVP is also capable of interpolating and predicting images under viewpoints that are unobserved in the training data.