"Computers have been getting better and better at seeing movement on video. How is it that they read lips, follow a dancing girl or copy an actor making faces?"
– from Andrew Blake. Introduction to Active Contours and Visual Dynamics. Visual Dynamics Group, Department of Engineering Science, University of Oxford
Union of Subspaces (UoS) model serves as an important model i n statistical machine learning. Briefly speaking, UoS models those high-dimensional da ta, encountered in many real-world problems, which lie close to low-dimensional subspaces corresponding to several classes to which the data belong, such as handwritten digits (Hasti e and Simard, 1998), face images (Basri and Jacobs, 2003), DNA microarray data (Parvare sh et al., 2008), and hyper-spectral images (Chen et al., 2011), to name just a few. A fund amental task in processing data points in UoS is to cluster these data points, which is kn own as Subspace Clustering (SC). Applications of SC has spanned all over science and eng ineering, including motion segmentation (Costeira and Kanade, 1998; Kanatani, 2001), face recognition (Wright et al., 2008), and classification of diseases (McWilliams and Monta na, 2014) and so on. We refer the reader to the tutorial paper (Vidal, 2011) for a review of the development of SC. The authors are with Department of Electronic Engineering, Tsinghua University, Beijing 100084, China. The corresponding author of this paper is Y. Gu (gyt@tsinghu a.edu.cn).
We introduce a simple framework for identifying biases of a smiling attribute classifier. Our method poses counterfactual questions of the form: how would the prediction change if this face characteristic had been different? We leverage recent advances in generative adversarial networks to build a realistic generative model of face images that affords controlled manipulation of specific image characteristics. We introduce a set of metrics that measure the effect of manipulating a specific property of an image on the output of a trained classifier. Empirically, we identify several different factors of variation that affect the predictions of a smiling classifier trained on CelebA.
Thanks to recent advances in Deep Neural Networks (DNNs), face recognition systems have achieved high accuracy in classification of a large number of face images. However, recent works demonstrate that DNNs could be vulnerable to adversarial examples and raise concerns about robustness of face recognition systems. In particular adversarial examples that are not restricted to small perturbations could be more serious risks since conventional certified defenses might be ineffective against them. To shed light on the vulnerability to this type of adversarial examples, we propose a flexible and efficient method to generate unrestricted adversarial examples using image translation techniques. Our method enables us to translate a source image into any desired facial appearance with large perturbations so that target face recognition systems could be deceived. Through our experiments, we demonstrate that our method achieves about 90% and 30% attack success rates under a white- and black-box setting, respectively. We also illustrate that our translated images are perceptually realistic and maintain personal identity while the perturbations are large enough to bypass certified defenses.
Facial expression recognition in videos is an active area of research in computer vision. However, fake facial expressions are difficult to be recognized even by humans. On the other hand, facial micro-expressions generally represent the actual emotion of a person, as it is a spontaneous reaction expressed through human face. Despite of a few attempts made for recognizing micro-expressions, still the problem is far from being a solved problem, which is depicted by the poor rate of accuracy shown by the state-of-the-art methods. A few CNN based approaches are found in the literature to recognize micro-facial expressions from still images. Whereas, a spontaneous micro-expression video contains multiple frames that have to be processed together to encode both spatial and temporal information. This paper proposes two 3D-CNN methods: MicroExpSTCNN and MicroExpFuseNet, for spontaneous facial micro-expression recognition by exploiting the spatiotemporal information in CNN framework. The MicroExpSTCNN considers the full spatial information, whereas the MicroExpFuseNet is based on the 3D-CNN feature fusion of the eyes and mouth regions. The experiments are performed over CAS(ME)^2 and SMIC micro-expression databases. The proposed MicroExpSTCNN model outperforms the state-of-the-art methods.
Automatic understanding of human affect using visual signals is a problem that has attracted significant interest over the past 20 years. However, human emotional states are quite complex. To appraise such states displayed in real-world settings, we need expressive emotional descriptors that are capable of capturing and describing this complexity. The circumplex model of affect, which is described in terms of valence (i.e., how positive or negative is an emotion) and arousal (i.e., power of the activation of the emotion), can be used for this purpose. Recent progress in the emotion recognition domain has been achieved through the development of deep neural architectures and the availability of very large training databases. To this end, Aff-Wild has been the first large-scale "in-the-wild" database, containing around 1,200,000 frames. In this paper, we build upon this database, extending it with 260 more subjects and 1,413,000 new video frames. We call the union of Aff-Wild with the additional data, Aff-Wild2. The videos are downloaded from Youtube and have large variations in pose, age, illumination conditions, ethnicity and profession. Both database-specific as well as cross-database experiments are performed in this paper, by utilizing the Aff-Wild2, along with the RECOLA database. The developed deep neural architectures are based on the joint training of state-of-the-art convolutional and recurrent neural networks with attention mechanism; thus exploiting both the invariant properties of convolutional features, while modeling temporal dynamics that arise in human behaviour via the recurrent layers. The obtained results show premise for utilization of the extended Aff-Wild, as well as of the developed deep neural architectures for visual analysis of human behaviour in terms of continuous emotion dimensions.
In this paper we investigate the feasibility of using synthetic data to augment face datasets. In particular, we propose a novel generative adversarial network (GAN) that can disentangle identity-related attributes from non-identity-related attributes. This is done by training an embedding network that maps discrete identity labels to an identity latent space that follows a simple prior distribution, and training a GAN conditioned on samples from that distribution. Our proposed GAN allows us to augment face datasets by generating both synthetic images of subjects in the training set and synthetic images of new subjects not in the training set. By using recent advances in GAN training, we show that the synthetic images generated by our model are photo-realistic, and that training with augmented datasets can indeed increase the accuracy of face recognition models as compared with models trained with real images alone.
US lawmakers have asked the Government Accountability Office to examine how face recognition technology is being used by companies and law enforcement agencies. The questioners: A group of Democrats from both the House of Representatives and the Senate sent a letter to the GAO asking to examine which agencies are using the technology, and what safeguards the industry has in place. Some form of government regulation could eventually be imposed. Eye spies: There is growing concern that unfettered use of facial recognition could enable greater government surveillance and automate discrimination. Some companies also appear concerned.
Whether you're interested in learning how to apply facial recognition to video streams, building a complete deep learning pipeline for image classification, or simply want to tinker with your Raspberry Pi and add image recognition to a hobby project, you'll need to learn OpenCV somewhere along the way. The truth is that learning OpenCV used to be quite challenging. The documentation was hard to navigate. The tutorials were hard to follow and incomplete. And even some of the books were a bit tedious to work through. The good news is learning OpenCV isn't as hard as it used to be. And in fact, I'll go as far as to say studying OpenCV has become significantly easier. And to prove it to you (and help you learn OpenCV), I've put together this complete guide to learning the fundamentals of the OpenCV library using the Python programming language. Let's go ahead and get started learning the basics of OpenCV and image processing. By the end of today's blog post, you'll understand the fundamentals of OpenCV.
Patient pain can be detected highly reliably from facial expressions using a set of facial muscle-based action units (AUs) defined by the Facial Action Coding System (FACS). A key characteristic of facial expression of pain is the simultaneous occurrence of pain-related AU combinations, whose automated detection would be highly beneficial for efficient and practical pain monitoring. Existing general Automated Facial Expression Recognition (AFER) systems prove inadequate when applied specifically for detecting pain as they either focus on detecting individual pain-related AUs but not on combinations or they seek to bypass AU detection by training a binary pain classifier directly on pain intensity data but are limited by lack of enough labeled data for satisfactory training. In this paper, we propose a new approach that mimics the strategy of human coders of decoupling pain detection into two consecutive tasks: one performed at the individual video-frame level and the other at video-sequence level. Using state-of-the-art AFER tools to detect single AUs at the frame level, we propose two novel data structures to encode AU combinations from single AU scores. Two weakly supervised learning frameworks namely multiple instance learning (MIL) and multiple clustered instance learning (MCIL) are employed corresponding to each data structure to learn pain from video sequences. Experimental results show an 87% pain recognition accuracy with 0.94 AUC (Area Under Curve) on the UNBC-McMaster Shoulder Pain Expression dataset. Tests on long videos in a lung cancer patient video dataset demonstrates the potential value of the proposed system for pain monitoring in clinical settings.
Constrained Local Models (CLMs) are a well-established family of methods for facial landmark detection. However, they have recently fallen out of favor to cascaded regression-based approaches. This is in part due to the inability of existing CLM local detectors to model the very complex individual landmark appearance that is affected by expression, illumination, facial hair, makeup, and accessories. In our work, we present a novel local detector -- Convolutional Experts Network (CEN) -- that brings together the advantages of neural architectures and mixtures of experts in an end-to-end framework. We further propose a Convolutional Experts Constrained Local Model (CE-CLM) algorithm that uses CEN as local detectors. We demonstrate that our proposed CE-CLM algorithm outperforms competitive state-of-the-art baselines for facial landmark detection by a large margin on four publicly-available datasets. Our approach is especially accurate and robust on challenging profile images.