Facebook machine learning aims to modify faces, hands and… outfits – TechCrunch


The latest research out of Facebook sets machine learning models to tasks that, to us, seem rather ordinary -- but for a computer are still monstrously difficult. These projects aim to anonymize faces, improvise hand movements and -- perhaps hardest of all -- give credible fashion advice. The research here was presented recently at the International Conference on Computer Vision, among a few dozen other papers from the company, which has invested heavily in AI research, computer vision in particular. Modifying faces in motion is something we've all come to associate with "deepfakes" and other nefarious applications. But the Facebook team felt there was actually a potentially humanitarian application of the technology.

Using deep neural networks for accurate hand-tracking on Oculus Quest


Researchers and engineers from Facebook Reality Labs and Oculus have developed what is, as of today, the only fully articulated hand-tracking system for VR that relies entirely on monochrome cameras. The system does not use active depth-sensing technology or any additional equipment (such as instrumented gloves). We will deploy this technology as a software update for Oculus Quest, the cable-free, stand-alone VR headset that is now available to consumers. By using Quest's four cameras in conjunction with new techniques in deep learning and model-based tracking, we achieve a larger interaction volume for hand-tracking than depth-based solutions do, and we do it at a fraction of the size, weight, power, and cost. Processing is done entirely on-device, and the system is optimized to support gestures for interaction, such as pointing and pinch to select.
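The "pinch to select" gesture mentioned above reduces, at its simplest, to thresholding the distance between two tracked fingertip positions. A minimal sketch of that idea (the function name, threshold, and coordinates are illustrative assumptions, not the Quest API):

```python
import numpy as np

def is_pinching(thumb_tip, index_tip, threshold=0.02):
    """Toy pinch detector: fires when the thumb and index fingertips
    (3-D positions in meters, e.g. from a hand-tracking skeleton)
    come within `threshold` of each other."""
    gap = np.linalg.norm(np.asarray(thumb_tip) - np.asarray(index_tip))
    return float(gap) < threshold

print(is_pinching([0.0, 0.0, 0.0], [0.005, 0.0, 0.0]))  # True: 5 mm apart
print(is_pinching([0.0, 0.0, 0.0], [0.10, 0.0, 0.0]))   # False: 10 cm apart
```

A real system would, of course, also smooth the tracked positions over time and add hysteresis so the gesture does not flicker at the threshold.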

Learning Multiparametric Biomarkers for Assessing MR-Guided Focused Ultrasound Treatments Using Volume-Conserving Registration Machine Learning

Noninvasive MR-guided focused ultrasound (MRgFUS) treatments are promising alternatives to the surgical removal of malignant tumors. A significant challenge is assessing the treated tissue immediately after MRgFUS procedures. Although current clinical assessment uses the immediate nonperfused volume (NPV) biomarker derived from contrast-enhanced imaging, the use of contrast agent prevents continuing MRgFUS treatment if margins are not adequate. In addition, the NPV has been shown to provide variable accuracy for the true treatment outcome as evaluated by follow-up biomarkers. This work presents a novel, noncontrast, learned multiparametric MR biomarker that is conducive to intratreatment assessment. MRgFUS ablations were performed in a rabbit VX2 tumor model. Multiparametric MRI was obtained both during and immediately after the MRgFUS ablation, as well as during follow-up imaging. Segmentation of the NPV obtained during follow-up imaging was used to train a neural network on noncontrast multiparametric MR images. The NPV follow-up segmentation was registered to treatment-day images using a novel volume-conserving registration algorithm, allowing a voxel-wise correlation between imaging sessions. Whereas state-of-the-art registration algorithms change the average volume by 16.8%, the presented volume-conserving registration algorithm changes it by only 0.28%. After registration, the learned multiparametric MR biomarker predicted the follow-up NPV with an average DICE coefficient of 0.71, outperforming the DICE coefficient of 0.53 achieved by the current standard, the NPV obtained immediately after the ablation treatment. Noncontrast multiparametric MR imaging can therefore provide a more accurate prediction of treated tissue immediately after treatment, potentially leading to more efficacious MRgFUS ablation treatments.
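The DICE coefficient used to score the predicted treatment volume measures overlap between two binary segmentation masks: twice the intersection divided by the sum of the two mask sizes. A minimal sketch with toy masks (the arrays are illustrative, not data from the study):

```python
import numpy as np

def dice_coefficient(pred, target):
    """DICE overlap between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: define as perfect agreement
    return 2.0 * np.logical_and(pred, target).sum() / denom

# Two toy 4x4 segmentation masks with 3 overlapping voxels
a = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0]])
b = np.array([[1, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0]])
print(round(dice_coefficient(a, b), 3))  # 2*3/(4+3) = 0.857
```

A score of 1.0 means the predicted and follow-up volumes coincide exactly; the paper's 0.71 versus 0.53 comparison is on this scale.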

A New Framework for Multi-Agent Reinforcement Learning -- Centralized Training and Exploration with Decentralized Execution via Policy Distillation Machine Learning

Deep reinforcement learning (DRL) is a booming area of artificial intelligence. Many practical applications of DRL naturally involve more than one collaborative learner, making it important to study DRL in a multi-agent context. Previous research showed that effective learning in complex multi-agent systems demands highly coordinated environment exploration among all participating agents. Many researchers have attempted to cope with this challenge by learning centralized value functions. However, the common strategy of having every agent learn its local policy directly often fails to nurture strong inter-agent collaboration and can be sample inefficient whenever agents alter their communication channels. To address these issues, we propose a new framework known as centralized training and exploration with decentralized execution via policy distillation. Guided by this framework and the maximum-entropy learning technique, we first train agents' policies with a shared global component to foster coordinated and effective learning. Locally executable policies are subsequently derived from the trained global policies via policy distillation. Experiments show that our new framework and algorithm can achieve significantly better performance and higher sample efficiency than a cutting-edge baseline on several multi-agent DRL benchmarks.
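The distillation step the abstract describes amounts to fitting a locally executable "student" policy so that its action distribution matches a centrally trained "teacher" policy, typically by minimizing a KL divergence. A minimal single-state sketch (the logits, learning rate, and step count are illustrative assumptions, not the paper's algorithm):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q, eps=1e-12):
    """KL(p || q) between discrete action distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# A centrally trained "teacher" policy over 3 actions (toy numbers).
teacher = softmax(np.array([2.0, 0.5, -1.0]))

# Distill: fit a "student" by gradient descent on KL(teacher || student).
# For a softmax student, the gradient of this KL with respect to the
# student logits is simply (student_probs - teacher_probs).
student_logits = np.zeros(3)
for _ in range(1000):
    student_logits -= 1.0 * (softmax(student_logits) - teacher)

# The student's action distribution converges to the teacher's.
student = softmax(student_logits)
```

In the multi-agent setting, the same objective would be averaged over states visited by the shared global policy, giving each agent a compact policy it can execute without the global component.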

Unsupervised learning of landmarks by Descriptor Vector Exchange


Equivariance to random image transformations is an effective method to learn landmarks of object categories, such as the eyes and the nose in faces, without manual supervision. However, this method does not explicitly guarantee that the learned landmarks are consistent across different instances of the same object category, such as different facial identities. In this paper, we develop a new perspective on the equivariance approach by noting that dense landmark detectors can be interpreted as local image descriptors equipped with invariance to intra-category variations. We then propose a direct method to enforce such an invariance in the standard equivariant loss. We do so by exchanging descriptor vectors between images of different object instances prior to matching them geometrically.
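The exchange idea can be sketched concretely: before the geometric matching step, each source descriptor is rewritten as a similarity-weighted combination of descriptors from a *different* object instance, so that matching only succeeds if descriptors are invariant across instances. A toy NumPy version (the descriptor construction, temperature, and sizes are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(d):
    return d / np.linalg.norm(d, axis=-1, keepdims=True)

# Toy dense descriptors for K locations: two instances of the same object
# category share a per-part "semantic" component but differ in
# instance-specific appearance.
K, C = 16, 32
parts = rng.normal(size=(K, C))
src = normalize(parts + 0.3 * rng.normal(size=(K, C)))  # instance A
aux = normalize(parts + 0.3 * rng.normal(size=(K, C)))  # instance B
tgt = src  # a transformed copy of A (identity warp keeps the toy simple)

# Descriptor vector exchange: express each source descriptor through the
# auxiliary instance's descriptors via a softmax over similarities.
weights = np.exp(10.0 * (src @ aux.T))
weights /= weights.sum(axis=1, keepdims=True)
exchanged = normalize(weights @ aux)

# Geometric matching now uses the exchanged descriptors.
matches = (exchanged @ tgt.T).argmax(axis=1)
```

Because the exchanged descriptors are built entirely from instance B, a correct match back to instance A is only possible when the shared, category-level part information survives the exchange, which is exactly the invariance the loss is meant to enforce.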

Samsung AI Makes the Mona Lisa 'Speak'


Imagine that the lips forming the Mona Lisa's famous smile were to part, and she began "speaking" to you. This is not some sci-fi fantasy or a 3D face animation; it's an effect achieved by researchers from the Samsung AI lab and the Skolkovo Institute of Science and Technology, who used adversarial learning to generate a photorealistic talking-head model. AI techniques have already been used to generate realistic video of people like former US President Barack Obama and movie star Scarlett Johansson, enabled in large part by the abundance of available visual data on these individuals. The new research, however, shows it is also possible to generate realistic content when source images are rare. The researchers applied their Few-Shot Adversarial Learning technique to one of the most widely recognized humans in history, known through a single image: Lisa Gherardini, the subject of Leonardo da Vinci's classic 16th-century portrait.

Stochastic Triangular Mesh Mapping Machine Learning

For mobile robots to operate autonomously in general environments, perception is required in the form of a dense metric map. For this purpose, we present the stochastic triangular mesh (STM) mapping technique: a 2.5-D representation of the surface of the environment using a continuous mesh of triangular surface elements, where each surface element models the mean plane and roughness of the underlying surface. In contrast to existing mapping techniques, an STM map models the structure of the environment by ensuring a continuous model, while also being able to be incrementally updated with linear computational cost in the number of measurements. We reduce the effect of uncertainty in the robot pose (position and orientation) by using landmark-relative submaps. The uncertainty in the measurements and robot pose is accounted for by the use of Bayesian inference techniques during the map update. We demonstrate that an STM map can be used with sensors that generate point measurements, such as light detection and ranging (LiDAR) sensors and stereo cameras. We show that an STM map is a more accurate model than the only comparable online surface mapping technique -- a standard elevation map -- and we also provide qualitative results on practical datasets.
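The core per-element state -- a mean plane plus a roughness term, updated incrementally with linear cost per measurement -- can be illustrated with a simple least-squares surface element. This is a rough sketch of that idea only, not the paper's Bayesian formulation (the class, ridge term, and toy data are all illustrative assumptions):

```python
import numpy as np

class SurfaceElement:
    """Toy 2.5-D surface element: incrementally fits a mean plane
    z = a*x + b*y + c to point measurements and tracks roughness
    (residual variance). Each update touches only fixed-size
    sufficient statistics, so cost is linear in the measurements."""
    def __init__(self):
        self.A = np.zeros((3, 3))  # sum of phi phi^T, phi = [x, y, 1]
        self.b = np.zeros(3)       # sum of phi * z
        self.zz = 0.0              # sum of z^2
        self.n = 0

    def update(self, x, y, z):
        phi = np.array([x, y, 1.0])
        self.A += np.outer(phi, phi)
        self.b += phi * z
        self.zz += z * z
        self.n += 1

    def plane(self):
        # Small ridge term keeps the solve well-posed with few points.
        return np.linalg.solve(self.A + 1e-9 * np.eye(3), self.b)

    def roughness(self):
        # Residual variance of the fit: (sum z^2 - w.b) / n at the optimum.
        w = self.plane()
        return max(0.0, (self.zz - w @ self.b) / max(self.n, 1))

elem = SurfaceElement()
rng = np.random.default_rng(1)
for _ in range(500):
    x, y = rng.uniform(-1, 1, size=2)
    z = 0.5 * x - 0.2 * y + 1.0 + rng.normal(scale=0.05)  # noisy plane
    elem.update(x, y, z)
```

After 500 noisy measurements the element recovers the underlying plane coefficients and a roughness close to the injected noise variance; the actual STM map additionally constrains neighboring triangles to share vertices so the surface stays continuous.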

Exploiting multi-CNN features in CNN-RNN based Dimensional Emotion Recognition on the OMG in-the-wild Dataset Machine Learning

This paper presents a novel CNN-RNN based approach that exploits multiple CNN features for dimensional emotion recognition in-the-wild, utilizing the One-Minute Gradual-Emotion (OMG-Emotion) dataset. Our approach begins with pre-training on the relevant, large-scale Aff-Wild and Aff-Wild2 emotion databases. Low-, mid- and high-level features are extracted from the trained CNN component and are exploited by RNN subnets in a multi-task framework. Their outputs constitute intermediate-level predictions; final estimates are obtained as the mean or median values of these predictions. Fusion of the networks is also examined for boosting performance, at either the decision level or the model level; in the latter case an RNN was used for the fusion. Our approach, although using only the visual modality, outperformed state-of-the-art methods that utilized both audio and visual modalities. Some of our developments were submitted to the OMG-Emotion Challenge, ranking second among the technologies that used only visual information for valence estimation and third overall. Through extensive experimentation, we further show that arousal estimation is greatly improved when low-level features are combined with high-level ones.
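The "mean or median of intermediate predictions" step is simple to picture: each RNN subnet emits a per-frame estimate, and the final value aggregates them. A minimal sketch (the subnet names and numbers are illustrative, not results from the paper):

```python
import numpy as np

# Hypothetical per-frame valence predictions from three RNN subnets that
# read low-, mid- and high-level CNN features respectively.
low  = np.array([0.10, 0.30, 0.52, 0.41])
mid  = np.array([0.15, 0.25, 0.47, 0.55])
high = np.array([0.05, 0.35, 0.60, 0.45])

stack = np.stack([low, mid, high])
mean_fused   = stack.mean(axis=0)        # final estimate: mean ...
median_fused = np.median(stack, axis=0)  # ... or median of the subnets
```

Model-level fusion, by contrast, would feed the subnet outputs into a further trained network (an RNN in the paper) rather than a fixed statistic.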



This paper proposes a method for head pose estimation from a single image. Previous methods often predict head pose through landmark or depth estimation and thus require more computation than necessary. Our method is based on regression and feature aggregation. To keep the model compact, we employ the soft stagewise regression scheme. Existing feature aggregation methods treat inputs as a bag of features and thus ignore their spatial relationships in a feature map.
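Soft stagewise regression replaces a single hard classification of the angle with a sequence of soft, coarse-to-fine bin classifications whose expected values are summed. The sketch below is a heavily simplified, one-angle illustration of that scheme under stated assumptions (bin layout, shift handling, and the default range are guesses, not the paper's exact formulation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_stagewise_regression(stage_logits, stage_shifts, full_range=198.0):
    """Toy soft stagewise regression for one angle.

    Each stage softly classifies the angle into a few bins; the final
    estimate accumulates each stage's expected bin index at a
    progressively finer angular scale."""
    angle = 0.0
    width = full_range
    for logits, shift in zip(stage_logits, stage_shifts):
        k = len(logits)
        probs = softmax(np.asarray(logits, dtype=float))
        # Bin indices centered at zero, with a (learned) soft shift.
        idx = np.arange(k) - (k - 1) / 2 + shift
        width /= k
        angle += width * float(probs @ idx)
    return angle
```

With uniform logits every stage contributes zero, so the estimate sits at the center of the range; pushing probability mass toward a stage's last bin moves the estimate positive. The compactness comes from each stage needing only a few output units rather than one unit per fine-grained angle bin.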

Video Interpolation and Prediction with Unsupervised Landmarks Machine Learning

Prediction and interpolation for long-range video data involves the complex task of modeling motion trajectories for each visible object, occlusions and dis-occlusions, as well as appearance changes due to viewpoint and lighting. Optical-flow-based techniques generalize but are suitable only for short temporal ranges. Many methods opt to project the video frames to a low-dimensional latent space, achieving long-range predictions. However, these latent representations are often non-interpretable, and therefore difficult to manipulate. This work poses video prediction and interpolation as unsupervised latent structure inference followed by temporal prediction in this latent space. The latent representations capture foreground semantics without explicit supervision such as keypoints or poses. Further, as each landmark can be mapped to a coordinate indicating where a semantic part is positioned, we can reliably interpolate within the coordinate domain to achieve predictable motion interpolation. Given an image decoder capable of mapping these landmarks back to the image domain, we are able to achieve high-quality long-range video interpolation and extrapolation by operating on the landmark representation space.
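The reason interpretable landmarks make interpolation "predictable" is that, unlike an opaque latent code, landmark coordinates can be blended directly. A minimal sketch of the coordinate-domain interpolation step (the frames and landmark values are illustrative; the decoder that maps landmarks back to pixels is a learned network and is not shown):

```python
import numpy as np

def interpolate_landmarks(lm_a, lm_b, t):
    """Linearly interpolate landmark coordinates between two frames.

    lm_a, lm_b: (K, 2) arrays of (x, y) landmark positions; t in [0, 1].
    Each interpolated set would then be fed to the image decoder to
    render the in-between frame."""
    return (1.0 - t) * lm_a + t * lm_b

# Two toy frames, each described by 2 landmarks.
frame0 = np.array([[10.0, 20.0], [30.0, 40.0]])
frame1 = np.array([[14.0, 24.0], [30.0, 48.0]])
mid = interpolate_landmarks(frame0, frame1, 0.5)  # [[12, 22], [30, 44]]
```

Extrapolation (prediction) follows the same pattern with t > 1, or with a temporal model fitted over the landmark trajectories instead of a straight line.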