Learning Spatial and Spatio-Temporal Pixel Aggregations for Image and Video Denoising
Xu, Xiangyu, Li, Muchen, Sun, Wenxiu, Yang, Ming-Hsuan
Existing denoising methods typically restore clean results by aggregating pixels from the noisy input. Instead of relying on hand-crafted aggregation schemes, we propose to explicitly learn this process with deep neural networks. We present a spatial pixel aggregation network that learns the pixel sampling and averaging strategies for image denoising. The proposed model naturally adapts to image structures and effectively improves the denoised results. Furthermore, we develop a spatio-temporal pixel aggregation network for video denoising that efficiently samples pixels across the spatio-temporal space, mitigating the misalignment issues caused by large motion in dynamic scenes. In addition, we introduce a new regularization term for effectively training the proposed video denoising model. We present extensive analysis of the proposed method and demonstrate that our model performs favorably against state-of-the-art image and video denoising approaches on both synthetic and real-world data.
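To make the aggregation idea concrete, here is a minimal PyTorch sketch of learned pixel averaging: a small network predicts per-pixel weights over a regular k x k neighborhood, and each output pixel is the weighted average of its neighbors. This is an illustration rather than the authors' implementation; the paper additionally learns where to sample, whereas the sketch fixes the sampling pattern, and `weight_net` and the kernel size are hypothetical choices.

```python
import torch
import torch.nn.functional as F

def aggregate(noisy, weights, k=5):
    """Weighted average of each pixel's k x k neighborhood.

    noisy:   (B, C, H, W) noisy input image
    weights: (B, k*k, H, W) predicted per-pixel aggregation weights
    """
    B, C, H, W = noisy.shape
    # Extract the k*k neighbors of every spatial location.
    patches = F.unfold(noisy, kernel_size=k, padding=k // 2)   # (B, C*k*k, H*W)
    patches = patches.view(B, C, k * k, H, W)
    # Normalize the weights so each output pixel is a convex combination.
    w = torch.softmax(weights, dim=1).unsqueeze(1)             # (B, 1, k*k, H, W)
    return (patches * w).sum(dim=2)                            # (B, C, H, W)

# A tiny stand-in for the weight-prediction branch (hypothetical architecture).
weight_net = torch.nn.Sequential(
    torch.nn.Conv2d(3, 32, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(32, 25, 3, padding=1),                     # 25 = 5*5 weights per pixel
)

noisy = torch.rand(1, 3, 64, 64)
denoised = aggregate(noisy, weight_net(noisy), k=5)
```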
Multi-path Neural Networks for On-device Multi-domain Visual Classification
Wang, Qifei, Ke, Junjie, Greaves, Joshua, Chu, Grace, Bender, Gabriel, Sbaiz, Luciano, Go, Alec, Howard, Andrew, Yang, Feng, Yang, Ming-Hsuan, Gilbert, Jeff, Milanfar, Peyman
Learning multiple domains/tasks with a single model is important for improving data efficiency and lowering inference cost for numerous vision tasks, especially on resource-constrained mobile devices. However, hand-crafting a multi-domain/task model can be both tedious and challenging. This paper proposes a novel approach to automatically learn a multi-path network for multi-domain visual classification on mobile devices. The proposed multi-path network is learned from neural architecture search by applying one reinforcement learning controller for each domain to select the best path in the super-network created from a MobileNetV3-like search space. An adaptive balanced domain prioritization algorithm is proposed to balance the optimization of the joint model across multiple domains trained simultaneously. The determined multi-path model selectively shares parameters across domains in shared nodes while keeping domain-specific parameters within non-shared nodes in individual domain paths. This approach effectively reduces the total number of parameters and FLOPS, encouraging positive knowledge transfer while mitigating negative interference across domains. Extensive evaluations on the Visual Decathlon dataset demonstrate that the proposed multi-path model achieves state-of-the-art performance in terms of accuracy, model size, and FLOPS against other approaches using MobileNetV3-like architectures. Furthermore, the proposed method improves average accuracy over learning single-domain models individually, and reduces the total number of parameters and FLOPS by 78% and 32%, respectively, compared to the approach that simply bundles single-domain models for multi-domain learning.
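A simplified sketch of the path-selection mechanism, not the actual MobileNetV3-like search space: each domain's reinforcement-learning controller samples one candidate block per layer of a shared super-network and is updated with a REINFORCE-style policy gradient. The tiny convolutional blocks, layer counts, and reward value below are all illustrative.

```python
import torch
import torch.nn as nn

# Super-network sketch: every layer offers several candidate blocks; a "path" is
# one block index per layer. Domains that pick the same block at a layer share
# its parameters; other blocks stay domain-specific. All sizes are illustrative.
class SuperNet(nn.Module):
    def __init__(self, num_layers=4, num_choices=3, width=32):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.ModuleList([nn.Conv2d(width, width, 3, padding=1)
                           for _ in range(num_choices)])
            for _ in range(num_layers)])

    def forward(self, x, path):
        for layer, choice in zip(self.layers, path):
            x = torch.relu(layer[choice](x))
        return x

# One REINFORCE controller per domain: a categorical policy over block choices.
class Controller(nn.Module):
    def __init__(self, num_layers=4, num_choices=3):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_layers, num_choices))

    def sample(self):
        dist = torch.distributions.Categorical(logits=self.logits)
        path = dist.sample()                      # one block choice per layer
        return path.tolist(), dist.log_prob(path).sum()

supernet, controller = SuperNet(), Controller()
path, log_prob = controller.sample()
out = supernet(torch.randn(1, 32, 8, 8), path)
reward = 0.8                                      # e.g., validation accuracy of this path
(-reward * log_prob).backward()                   # policy-gradient controller update
```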
Semi-Supervised Learning with Meta-Gradient
Zhang, Xin-Yu, Jia, Hao-Lin, Xiao, Taihong, Cheng, Ming-Ming, Yang, Ming-Hsuan
In this work, we propose a simple yet effective meta-learning algorithm for the semi-supervised setting. We observe that existing consistency-based approaches largely ignore the essential role of label information in consistency regularization. To alleviate this issue, we connect the consistency loss to the label information by unfolding and differentiating through one optimization step. Specifically, we update the pseudo labels of the unlabeled examples using the meta-gradients of the labeled data loss so that the model generalizes well on the labeled examples. In addition, we introduce a simple first-order approximation to avoid computing higher-order derivatives and to guarantee scalability. Extensive evaluations on the SVHN, CIFAR, and ImageNet datasets demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
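A toy sketch of the unrolled meta-step on a linear model: one SGD update on the pseudo-labeled loss is differentiated through, so the labeled loss at the updated weights yields a meta-gradient for the pseudo labels. For clarity the sketch computes the exact higher-order gradient; the paper's first-order approximation avoids this cost. All shapes and step sizes are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
w = torch.randn(10, 3, requires_grad=True)             # toy model parameters
x_u, x_l = torch.randn(8, 10), torch.randn(4, 10)      # unlabeled / labeled batches
y_l = torch.randint(0, 3, (4,))
pseudo = torch.zeros(8, 3, requires_grad=True)         # learnable soft pseudo labels

# Inner step: one SGD update on the pseudo-labeled loss, keeping the graph so
# the updated weights remain a differentiable function of the pseudo labels.
probs = torch.softmax(pseudo, dim=1)
inner_loss = -(probs * F.log_softmax(x_u @ w, dim=1)).sum(dim=1).mean()
g, = torch.autograd.grad(inner_loss, w, create_graph=True)
w_new = w - 0.1 * g

# Outer step: the labeled loss at the updated weights guides the pseudo labels.
outer_loss = F.cross_entropy(x_l @ w_new, y_l)
meta_grad, = torch.autograd.grad(outer_loss, pseudo)
pseudo = (pseudo - 1.0 * meta_grad).detach()           # refined pseudo labels
```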
Joint-task Self-supervised Learning for Temporal Correspondence
Li, Xueting, Liu, Sifei, Mello, Shalini De, Wang, Xiaolong, Kautz, Jan, Yang, Ming-Hsuan
This paper proposes to learn reliable dense correspondence from videos in a self-supervised manner. Our learning process integrates two highly related tasks: tracking large image regions and establishing fine-grained pixel-level associations between consecutive video frames. We exploit the synergy between both tasks through a shared inter-frame affinity matrix, which simultaneously models transitions between video frames at both the region and pixel levels. Region-level localization helps reduce ambiguities in fine-grained matching by narrowing down the search regions, while fine-grained matching provides bottom-up features that facilitate region-level localization. Our method outperforms the state-of-the-art self-supervised methods on a variety of visual correspondence tasks, including video-object and part-segmentation propagation, keypoint tracking, and object tracking.
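The shared affinity can be sketched in a few lines: dot products between L2-normalized per-pixel features of two frames, normalized with a softmax, give a transition matrix that transports labels (or region masks) between frames. The temperature and number of classes below are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def inter_frame_affinity(f1, f2, tau=0.07):
    """Dense affinity between all pixels of two frames.

    f1, f2: (C, H, W) L2-normalized features of consecutive frames.
    Returns A of shape (N1, N2), normalized over the frame-1 pixels.
    """
    C = f1.size(0)
    a = f1.reshape(C, -1).t() @ f2.reshape(C, -1)      # (H*W, H*W) similarities
    return torch.softmax(a / tau, dim=0)               # tau: illustrative temperature

C, H, W = 64, 32, 32
f1 = F.normalize(torch.randn(C, H, W), dim=0)
f2 = F.normalize(torch.randn(C, H, W), dim=0)
A = inter_frame_affinity(f1, f2)

# Pixel-level use: transport per-pixel soft labels from frame 1 to frame 2.
labels1 = torch.rand(H * W, 5)                         # 5 illustrative classes
labels2 = A.t() @ labels1                              # (H*W, 5) labels in frame 2
```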
Self-supervised Audio Spatialization with Correspondence Classifier
Lu, Yu-Ding, Lee, Hsin-Ying, Tseng, Hung-Yu, Yang, Ming-Hsuan
Spatial audio is an essential medium for delivering a 3D visual and auditory experience to audiences. However, the recording devices and techniques are expensive or inaccessible to the general public. In this work, we propose a self-supervised audio spatialization network that can generate spatial audio given the corresponding video and monaural audio. To enhance spatialization performance, we use an auxiliary classifier to distinguish ground-truth videos from those whose audio has the left and right channels swapped. We collect a large-scale video dataset with spatial audio to validate the proposed method. Experimental results demonstrate the effectiveness of the proposed model on the audio spatialization task.
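The auxiliary classifier can be sketched as a binary discriminator: real video/stereo pairs are positives, and pairs whose audio has the left and right channels flipped are negatives. The encoders and feature dimensions below are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Correspondence classifier: tells real stereo audio apart from channel-swapped
# audio, conditioned on the accompanying video features. Dims are illustrative.
class CorrespondenceClassifier(nn.Module):
    def __init__(self, audio_dim=128, video_dim=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(audio_dim + video_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, audio_feat, video_feat):
        return self.head(torch.cat([audio_feat, video_feat], dim=1))

def swap_channels(stereo):
    """Negative example: flip the left and right channels. stereo: (B, 2, T)"""
    return stereo.flip(dims=[1])

clf = CorrespondenceClassifier()
bce = nn.BCEWithLogitsLoss()
stereo = torch.randn(4, 2, 16000)
audio_enc = nn.Sequential(nn.Flatten(), nn.Linear(2 * 16000, 128))  # stand-in encoder
video_feat = torch.randn(4, 128)                                    # stand-in video feature

real_logit = clf(audio_enc(stereo), video_feat)
fake_logit = clf(audio_enc(swap_channels(stereo)), video_feat)
loss = bce(real_logit, torch.ones_like(real_logit)) + \
       bce(fake_logit, torch.zeros_like(fake_logit))
```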
Deep Non-Blind Deconvolution via Generalized Low-Rank Approximation
Ren, Wenqi, Zhang, Jiawei, Ma, Lin, Pan, Jinshan, Cao, Xiaochun, Zuo, Wangmeng, Liu, Wei, Yang, Ming-Hsuan
In this paper, we present a deep convolutional neural network to capture the inherent properties of image degradation, which can handle different kernels and saturated pixels in a unified framework. The proposed neural network is motivated by the low-rank property of pseudo-inverse kernels. Specifically, we first compute a generalized low-rank approximation to a large number of blur kernels, and then use separable filters to initialize the convolutional parameters in the network. Our analysis shows that the estimated decomposed matrices contain the most essential information of the input kernel, which enables the proposed network to handle various blurs in a unified framework and generate high-quality deblurring results. Experimental results on benchmark datasets with noisy and saturated pixels demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
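The separable-filter initialization can be illustrated with a per-kernel SVD: truncating the decomposition gives 1D vertical and horizontal filters whose outer products reconstruct the kernel, which is how separable convolution layers can be initialized. Note the paper computes a generalized low-rank approximation jointly over many kernels rather than the independent per-kernel SVD sketched here; the rank and kernel sizes are illustrative.

```python
import numpy as np

def separable_init(kernels, rank=4):
    """Low-rank approximation of a stack of (pseudo-inverse) blur kernels.

    kernels: (N, k, k) array of 2D kernels.
    Returns, per kernel, 1D column/row filters whose outer products reconstruct
    the kernel, usable to initialize separable convolution layers.
    """
    filters = []
    for K in kernels:
        U, s, Vt = np.linalg.svd(K)                      # K = U @ diag(s) @ Vt
        cols = U[:, :rank] * np.sqrt(s[:rank])           # (k, rank) vertical filters
        rows = (Vt[:rank, :].T * np.sqrt(s[:rank])).T    # (rank, k) horizontal filters
        filters.append((cols, rows))
    return filters

# Toy check: rank-4 reconstruction of a random 15x15 kernel.
kernels = np.random.rand(2, 15, 15)
(cols, rows), _ = separable_init(kernels)
approx = cols @ rows                                     # sum of rank-1 outer products
err = np.linalg.norm(kernels[0] - approx) / np.linalg.norm(kernels[0])
```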
Context-aware Synthesis and Placement of Object Instances
Lee, Donghoon, Liu, Sifei, Gu, Jinwei, Liu, Ming-Yu, Yang, Ming-Hsuan, Kautz, Jan
Learning to insert an object instance into an image in a semantically coherent manner is a challenging and interesting problem. Solving it requires (a) determining a location to place an object in the scene and (b) determining its appearance at the location. Such an object insertion model can potentially facilitate numerous image editing and scene parsing applications. In this paper, we propose an end-to-end trainable neural network for the task of inserting an object instance mask of a specified class into the semantic label map of an image. Our network consists of two generative modules where one determines where the inserted object mask should be (i.e., location and scale) and the other determines what the object mask shape (and pose) should look like. The two modules are connected together via a spatial transformation network and jointly trained. We devise a learning procedure that leverages both supervised and unsupervised data and show that our model can insert an object at diverse locations with various appearances. We conduct extensive experimental validations with comparisons to strong baselines to verify the effectiveness of the proposed network. Code is available at https://github.com/NVlabs/Instance_Insertion.
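The where/what decomposition can be sketched with a spatial transformer: a predicted affine transform (location and scale, the "where") warps a generated unit mask (the "what") into the full-resolution semantic map. The hand-set `scale`, `tx`, and `ty` values below stand in for the outputs of the two generative modules; this is an illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

def place_mask(mask, theta, out_hw):
    """mask: (B, 1, h, w) generated object mask; theta: (B, 2, 3) affine params."""
    B = mask.size(0)
    grid = F.affine_grid(theta, size=(B, 1, *out_hw), align_corners=False)
    return F.grid_sample(mask, grid, align_corners=False)  # zeros outside the mask

B, H, W = 1, 128, 128
mask = (torch.rand(B, 1, 32, 32) > 0.5).float()          # "what": generated mask
scale, tx, ty = 0.25, 0.3, -0.2                          # "where": predicted placement
# affine_grid maps output coords to input coords: p_in = (p_out - t) / scale.
theta = torch.tensor([[[1 / scale, 0.0, -tx / scale],
                       [0.0, 1 / scale, -ty / scale]]])
placed = place_mask(mask, theta, (H, W))                 # mask warped into the scene
semantic_map = torch.zeros(B, 1, H, W)
composited = torch.clamp(semantic_map + placed, 0, 1)
```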
Deep Attentive Tracking via Reciprocative Learning
Pu, Shi, Song, Yibing, Ma, Chao, Zhang, Honggang, Yang, Ming-Hsuan
Visual attention, a concept derived from cognitive neuroscience, facilitates human perception by focusing on the most pertinent subset of the sensory data. Recently, significant efforts have been made to exploit attention schemes to advance computer vision systems. For visual tracking, it is often challenging to track target objects undergoing large appearance changes. Attention maps facilitate visual tracking by selectively paying attention to temporally robust features. Existing tracking-by-detection approaches mainly use additional attention modules to generate feature weights, as the classifiers are not equipped with such mechanisms. In this paper, we propose a reciprocative learning algorithm to exploit visual attention for training deep classifiers. The proposed algorithm consists of feed-forward and backward operations to generate attention maps, which serve as regularization terms coupled with the original classification loss function for training. The deep classifier learns to attend to the regions of target objects robust to appearance changes. Extensive experiments on large-scale benchmark datasets show that the proposed attentive tracking method performs favorably against the state-of-the-art approaches.
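The following sketch illustrates the reciprocative mechanism: a forward pass produces classification scores, a backward pass of the true-class score onto the input features yields attention maps, and a statistic of those maps enters the training loss as a regularizer. The toy classifier and the particular mean-to-deviation regularizer are illustrative stand-ins, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

# Toy stand-in for the tracking-by-detection classifier.
classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))
features = torch.rand(4, 3, 32, 32, requires_grad=True)
labels = torch.tensor([1, 0, 1, 1])

# Forward pass: classification scores for the true classes.
logits = classifier(features)
score = logits.gather(1, labels.view(-1, 1)).sum()

# Backward pass onto the input yields per-pixel attention; keep the graph so the
# attention-based regularizer can itself be trained through.
attn, = torch.autograd.grad(score, features, create_graph=True)
attn = attn.relu().mean(dim=1).flatten(1)                # (B, H*W) attention maps

ce = nn.functional.cross_entropy(logits, labels)
# Illustrative regularizer: reward high, consistent attention responses via a
# large mean-to-deviation ratio of each sample's attention map.
reg = (attn.mean(dim=1) / (attn.std(dim=1) + 1e-8)).mean()
loss = ce - 0.1 * reg
loss.backward()
```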