
 Tang, Sheng


Learning Monocular Depth from Events via Egomotion Compensation

arXiv.org Artificial Intelligence

Event cameras are neuromorphically inspired sensors that sparsely and asynchronously report brightness changes. Their unique characteristics of high temporal resolution, high dynamic range, and low power consumption make them well-suited for addressing challenges in monocular depth estimation (e.g., high-speed or low-lighting conditions). However, existing methods primarily treat event streams as black-box learning systems without incorporating prior physical principles, thus becoming over-parameterized and failing to fully exploit the rich temporal information inherent in event camera data. To address this limitation, we incorporate physical motion principles to propose an interpretable monocular depth estimation framework, where the likelihood of various depth hypotheses is explicitly determined by the effect of motion compensation. To this end, we propose a Focus Cost Discrimination (FCD) module that measures the clarity of edges as an essential indicator of focus level and integrates spatial surroundings to facilitate cost estimation. Furthermore, we analyze the noise patterns within our framework and improve it with the newly introduced Inter-Hypotheses Cost Aggregation (IHCA) module, in which the cost volume is refined through cost trend prediction and multi-scale cost consistency constraints. Extensive experiments on real-world and synthetic datasets demonstrate that our framework outperforms cutting-edge methods by up to 10% in absolute relative error, demonstrating superior prediction accuracy.
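To make the motion-compensation idea concrete, here is a minimal sketch of scoring depth hypotheses by how sharply warped events focus, in the style of contrast-maximization. This is an assumed simplification, not the paper's FCD/IHCA modules: it supposes a known pure-translation camera velocity, a single scene depth per hypothesis, and uses the variance of the image of warped events as the focus measure. All names and shapes are illustrative.

```python
# Hypothetical sketch: focus-based depth scoring via egomotion compensation.
# For each candidate depth, warp events along the camera translation and
# score how sharply the warped events accumulate (higher variance = sharper).
import numpy as np

def warp_events(xs, ys, ts, t_ref, velocity, depth):
    """Shift each event to time t_ref assuming pure translation at `velocity`;
    apparent image motion is inversely proportional to scene depth."""
    dt = t_ref - ts
    return xs + velocity[0] * dt / depth, ys + velocity[1] * dt / depth

def focus_score(xs, ys, shape):
    """Variance of the image of warped events: a simple edge-sharpness proxy."""
    iwe, _, _ = np.histogram2d(ys, xs, bins=shape,
                               range=[[0, shape[0]], [0, shape[1]]])
    return iwe.var()

def best_depth(events, velocity, shape, hypotheses):
    """Evaluate every depth hypothesis and return the best-focused one."""
    xs, ys, ts = events["x"], events["y"], events["t"]
    scores = [focus_score(*warp_events(xs, ys, ts, ts.max(), velocity, d), shape)
              for d in hypotheses]
    return hypotheses[int(np.argmax(scores))], np.asarray(scores)
```

In a full pipeline this score would be computed per pixel or patch rather than globally, which is roughly where a learned cost-discrimination module would take over.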


Topology-Preserving Adversarial Training

arXiv.org Artificial Intelligence

Despite its effectiveness in improving the robustness of neural networks, adversarial training suffers from the natural accuracy degradation problem, i.e., accuracy on natural samples drops significantly. In this study, we reveal through quantitative and qualitative experiments that natural accuracy degradation is highly related to the disruption of the natural sample topology in the representation space. Based on this observation, we propose Topology-pReserving Adversarial traINing (TRAIN) to alleviate the problem by preserving, during adversarial training, the topology of natural samples as captured by a standard model trained only on natural samples. As an additional regularization, our method can easily be combined with various popular adversarial training algorithms in a plug-and-play manner, taking advantage of both sides. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet show that our proposed method achieves consistent and significant improvements over various strong baselines in most cases. Specifically, without additional data, our proposed method achieves up to 8.78% improvement in natural accuracy and 4.50% improvement in robust accuracy.
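A minimal sketch of one plausible form of such a topology-preserving regularizer follows; the exact loss is an assumption, not the authors' formulation. The idea shown: match the pairwise feature-similarity distribution of the robust model on a natural batch to that of a frozen standard model, so relative neighborhood structure is preserved.

```python
# Hypothetical sketch of a topology-preservation term for adversarial training.
# feat_robust / feat_standard: [batch, dim] features on the SAME natural batch;
# the standard model is frozen, so its similarities act as the target topology.
import torch
import torch.nn.functional as F

def topology_loss(feat_robust, feat_standard, tau=0.1):
    def sim_log_dist(f):
        f = F.normalize(f, dim=1)
        logits = f @ f.t() / tau
        logits.fill_diagonal_(-1e9)  # exclude trivial self-similarity
        return F.log_softmax(logits, dim=1)
    p = sim_log_dist(feat_standard).exp().detach()  # frozen target topology
    log_q = sim_log_dist(feat_robust)               # current topology
    return F.kl_div(log_q, p, reduction="batchmean")
```

Added to a standard adversarial loss with a small weight, a term like this penalizes representations whose neighborhood structure drifts from the natural-sample topology.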


Overcoming Classifier Imbalance for Long-tail Object Detection with Balanced Group Softmax

arXiv.org Machine Learning

Solving long-tail large vocabulary object detection with deep learning based models is a challenging and demanding task, which is however under-explored. In this work, we provide the first systematic analysis of the underperformance of state-of-the-art models under long-tail distributions. We find that existing detection methods are unable to model few-shot classes when the dataset is extremely skewed, which can result in classifier imbalance in terms of parameter magnitude. Directly adapting long-tail classification models to detection frameworks cannot solve this problem due to the intrinsic difference between detection and classification. In this work, we propose a novel balanced group softmax (BAGS) module for balancing the classifiers within the detection frameworks through group-wise training. It implicitly modulates the training process for the head and tail classes and ensures they are both sufficiently trained, without requiring any extra sampling for the instances from the tail classes. Extensive experiments on the very recent long-tail large vocabulary object recognition benchmark LVIS show that our proposed BAGS significantly improves the performance of detectors with various backbones and frameworks on both object detection and instance segmentation. It beats all state-of-the-art methods transferred from long-tail image classification and establishes a new state of the art. Code is available at https://github.com/FishYuLi/BalancedGroupSoftmax.
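The following is a minimal sketch of the group-wise softmax idea, with the grouping rule and the handling of out-of-group labels assumed for illustration (see the linked repository for the actual formulation). Classes are binned by training-instance count; each group runs its own softmax with an extra "others" bin that absorbs labels falling outside the group, so rare classes are not overwhelmed by frequent ones in a single normalization.

```python
# Hypothetical sketch of a group-wise softmax loss in the spirit of BAGS.
# logits: [N, C]; labels: [N] (LongTensor of class ids);
# groups: list of LongTensors, each holding the class ids of one group
# (e.g., binned by per-class training-instance count).
import torch
import torch.nn.functional as F

def balanced_group_softmax_loss(logits, labels, groups):
    loss = 0.0
    for cls_ids in groups:
        group_logits = logits[:, cls_ids]                        # [N, |g|]
        others = torch.full_like(labels, group_logits.size(1))   # "others" index
        in_group = labels.unsqueeze(1) == cls_ids.unsqueeze(0)   # [N, |g|] bool
        # Labels inside the group map to their local index; the rest to "others".
        target = torch.where(in_group.any(1), in_group.float().argmax(1), others)
        # Append a constant zero logit as the "others" bin.
        padded = torch.cat(
            [group_logits, group_logits.new_zeros(group_logits.size(0), 1)], dim=1)
        loss = loss + F.cross_entropy(padded, target)
    return loss / len(groups)
```

Because each softmax is normalized only within its group, head classes no longer suppress the logits of tail classes during training.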


Auto-Balanced Filter Pruning for Efficient Convolutional Neural Networks

AAAI Conferences

In recent years, considerable research effort has been devoted to compression techniques for convolutional neural networks (CNNs). Many works so far have focused on CNN connection pruning methods, which produce sparse parameter tensors in convolutional or fully-connected layers. Several studies have demonstrated that even simple methods can effectively eliminate connections of a CNN. However, since these methods merely make parameter tensors sparser, not smaller, the compression may not translate directly into acceleration without support from specially designed hardware. In this paper, we propose an iterative approach named Auto-balanced Filter Pruning, where we pre-train the network in an innovative auto-balanced way to transfer the representational capacity of its convolutional layers to a fraction of the filters, prune the redundant ones, and then re-train it to restore the accuracy. In this way, a smaller version of the original network is learned and the floating-point operations (FLOPs) are reduced. By applying this method to several common CNNs, we show that a large portion of the filters can be discarded without an obvious accuracy drop, significantly reducing the computational burden. Concretely, we reduce the inference cost of LeNet-5 on MNIST, and VGG-16 and ResNet-56 on CIFAR-10, by 95.1%, 79.7%, and 60.9%, respectively.
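One way to read the "auto-balanced" pre-training is as an opposing regularization on filter norms, sketched below under stated assumptions: the split ratio, norm choice, and penalty strength are illustrative, not the paper's exact scheme. The term pushes the weakest filters in each convolutional layer toward zero while rewarding the strongest, so representational capacity concentrates in the filters that will survive pruning.

```python
# Hypothetical sketch of an auto-balanced regularizer for filter pruning.
# conv_weight: [out_channels, in_channels, k, k]; returns a scalar penalty
# to be added to the task loss during pre-training.
import torch

def auto_balanced_penalty(conv_weight, keep_ratio=0.5, strength=1e-4):
    norms = conv_weight.flatten(1).norm(p=2, dim=1)   # per-filter L2 norm
    k = max(1, int(keep_ratio * norms.numel()))
    keep = torch.topk(norms, k).indices               # strongest filters to keep
    mask = torch.zeros_like(norms, dtype=torch.bool)
    mask[keep] = True
    # Shrink redundant filters toward zero; reward the kept ones.
    return strength * (norms[~mask].pow(2).sum() - norms[mask].pow(2).sum())
```

After pre-training with such a term, filters below the cutoff can be removed outright (shrinking the tensor, not just sparsifying it), followed by re-training to recover accuracy.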


Zero-Shot Learning With Attribute Selection

AAAI Conferences

Zero-shot learning (ZSL) is regarded as an effective way to construct classification models for target classes that have no labeled samples available. The basic framework transfers knowledge from (different) auxiliary source classes having sufficient labeled samples, using attributes shared by target and source classes as a bridge. Attributes play an important role in ZSL, but they have not gained sufficient attention in recent years. Previous works mostly assume attributes are perfect and treat each attribute equally. However, as shown in this paper, different attributes have different properties, such as their class distribution, variance, and entropy, which may considerably affect ZSL accuracy if treated equally. Based on this observation, in this paper we propose to use a subset of attributes, instead of the whole set, for building ZSL models. The attribute selection is conducted by considering the information amount and predictability under a novel joint optimization framework. To our knowledge, this is the first work that notices the influence of attributes themselves and proposes to use a refined attribute set for ZSL. Since our approach focuses on selecting good attributes for ZSL, it can be combined with any attribute-based ZSL approach to augment its performance. Experiments on four ZSL benchmarks demonstrate that our approach can improve zero-shot classification accuracy and yield state-of-the-art results.
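As a toy illustration of combining "information amount" with "predictability" when ranking attributes, here is a sketch; the weighted-sum combination is an assumption for clarity, since the paper uses a joint optimization rather than independent scoring.

```python
# Hypothetical sketch: rank attributes by entropy over the class-attribute
# matrix (information amount) plus an externally supplied predictability
# score (e.g., held-out accuracy of per-attribute classifiers).
import numpy as np

def attribute_entropy(class_attr):
    """class_attr: [num_classes, num_attrs] binary matrix; returns bits/attr."""
    p = class_attr.mean(axis=0).clip(1e-6, 1 - 1e-6)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def select_attributes(class_attr, predictability, k, alpha=0.5):
    """Pick the top-k attributes by a weighted score of both criteria."""
    score = alpha * attribute_entropy(class_attr) + (1 - alpha) * predictability
    return np.argsort(score)[::-1][:k]
```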


Image Caption with Global-Local Attention

AAAI Conferences

Image captioning is becoming increasingly important in the field of artificial intelligence. Most existing methods based on the CNN-RNN framework suffer from object missing and misprediction due to the mere use of a global image-level representation. To address these problems, in this paper we propose a global-local attention (GLA) method that integrates local representations at the object level with the global representation at the image level through an attention mechanism. Our method can thus predict salient objects more precisely, with high recall, while concurrently preserving context information at the image level. As a result, our proposed GLA method generates more relevant sentences and achieves state-of-the-art performance on the well-known Microsoft COCO caption dataset across several popular metrics.
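A minimal sketch of one way to fuse a global image feature with object-level features via additive attention is given below; the layer sizes and fusion rule are assumptions for illustration, not the authors' exact GLA architecture. The global feature is treated as one more "region" alongside the detected objects, and the decoder's hidden state drives the attention weights at each step.

```python
# Hypothetical sketch of global-local additive attention for captioning.
# global_feat: [N, feat_dim]; local_feats: [N, R, feat_dim] (object regions);
# hidden: [N, hidden_dim] (decoder state). Returns an attended context vector.
import torch
import torch.nn as nn

class GlobalLocalAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, hidden_dim)
        self.proj_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, global_feat, local_feats, hidden):
        # Stack the global feature with the object features: [N, R+1, feat_dim].
        regions = torch.cat([global_feat.unsqueeze(1), local_feats], dim=1)
        e = self.score(torch.tanh(self.proj_feat(regions)
                                  + self.proj_hidden(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)       # attention weights [N, R+1, 1]
        return (alpha * regions).sum(dim=1)   # attended context [N, feat_dim]
```

The context vector would then feed the RNN decoder at each time step, letting the model shift between image-level context and individual objects as words are generated.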