
Collaborating Authors

Tang, Yuanyan


Group Multi-View Transformer for 3D Shape Analysis with Spatial Encoding

arXiv.org Artificial Intelligence

In recent years, the results of view-based 3D shape recognition methods have saturated, and models with excellent performance cannot be deployed on memory-limited devices because of their huge number of parameters. To address this problem, we introduce a compression method based on knowledge distillation for this field, which greatly reduces the number of parameters while preserving model performance as much as possible. Specifically, to enhance the capabilities of smaller models, we design a high-performing large model called Group Multi-view Vision Transformer (GMViT). In GMViT, a view-level ViT first establishes relationships between view-level features. To capture deeper features, a grouping module then aggregates view-level features into group-level features. Finally, a group-level ViT fuses the group-level features into a complete, well-formed 3D shape descriptor. Notably, in both ViTs, we introduce spatial encoding of the camera coordinates as a novel position embedding. Furthermore, we propose two compressed versions of GMViT, namely GMViT-simple and GMViT-mini. To improve the training of the small models, we apply knowledge distillation throughout the GMViT pipeline, where the key outputs of each GMViT component serve as distillation targets. Extensive experiments demonstrate the efficacy of the proposed method. The large model GMViT achieves excellent 3D classification and retrieval results on the benchmark datasets ModelNet, ShapeNetCore55, and MCB. The smaller models, GMViT-simple and GMViT-mini, reduce the number of parameters by factors of 8 and 17.6, respectively, and improve shape recognition speed by a factor of 1.5 on average, while preserving at least 90% of the classification and retrieval performance.
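The abstract describes the pipeline only at a high level. The PyTorch sketch below illustrates one plausible reading of it: a per-view backbone, spatial encoding of camera coordinates as position embeddings, a view-level transformer, a soft grouping step, and a group-level transformer whose intermediate outputs can serve as distillation targets. All layer sizes, the soft-grouping mechanism, the stand-in backbone, and the names (SpatialEncoding, GMViTSketch, distill_loss) are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpatialEncoding(nn.Module):
        # Map each view's 3D camera coordinates to a position embedding.
        def __init__(self, dim):
            super().__init__()
            self.proj = nn.Sequential(nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))

        def forward(self, cam_xyz):                # cam_xyz: (B, V, 3)
            return self.proj(cam_xyz)              # (B, V, dim)

    class GMViTSketch(nn.Module):
        # View-level ViT -> grouping -> group-level ViT -> 3D shape descriptor.
        def __init__(self, dim=512, num_groups=4, num_classes=40):
            super().__init__()
            self.backbone = nn.Sequential(         # stand-in per-view feature extractor
                nn.Conv2d(3, dim, kernel_size=7, stride=4), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1))
            self.pos = SpatialEncoding(dim)
            layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            self.view_vit = nn.TransformerEncoder(layer, num_layers=2)
            self.group_vit = nn.TransformerEncoder(layer, num_layers=2)
            self.group_tokens = nn.Parameter(torch.randn(num_groups, dim))
            self.head = nn.Linear(dim, num_classes)

        def forward(self, views, cam_xyz):         # views: (B, V, 3, H, W)
            B, V = views.shape[:2]
            feats = self.backbone(views.flatten(0, 1)).flatten(1).view(B, V, -1)
            pos = self.pos(cam_xyz)                                # spatial position embedding
            view_feats = self.view_vit(feats + pos)                # view-level relationships
            assign = torch.softmax(view_feats @ self.group_tokens.T, dim=1)   # (B, V, G)
            group_feats = assign.transpose(1, 2) @ view_feats      # (B, G, dim) group features
            group_pos = assign.transpose(1, 2) @ pos               # group-level spatial encoding
            descriptor = self.group_vit(group_feats + group_pos).mean(dim=1)  # shape descriptor
            return self.head(descriptor), (view_feats, group_feats, descriptor)

    def distill_loss(student_feats, teacher_feats):
        # Match each intermediate student output to the corresponding frozen teacher output.
        return sum(F.mse_loss(s, t.detach()) for s, t in zip(student_feats, teacher_feats))

Under this reading, a compressed student such as GMViT-simple or GMViT-mini (e.g., with fewer layers or a smaller dim) would be trained with the usual classification loss plus distill_loss computed against the frozen GMViT teacher's intermediate outputs.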


A Computational Model for Saliency Maps by Using Local Entropy

AAAI Conferences

This paper presents a computational framework for saliency maps. It employs an Earth Mover's Distance based on weighted histograms (EMD-wH) to measure the center-surround difference, instead of the Difference-of-Gaussian (DoG) filter used by traditional models. In addition, the model employs not only traditional features such as color, intensity, and orientation but also local entropy, which expresses local complexity. The major advantage of combining the local entropy map is that the model can also detect salient regions that are not complex regions. Moreover, it uses a general framework to integrate the feature dimensions instead of summing the features directly. This model considers both local and global salient information, in contrast to existing models that consider only one or the other. Furthermore, the "large scale bias" and "central bias" hypotheses are used to select fixation locations in the saliency maps at different scales. The performance of the model is assessed by comparing its saliency maps with human fixation density. Finally, the results of this model are compared with those of other bottom-up models for reference.
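The two key ingredients described above, a local-entropy feature map and a histogram-based EMD for the center-surround difference, can be sketched in a few lines of NumPy. The window sizes, bin counts, and plain intensity histograms below are illustrative assumptions rather than the paper's exact weighting scheme; the 1-D EMD is computed as the L1 distance between cumulative histograms, which is exact for histograms on a line.

    import numpy as np

    def local_entropy(gray, win=9, bins=16):
        # Per-pixel Shannon entropy of a win x win neighborhood; expresses local complexity.
        q = np.clip((gray * bins).astype(int), 0, bins - 1)   # quantize intensities in [0, 1]
        pad = win // 2
        qp = np.pad(q, pad, mode='reflect')
        out = np.zeros_like(gray, dtype=float)
        for i in range(gray.shape[0]):
            for j in range(gray.shape[1]):
                patch = qp[i:i + win, j:j + win]
                p = np.bincount(patch.ravel(), minlength=bins) / patch.size
                p = p[p > 0]
                out[i, j] = -(p * np.log2(p)).sum()
        return out

    def emd_1d(h1, h2):
        # EMD between two 1-D histograms: L1 distance between their cumulative sums.
        c1 = np.cumsum(h1 / h1.sum())
        c2 = np.cumsum(h2 / h2.sum())
        return np.abs(c1 - c2).sum()

    def center_surround_difference(gray, y, x, c=8, s=24, bins=16):
        # Center-surround difference at (y, x): EMD between the intensity histogram
        # of a small center patch and that of a larger surround patch.
        def hist(patch):
            h, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
            return h.astype(float) + 1e-8                     # avoid empty histograms
        center = gray[max(y - c, 0):y + c, max(x - c, 0):x + c]
        surround = gray[max(y - s, 0):y + s, max(x - s, 0):x + s]
        return emd_1d(hist(center), hist(surround))

    # Example: stand-in grayscale image in [0, 1]
    gray = np.random.rand(128, 128)
    ent_map = local_entropy(gray)
    print(center_surround_difference(gray, 64, 64), ent_map.mean())

In the full model, per-feature maps of this kind (color, intensity, orientation, and local entropy) would be normalized and integrated by the general combination framework rather than summed directly.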