Enhancing Fine-Grained 3D Object Recognition using Hybrid Multi-Modal Vision Transformer-CNN Models
Xiong, Songsong, Tziafas, Georgios, Kasaei, Hamidreza
arXiv.org Artificial Intelligence
Robots operating in human-centered environments, such as retail stores, restaurants, and households, often need to distinguish between similar objects in different contexts with a high degree of accuracy. However, fine-grained object recognition remains a challenge in robotics because of high intra-category dissimilarity and low inter-category dissimilarity. The limited number of fine-grained 3D datasets makes the problem harder to address. In this paper, we propose a hybrid multi-modal approach that combines a Vision Transformer (ViT) with a Convolutional Neural Network (CNN) to improve the performance of fine-grained visual classification (FGVC). To address the shortage of FGVC 3D datasets, we generated two synthetic datasets. The first consists of 20 restaurant-related categories with a total of 100 instances, while the second contains 120 shoe instances. We evaluated our approach on both datasets, and the results indicate that it outperforms both CNN-only and ViT-only baselines, achieving recognition accuracies of 94.50% and 93.51% on the restaurant and shoe datasets, respectively. We have made our FGVC RGB-D datasets available to the research community to enable further experimentation. We also integrated the proposed method with a robot framework and demonstrated its potential as a fine-grained perception tool in both simulated and real-world robotic scenarios.
Mar-6-2023
- Country:
  - Europe
    - Austria (0.04)
    - France > Occitanie
      - Haute-Garonne > Toulouse (0.04)
    - Netherlands (0.04)
  - North America > United States
    - Arizona (0.04)
    - Massachusetts > Middlesex County
      - Cambridge (0.04)
- Genre:
  - Research Report > New Finding (1.00)
- Industry:
  - Consumer Products & Services > Restaurants (0.34)
  - Health & Medicine (0.68)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning
      - Neural Networks > Deep Learning (1.00)
      - Statistical Learning (1.00)
    - Robots (1.00)
    - Vision (1.00)