Goto

Collaborating Authors

 sign recognition


VISAT: Benchmarking Adversarial and Distribution Shift Robustness in Traffic Sign Recognition with Visual Attributes

arXiv.org Artificial Intelligence

We present VISAT, a novel open dataset and benchmarking suite for evaluating model robustness in the task of traffic sign recognition with the presence of visual attributes. Built upon the Mapillary Traffic Sign Dataset (MTSD), our dataset introduces two benchmarks that respectively emphasize robustness against adversarial attacks and distribution shifts. For our adversarial attack benchmark, we employ the state-of-the-art Projected Gradient Descent (PGD) method to generate adversarial inputs and evaluate their impact on popular models. Additionally, we investigate the effect of adversarial attacks on attribute-specific multi-task learning (MTL) networks, revealing spurious correlations among MTL tasks. The MTL networks leverage visual attributes (color, shape, symbol, and text) that we have created for each traffic sign in our dataset. For our distribution shift benchmark, we utilize ImageNet-C's realistic data corruption and natural variation techniques to perform evaluations on the robustness of both base and MTL models. Moreover, we further explore spurious correlations among MTL tasks through synthetic alterations of traffic sign colors using color quantization techniques. Our experiments focus on two major backbones, ResNet-152 and ViT-B/32, and compare the performance between base and MTL models. The VISAT dataset and benchmarking framework contribute to the understanding of model robustness for traffic sign recognition, shedding light on the challenges posed by adversarial attacks and distribution shifts. We believe this work will facilitate advancements in developing more robust models for real-world applications in autonomous driving and cyber-physical systems.


Sign Language: Towards Sign Understanding for Robot Autonomy

arXiv.org Artificial Intelligence

Navigational signs are common aids for human wayfinding and scene understanding, but are underutilized by robots. We argue that they benefit robot navigation and scene understanding, by directly encoding privileged information on actions, spatial regions, and relations. Interpreting signs in open-world settings remains a challenge owing to the complexity of scenes and signs, but recent advances in vision-language models (VLMs) make this feasible. To advance progress in this area, we introduce the task of navigational sign understanding which parses locations and associated directions from signs. We offer a benchmark for this task, proposing appropriate evaluation metrics and curating a test set capturing signs with varying complexity and design across diverse public spaces, from hospitals to shopping malls to transport hubs. We also provide a baseline approach using VLMs, and demonstrate their promise on navigational sign understanding. Code and dataset are available on Github.


Road Traffic Sign Recognition method using Siamese network Combining Efficient-CNN based Encoder

arXiv.org Artificial Intelligence

IEEE TRANSACTIONS ON INTELLIGENT TRANSPORT A TION SYSTEMS 1 Road Traffic Sign Recognition Method Using Siamese Network Combining Efficient-CNN-Based Encoder Zhenghao Xi, Member, IEEE, Y uchao Shao, Y ang Zheng, Member, IEEE, Xiang Liu, Member, IEEE, Y aqi Liu, and Yitong Cai Abstract -- Traffic signs recognition (TSR) plays an essential role in assistant driving and intelligent transportation system. However, the noise of complex environment may lead to motion-blur or occlusion problems, which raise the tough challenge to real-time recognition with high accuracy and robust. In this article, we propose IECES-network which with improved encoders and Siamese net. The three-stage approach of our method includes Efficient-CNN based encoders, Siamese backbone and the fully-connected layers. We firstly use convolu-tional encoders to extract and encode the traffic sign features of augmented training samples and standard images. Then, we design the Siamese neural network with Efficient-CNN based encoder and contrastive loss function, which can be trained to improve the robustness of TSR problem when facing the samples of motion-blur and occlusion by computing the distance between inputs and templates. Additionally, the template branch of the proposed network can be stopped when executing the recognition tasks after training to raise the process speed of our real-time model, and alleviate the computational resource and parameter scale. Finally, we recombined the feature code and a fully-connected layer with SoftMax function to classify the codes of samples and recognize the category of traffic signs. The results of experiments on the Tsinghua-T encent 100K dataset and the German Traffic Sign Recognition Benchmark dataset demonstrate the performance of the proposed IECES-network. Compared with other state-of-the-art methods, in the case of motion-blur and occluded environment, the proposed method achieves competitive performance precision-recall and accuracy metric average is 88.1%, 86.43% and 86.1% with a 2.9M lightweight scale, respectively. Moreover, processing time of our model is 0.1s per frame, of which the speed is increased by 1.5 times compared with existing methods. Index T erms-- Traffic signs recognition, Siamese network, efficient-CNN based encoder . Received 11 September 2024; revised 25 November 2024; accepted 9 January 2025.


Mitigation of Camouflaged Adversarial Attacks in Autonomous Vehicles--A Case Study Using CARLA Simulator

arXiv.org Artificial Intelligence

Autonomous vehicles (AVs) rely heavily on cameras and artificial intelligence (AI) to make safe and accurate driving decisions. However, since AI is the core enabling technology, this raises serious cyber threats that hinder the large-scale adoption of AVs. Therefore, it becomes crucial to analyze the resilience of AV security systems against sophisticated attacks that manipulate camera inputs, deceiving AI models. In this paper, we develop camera-camouflaged adversarial attacks targeting traffic sign recognition (TSR) in AVs. Specifically, if the attack is initiated by modifying the texture of a stop sign to fool the AV's object detection system, thereby affecting the AV actuators. The attack's effectiveness is tested using the CARLA AV simulator and the results show that such an attack can delay the auto-braking response to the stop sign, resulting in potential safety issues. We conduct extensive experiments under various conditions, confirming that our new attack is effective and robust. Additionally, we address the attack by presenting mitigation strategies. The proposed attack and defense methods are applicable to other end-to-end trained autonomous cyber-physical systems.


Evaluating Adversarial Attacks on Traffic Sign Classifiers beyond Standard Baselines

arXiv.org Artificial Intelligence

Adversarial attacks on traffic sign classification models were among the first successfully tried in the real world. Since then, the research in this area has been mainly restricted to repeating baseline models, such as LISA-CNN or GTSRB-CNN, and similar experiment settings, including white and black patches on traffic signs. In this work, we decouple model architectures from the datasets and evaluate on further generic models to make a fair comparison. Furthermore, we compare two attack settings, inconspicuous and visible, which are usually regarded without direct comparison. Our results show that standard baselines like LISA-CNN or GTSRB-CNN are significantly more susceptible than the generic ones. We, therefore, suggest evaluating new attacks on a broader spectrum of baselines in the future. Our code is available at \url{https://github.com/KASTEL-MobilityLab/attacks-on-traffic-sign-recognition/}.


The American Sign Language Knowledge Graph: Infusing ASL Models with Linguistic Knowledge

arXiv.org Artificial Intelligence

Language models for American Sign Language (ASL) could make language technologies substantially more accessible to those who sign. To train models on tasks such as isolated sign recognition (ISR) and ASL-to-English translation, datasets provide annotated video examples of ASL signs. To facilitate the generalizability and explainability of these models, we introduce the American Sign Language Knowledge Graph (ASLKG), compiled from twelve sources of expert linguistic knowledge. We use the ASLKG to train neuro-symbolic models for 3 ASL understanding tasks, achieving accuracies of 91% on ISR, 14% for predicting the semantic features of unseen signs, and 36% for classifying the topic of Youtube-ASL videos.


Think Twice Before Recognizing: Large Multimodal Models for General Fine-grained Traffic Sign Recognition

arXiv.org Artificial Intelligence

We propose a new strategy called think twice before recognizing to improve fine-grained traffic sign recognition (TSR). Fine-grained TSR in the wild is difficult due to the complex road conditions, and existing approaches particularly struggle with cross-country TSR when data is lacking. Our strategy achieves effective fine-grained TSR by stimulating the multiple-thinking capability of large multimodal models (LMM). We introduce context, characteristic, and differential descriptions to design multiple thinking processes for the LMM. The context descriptions with center coordinate prompt optimization help the LMM to locate the target traffic sign in the original road images containing multiple traffic signs and filter irrelevant answers through the proposed prior traffic sign hypothesis. The characteristic description is based on few-shot in-context learning of template traffic signs, which decreases the cross-domain difference and enhances the fine-grained recognition capability of the LMM. The differential descriptions of similar traffic signs optimize the multimodal thinking capability of the LMM. The proposed method is independent of training data and requires only simple and uniform instructions. We conducted extensive experiments on three benchmark datasets and two real-world datasets from different countries, and the proposed method achieves state-of-the-art TSR results on all five datasets.


Cross-domain Few-shot In-context Learning for Enhancing Traffic Sign Recognition

arXiv.org Artificial Intelligence

Recent multimodal large language models (MLLM) such as GPT-4o and GPT-4v have shown great potential in autonomous driving. In this paper, we propose a cross-domain few-shot in-context learning method based on the MLLM for enhancing traffic sign recognition (TSR). We first construct a traffic sign detection network based on Vision Transformer Adapter and an extraction module to extract traffic signs from the original road images. To reduce the dependence on training data and improve the performance stability of cross-country TSR, we introduce a cross-domain few-shot in-context learning method based on the MLLM. To enhance MLLM's fine-grained recognition ability of traffic signs, the proposed method generates corresponding description texts using template traffic signs. These description texts contain key information about the shape, color, and composition of traffic signs, which can stimulate the ability of MLLM to perceive fine-grained traffic sign categories. By using the description texts, our method reduces the cross-domain differences between template and real traffic signs. Our approach requires only simple and uniform textual indications, without the need for large-scale traffic sign images and labels. We perform comprehensive evaluations on the German traffic sign recognition benchmark dataset, the Belgium traffic sign dataset, and two real-world datasets taken from Japan. The experimental results show that our method significantly enhances the TSR performance.


Stream State-tying for Sign Language Recognition

arXiv.org Artificial Intelligence

It is a kind of visual language via hand and arm movements accompanying facial expression and lip motion. The facial expression and lip motion are less important than hand gestures in sign language, but they may help to understand some hand gestures. Digitized devices can be used to measure the temporal and spatial information of hand gestures, the typical devices are data gloves, position trackers. In this paper, we use two CyberGloves and a position tracker, i.e., Pohelmus 3SPACE with two receivers positioned on the wrist of each CyberGlove and one fixed at thorax as input devices to measure gestures. Chinese sign language is classified into two categories. One is hand gesture in which each gesture corresponds to a Chinese phrase. The other is fingerspelling in which each alphabet corresponds to a posture, and each Chinese sign corresponds to several postures performed continuously.


Adversarial Attacks on Traffic Sign Recognition: A Survey

arXiv.org Artificial Intelligence

Traffic sign recognition is an essential component of perception in autonomous vehicles, which is currently performed almost exclusively with deep neural networks (DNNs). However, DNNs are known to be vulnerable to adversarial attacks. Several previous works have demonstrated the feasibility of adversarial attacks on traffic sign recognition models. Traffic signs are particularly promising for adversarial attack research due to the ease of performing real-world attacks using printed signs or stickers. In this work, we survey existing works performing either digital or real-world attacks on traffic sign detection and classification models. We provide an overview of the latest advancements and highlight the existing research areas that require further investigation.