Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering
Shen, Zhixuan; Luo, Haonan; Li, Sijia; Li, Tianrui
arXiv.org Artificial Intelligence
Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content. Most existing methods heavily rely on the accuracy of Optical Character Recognition (OCR) systems, and aggressive fine-tuning based on limited spatial location information and erroneous OCR text information often leads to inevitable overfitting. In this paper, we propose a multimodal adversarial training architecture with spatial awareness capabilities. Specifically, we introduce an Adversarial OCR Enhancement (AOE) module, which leverages adversarial ...

These methods typically involve large-scale pretraining followed by fine-tuning to adapt the model for question-answering tasks in text-rich scene images, often ignoring the inevitable OCR text recognition challenges. In practice, scene images may exhibit phenomena such as blurring, distortion, skewness, or uneven lighting, leading to erroneous character recognition by OCR systems, especially in cases of low-quality handwriting. Even when OCR systems correctly identify characters, discrete and semantically irrelevant recognition results may impact the comprehension of the OCR text semantics.
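The excerpt names adversarial training with perturbation of the OCR modality but does not give the details of the AOE module. As a rough illustration of the general idea only, the sketch below applies a generic FGSM-style perturbation (a swapped-in, well-known technique, not the paper's method) to a single OCR token embedding. The toy linear scorer and the names `loss_and_grad`, `fgsm_perturb`, `emb`, `w`, and `epsilon` are all hypothetical and chosen for this example.

```python
import math


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def loss_and_grad(emb, w, label):
    """Logistic loss of a toy linear scorer on one OCR token embedding.

    Returns the loss and its gradient with respect to the embedding.
    The scorer is a stand-in assumption, not a real ST-VQA model.
    """
    z = sum(e * wi for e, wi in zip(emb, w))
    p = 1.0 / (1.0 + math.exp(-z))
    loss = -(label * math.log(p) + (1 - label) * math.log(1 - p))
    grad = [(p - label) * wi for wi in w]  # d(loss)/d(emb) for a linear scorer
    return loss, grad


def fgsm_perturb(emb, w, label, epsilon):
    """Shift each embedding coordinate by epsilon in the loss-increasing direction."""
    _, grad = loss_and_grad(emb, w, label)
    return [e + epsilon * sign(g) for e, g in zip(emb, grad)]


if __name__ == "__main__":
    emb = [0.5, -1.0, 2.0]   # hypothetical OCR token embedding
    w = [1.0, -2.0, 0.5]     # hypothetical scorer weights
    clean_loss, _ = loss_and_grad(emb, w, 1)
    adv_emb = fgsm_perturb(emb, w, 1, epsilon=0.1)
    adv_loss, _ = loss_and_grad(adv_emb, w, 1)
    print("clean loss:", clean_loss)
    print("adversarial loss:", adv_loss)
```

In an adversarial training loop, a model would typically be optimized on both the clean and the perturbed embeddings, so that small errors in the OCR representation have less influence on the answer. How the paper combines this with spatial information is not recoverable from the excerpt.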
Mar-14-2024