Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering

Zhixuan Shen, Haonan Luo, Sijia Li, Tianrui Li

arXiv.org Artificial Intelligence 

Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content. Most existing methods heavily rely on the accuracy of Optical Character Recognition (OCR) systems, and aggressive fine-tuning based on limited spatial location information and erroneous OCR text information often leads to inevitable overfitting. In this paper, we propose a multimodal adversarial training architecture with spatial awareness capabilities. Specifically, we introduce an Adversarial OCR Enhancement (AOE) module, which leverages adversarial ...

These methods typically involve large-scale pretraining followed by fine-tuning to adapt the model for question-answering tasks in text-rich scene images, often ignoring the inevitable OCR text recognition challenges. In practice, scene images may exhibit phenomena such as blurring, distortion, skewness, or uneven lighting, leading to erroneous character recognition by OCR systems, especially in cases of low-quality handwriting. Even when OCR systems correctly identify characters, discrete and semantically irrelevant recognition results may impact the comprehension of the OCR text semantics.
