Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering

Shen, Zhixuan; Luo, Haonan; Li, Sijia; Li, Tianrui

Mar-14-2024–arXiv.org Artificial Intelligence

Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content. Most existing methods heavily rely on the accuracy of Optical Character Recognition (OCR) systems, and aggressive fine-tuning based on limited spatial location information and erroneous OCR text information often leads to inevitable overfitting. In this paper, we propose a multimodal adversarial training architecture with spatial awareness capabilities. Specifically, we introduce an Adversarial OCR Enhancement (AOE) module, which leverages adversarial ...

These methods typically involve large-scale pretraining followed by fine-tuning to adapt the model for question-answering tasks in text-rich scene images, often ignoring the inevitable OCR text recognition challenges. In practice, scene images may exhibit phenomena such as blurring, distortion, skewness, or uneven lighting, leading to erroneous character recognition by OCR systems, especially in cases of low-quality handwriting. Even when OCR systems correctly identify characters, discrete and semantically irrelevant OCR recognition results may impact the comprehension of the text semantics.

adversarial training, embedding, ocr

Country:
- Asia > China > Sichuan Province (0.04)

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Vision > Optical Character Recognition (1.00)
  - Machine Learning (1.00)