ROSA: Addressing text understanding challenges in photographs via ROtated SAmpling

Maina, Hernán; Ivetta, Guido; Stuto, Mateo Lione; Eisenschlos, Julian Martin; Sánchez, Jorge; Benotti, Luciana

Jun-5-2025–arXiv.org Artificial Intelligence

Visually impaired people could benefit from Visual Question Answering (VQA) systems to interpret text in their surroundings. However, current models often struggle with recognizing text in the photos taken by this population. Through in-depth interviews with visually impaired individuals, we identified common framing conventions that frequently result in misaligned text. Existing VQA benchmarks primarily feature well-oriented text captured by sighted users, under-representing these challenges. To address this gap, we introduce ROtated SAmpling (ROSA), a decoding strategy that enhances VQA performance in text-rich images with incorrectly oriented text. ROSA outperforms Greedy decoding by 11.7 absolute points in the best-performing model.

artificial intelligence, machine learning, natural language

Country:
- North America > United States (0.28)

Genre:
- Research Report > New Finding (1.00)
- Questionnaire & Opinion Survey (1.00)
- Personal > Interview (0.66)

Industry:
- Health & Medicine (0.73)
- Education (0.68)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (1.00)
  - Artificial Intelligence
    - Natural Language (1.00)
    - Vision (0.94)
    - Machine Learning > Neural Networks
      - Deep Learning (0.93)
