WearVQA: AVisual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios
–Neural Information Processing Systems
We introduce WearVQA, the first benchmark specifically designed to evaluate the Visual Question Answering (VQA) capabilities of multi-modal AI assistant on wearable devices like smart glasses. Unlike prior benchmarks that focus on highquality, third-person imagery, WearVQA reflects the unique challenges of egocentric interaction--where visual inputs may be occluded, poorly lit, unzoomed, or blurry, and questions are grounded in realistic wearable use cases. The benchmark comprises 2,520 carefully curated image-question-answer triplets, spanning 7 diverse image domains including both text-centric and general scenes, 10 cognitive task types ranging from basic recognition to various forms of reasoning, and 6 common wearables-specific image quality issues. All questions are designed to be answerable using only the visual input and common senses. WearVQA is paired with a rigorous LLM-as-a-judge evaluation framework with 96% labeling accuracy. Open-source and proprietary multi-modal LLMs achieved a QA accuracy as low as 24-52% on WearVQA, with substantial drops on lower-quality images and reasoning-heavy tasks.
Neural Information Processing Systems
Jun-22-2026, 22:33:52 GMT
- Country:
- North America > United States (0.46)
- Genre:
- Research Report > Experimental Study (1.00)
- Industry:
- Information Technology > Security & Privacy (0.93)
- Law (0.93)
- Technology: