Evaluating OCR-Based Capabilities of LLMs in Video Scenarios
–Neural Information Processing Systems
Multimodal Large Language Models (MLLMs) have achieved considerable accuracy in Optical Character Recognition (OCR) from static images. However, their efficacy in video OCR is significantly diminished by factors such as motion blur, temporal variations, and visual effects inherent in video content. To provide clearer guidance for training practical MLLMs, we introduce the MME-VideoOCR benchmark, which encompasses a comprehensive range of video OCR application scenarios.
Jun-23-2026, 08:51:41 GMT
- Country:
- Asia (0.68)
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (0.67)
- Industry:
- Media (0.46)
- Technology: