iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning

Neural Information Processing Systems 

Grounding large language models (LLMs) in domain-specific tasks like post-hoc dash-cam driving video analysis is challenging due to their general-purpose training and lack of structured inductive biases. As vision is often the sole modality available for such analysis (i.e.