Hierarchical Question-Answering for Driving Scene Understanding Using Vision-Language Models

Mohamud, Safaa Abdullahi Moallim; Baek, Minjin; Han, Dong Seog

arXiv.org Artificial Intelligence 

B. VLM Fine-Tuning and Hierarchical QA Inference

A compact VLM is used for both the fine-tuning and inference stages. All trainable parameters are fine-tuned on the collected dataset to ensure efficient adaptation, and the checkpoint with the highest VQA accuracy on the validation set is selected. Because smaller VLMs are limited in their ability to generate long paragraphs, a hierarchical questioning technique is used: instead of generating a single long paragraph, the system asks structured questions in a hierarchical manner, which keeps processing efficient and resource use low. This allows the VLM to generate meaningful scene descriptions without compromising speed or scalability. The hierarchical QA strategy also reduces inference time by dynamically selecting relevant questions based on the visual elements in the scene. For example, if the question "Is the ego vehicle moving on a straight road?" is answered affirmatively, subsequent questions such as "In which direction does the road curve?" are automatically answered as "none," reducing unnecessary computation. Similarly, if the answer to "Is it possible for the ego vehicle to turn right at the intersection?" is "no," other related questions, such as "Is there a vehicle approaching the intersection from the opposite direction?"
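The question-skipping behavior described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Question` class, the `run_hierarchical_qa` function, and the `ask` callable standing in for the fine-tuned VLM are all hypothetical names, and the sketch assumes answers are normalized to short strings such as "yes" or "no". Only the straight-road rule stated in the text is encoded as a gating rule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Question:
    text: str
    # Hypothetical gating rule: if the parent question was answered with
    # `skip_answer`, this question is not sent to the model and is filled
    # with `default` instead.
    parent: Optional[str] = None
    skip_answer: Optional[str] = None
    default: str = "none"


def run_hierarchical_qa(
    questions: List[Question], ask: Callable[[str], str]
) -> Tuple[Dict[str, str], int]:
    """Ask questions in order, skipping those made irrelevant by a parent answer.

    `ask` is a stand-in for a VLM call (an assumption of this sketch).
    Returns the answers and the number of model calls actually made.
    """
    answers: Dict[str, str] = {}
    model_calls = 0
    for q in questions:
        if q.parent is not None and answers.get(q.parent) == q.skip_answer:
            answers[q.text] = q.default  # answered without a model call
            continue
        answers[q.text] = ask(q.text)
        model_calls += 1
    return answers, model_calls


STRAIGHT = "Is the ego vehicle moving on a straight road?"
CURVE = "In which direction does the road curve?"
TURN_RIGHT = "Is it possible for the ego vehicle to turn right at the intersection?"

questions = [
    Question(STRAIGHT),
    Question(CURVE, parent=STRAIGHT, skip_answer="yes", default="none"),
    Question(TURN_RIGHT),
]


def fake_vlm(question: str) -> str:
    # Canned answers replacing a real model, for illustration only.
    canned = {STRAIGHT: "yes", TURN_RIGHT: "no"}
    return canned.get(question, "unknown")


answers, model_calls = run_hierarchical_qa(questions, fake_vlm)
print(answers[CURVE], model_calls)  # prints: none 2
```

With three questions in the list, only two reach the model, because the curve-direction question is filled in from the straight-road answer. Keeping the gating rule as data on each question, rather than as branching code, makes it straightforward to add further parent-child pairs.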
