Collaborating Authors

 Otaegui, Oihana


Exploration of VLMs for Driver Monitoring Systems Applications

arXiv.org Artificial Intelligence

Vision-Language Models (VLMs) have the potential to revolutionize driver and in-cabin monitoring by offering a more holistic understanding of the driving scene. Rather than focusing on individual variables, VLMs are trained to describe the entire scene, considering all crucial elements. This comprehensive approach allows them to construct a coherent narrative around the scene, leading to a more thorough assessment of the driver's situation. Despite the potential benefits, there is a notable lack of scientific research exploring the application of VLMs in this field. We aim to conduct an initial exploration of how these systems perform in tasks such as distraction detection, drowsiness detection, and gaze estimation. By evaluating their performance, we hope to determine whether they can match or even surpass state-of-the-art models, or identify areas where they fall short. To achieve this, we will use data from the Driver Monitoring Dataset (DMD), which contains extensive footage of drivers in various states of drowsiness and distraction, covering actions that imply distraction such as texting, making a phone call, and drinking water, alongside safe driving, as well as detailed gaze annotations. By integrating VLMs into driver monitoring systems (DMS), we expect the model to achieve better scene comprehension, enabling it to provide detailed descriptions and respond to queries through Visual Question Answering (VQA) tasks.
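To make the evaluation setup concrete, the sketch below shows one way a VQA-style query to a VLM could be scored against DMD-style distraction labels. It is a minimal illustration under assumptions: the function query_vlm is a hypothetical placeholder for whatever VLM backend is under test, the label set, prompt wording, and the keyword-based answer mapping are our own simplifications, and none of this reflects the actual evaluation protocol of the paper.

from dataclasses import dataclass

# DMD-style distraction-related activity labels (illustrative subset; the exact
# label set and naming are assumptions, not necessarily the dataset's).
DISTRACTION_LABELS = ["texting", "phone_call", "drinking", "safe_driving"]


def query_vlm(image_path: str, question: str) -> str:
    """Hypothetical VLM interface: send an in-cabin frame plus a VQA-style question
    and receive a free-text answer. Replace this stub with a call to the VLM under test."""
    raise NotImplementedError("Plug in the actual VLM backend here.")


def answer_to_label(answer: str) -> str:
    """Map the VLM's free-text answer onto one of the activity labels with simple
    keyword matching; a real evaluation would need a more robust mapping."""
    text = answer.lower()
    if "text" in text or "typing" in text:
        return "texting"
    if "phone" in text or "call" in text:
        return "phone_call"
    if "drink" in text or "bottle" in text:
        return "drinking"
    return "safe_driving"


@dataclass
class Sample:
    image_path: str
    true_label: str  # ground-truth DMD annotation


def evaluate_distraction_detection(samples: list[Sample]) -> float:
    """Run the VQA prompt over annotated frames and report accuracy against the labels."""
    question = (
        "Describe what the driver is doing. Is the driver distracted "
        "(e.g. texting, on a phone call, drinking) or driving attentively?"
    )
    correct = 0
    for sample in samples:
        answer = query_vlm(sample.image_path, question)
        if answer_to_label(answer) == sample.true_label:
            correct += 1
    return correct / len(samples) if samples else 0.0

The same loop extends naturally to drowsiness detection or gaze-related questions by swapping the prompt and the label mapping.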


Learning Gaze-aware Compositional GAN

arXiv.org Artificial Intelligence

Gaze-annotated facial data is crucial for training deep neural networks (DNNs) for gaze estimation. However, obtaining such data is labor-intensive and requires specialized equipment due to the challenge of accurately annotating the gaze direction of a subject. In this work, we present a generative framework to create annotated gaze data by leveraging the benefits of labeled and unlabeled data sources. We propose a Gaze-aware Compositional GAN that learns to generate annotated facial images from a limited labeled dataset. We then transfer this model to an unlabeled data domain to take advantage of the diversity it provides. Experiments demonstrate our approach's effectiveness in generating within-domain image augmentations in the ETH-XGaze dataset and cross-domain augmentations in the CelebAMask-HQ domain for gaze estimation DNN training. We also show additional applications of our work, which include facial image editing and gaze redirection.
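The core idea, generating faces whose gaze annotation is known by construction so they can augment the training set of a gaze-estimation DNN, can be sketched as below. This is a minimal gaze-conditioned generator in PyTorch, not the paper's Compositional GAN architecture: the network shapes, resolution, gaze range, and class names are placeholders chosen for illustration, and the generator here is untrained.

import torch
import torch.nn as nn

IMG_SIZE = 64   # working resolution for this sketch (assumption, not the paper's setting)
LATENT_DIM = 128
GAZE_DIM = 2    # gaze direction as (pitch, yaw) angles


class GazeConditionedGenerator(nn.Module):
    """Toy generator conditioned on a gaze label: noise and the gaze vector are
    concatenated and decoded into an image, so every synthetic face comes with
    a known gaze annotation."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + GAZE_DIM, 256 * 8 * 8),
            nn.ReLU(inplace=True),
            nn.Unflatten(1, (256, 8, 8)),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),  # 8 -> 16
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),   # 16 -> 32
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),     # 32 -> 64
            nn.Tanh(),
        )

    def forward(self, z: torch.Tensor, gaze: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([z, gaze], dim=1))


class GazeEstimator(nn.Module):
    """Small CNN regressor predicting (pitch, yaw) from a face image; stands in
    for the gaze-estimation DNN whose training data is being augmented."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, GAZE_DIM)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))


def synthesize_annotated_batch(generator: GazeConditionedGenerator, batch_size: int):
    """Sample gaze labels, generate matching faces, and return (image, gaze) pairs
    that can be mixed into the real training set."""
    gaze = torch.empty(batch_size, GAZE_DIM).uniform_(-0.5, 0.5)  # angles in radians
    z = torch.randn(batch_size, LATENT_DIM)
    with torch.no_grad():
        images = generator(z, gaze)
    return images, gaze


if __name__ == "__main__":
    generator = GazeConditionedGenerator()   # would be trained adversarially on labeled data
    estimator = GazeEstimator()
    images, gaze = synthesize_annotated_batch(generator, batch_size=8)
    loss = nn.functional.mse_loss(estimator(images), gaze)  # one augmentation step
    print(images.shape, loss.item())

The design point the abstract emphasizes, training the generator on a small labeled set and then transferring it to an unlabeled domain such as CelebAMask-HQ, sits on top of this basic pattern: the synthetic (image, gaze) pairs keep their labels regardless of which domain the generator is adapted to.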