DMAGaze: Gaze Estimation Based on Feature Disentanglement and Multi-Scale Attention

Haohan Chen, Hongjia Liu, Shiyong Lan, Wenwu Wang, Yixin Qiao, Yao Li, Guonan Deng

arXiv.org Artificial Intelligence 

Gaze estimation, which predicts gaze direction, commonly faces the challenge of interference from complex gaze-irrelevant information in face images. In this work, we propose DMAGaze, a novel gaze estimation framework that exploits information from facial images in three aspects: gaze-relevant global features (disentangled from the facial image), local eye features (extracted from cropped eye patches), and head pose estimation features, to improve overall performance. Furthermore, we introduce a new cascaded attention module named the Multi-Scale Global Local Attention Module (MS-GLAM). Through a customized cascaded attention structure, it effectively focuses on global and local information at multiple scales, further enhancing the information from the Disentangler. Finally, the global gaze-relevant features disentangled by the upper face branch, combined with head pose and local eye features, are passed through the detection head for high-precision gaze estimation. Our proposed DMAGaze has been extensively validated on two mainstream public datasets, achieving state-of-the-art performance.

Keywords: gaze estimation, feature disentanglement, Gaussian similarity, multi-scale attention

1. Introduction

Gaze estimation, the task of predicting gaze direction, is crucial for measuring human attention and is widely applied in areas such as saliency detection [1, 2], virtual reality [3], driver distraction monitoring [4], human-computer interaction [5] and autism diagnosis [6]. Recently, gaze estimation has shifted from model-based methods to appearance-based methods.
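The three-stream fusion described in the abstract can be sketched as follows. This is a minimal illustration only: the feature dimensions, the weight initialization, and the single linear detection head are assumptions for the sketch, not the paper's actual architecture (which uses a Disentangler and the MS-GLAM attention module to produce these features).

```python
import numpy as np

# Hypothetical feature dimensions -- not specified in this section of the paper.
D_GLOBAL, D_EYE, D_POSE = 128, 64, 16

def detection_head(global_feat, eye_feat, pose_feat, W, b):
    """Fuse disentangled global facial features, local eye features, and
    head pose features, then regress a 2D gaze direction (pitch, yaw)."""
    fused = np.concatenate([global_feat, eye_feat, pose_feat])
    return W @ fused + b

rng = np.random.default_rng(0)
# Stand-in linear head; in practice this would be a learned MLP.
W = rng.standard_normal((2, D_GLOBAL + D_EYE + D_POSE)) * 0.01
b = np.zeros(2)

gaze = detection_head(rng.standard_normal(D_GLOBAL),
                      rng.standard_normal(D_EYE),
                      rng.standard_normal(D_POSE),
                      W, b)
print(gaze.shape)
```

The key design point conveyed by the abstract is that gaze-irrelevant facial information is removed before fusion, so the global stream contributes only gaze-relevant cues alongside the eye and head-pose streams.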