
DIJIT: A Robotic Head for an Active Observer

Tabrizi, Mostafa Kamali, Chi, Mingshi, Dey, Bir Bikram, Yuan, Yu Qing, Solbach, Markus D., Liu, Yiqian, Jenkin, Michael, Tsotsos, John K.

arXiv.org Artificial Intelligence

We present DIJIT, a novel binocular robotic head expressly designed for mobile agents that behave as active observers. DIJIT's unique breadth of functionality enables active vision research and the study of human-like eye and head-neck motions, their interrelationships, and how each contributes to visual ability. DIJIT is also being used to explore the differences between how human vision employs eye/head movements to solve visual tasks and current computer vision methods. DIJIT's design features nine mechanical degrees of freedom, while the cameras and lenses provide an additional four optical degrees of freedom. The ranges and speeds of the mechanical design are comparable to human performance. Our design includes the ranges of motion required for convergent stereo, namely, vergence, version, and cyclotorsion. The exploration of the utility of these to both human and machine vision is ongoing. Here, we present the design of DIJIT and evaluate aspects of its performance. We present a new method for saccadic camera movements. In this method, a direct relationship between camera orientation and motor values is developed. The resulting saccadic camera movements are close to human movements in terms of their accuracy.
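The abstract's saccade method rests on a direct mapping between camera orientation and motor values. As a minimal sketch of that idea (the exact DIJIT calibration is not described here), one can fit an affine map from measured pan/tilt angles to the motor commands that produced them; function names and the two-axis setup are illustrative assumptions.

```python
import numpy as np

def fit_orientation_to_motor(angles, motor_values):
    """Fit an affine map from camera orientation (pan, tilt) to raw motor
    values via least squares, from calibration pairs.

    angles:       (N, 2) array of measured camera orientations
    motor_values: (N, 2) array of the motor commands that produced them
    """
    # Augment with a bias column so the map is affine: m = [pan, tilt, 1] @ A
    X = np.hstack([angles, np.ones((angles.shape[0], 1))])
    A, *_ = np.linalg.lstsq(X, motor_values, rcond=None)
    return A

def saccade_command(A, target_pan, target_tilt):
    """Motor values that should point the camera at the target orientation."""
    return np.array([target_pan, target_tilt, 1.0]) @ A
```

Once fitted, a saccade to a target orientation is a single lookup rather than a visual-feedback loop, which is what makes the movement open-loop and fast, like a human saccade.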


Emergence of Fixational and Saccadic Movements in a Multi-Level Recurrent Attention Model for Vision

Pan, Pengcheng, Shogo, Yonekura, Kuniyoshi, Yasuo

arXiv.org Artificial Intelligence

Inspired by foveal vision, hard attention models promise interpretability and parameter economy. However, existing models such as the Recurrent Model of Visual Attention (RAM) and the Deep Recurrent Attention Model (DRAM) fail to model the hierarchy of the human visual system, which compromises their visual exploration dynamics. As a result, they tend to produce attention patterns that are either overly fixational or excessively saccadic, diverging from human eye movement behavior. In this paper, we propose the Multi-Level Recurrent Attention Model (MRAM), a novel hard attention framework that explicitly models the neural hierarchy of human visual processing. By decoupling glimpse location generation and task execution into two recurrent layers, MRAM exhibits an emergent balance between fixational and saccadic movement. Our results show that MRAM not only achieves more human-like attention dynamics, but also consistently outperforms CNN, RAM and DRAM baselines on standard image classification benchmarks.
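The decoupling the abstract describes can be sketched as two stacked recurrent cells: a lower one that consumes glimpses and proposes the next location, and an upper one that accumulates evidence for the task. This is a structural sketch only, with random untrained weights and hypothetical dimensions, not the published MRAM architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_cell(W, U, b, h, x):
    """Plain tanh recurrent cell."""
    return np.tanh(W @ x + U @ h + b)

class TwoLevelAttention:
    """Two-level recurrent attention loop: the lower layer drives where to
    look next; the upper layer drives what to decide. Weights are random;
    no training is implied."""

    def __init__(self, glimpse_dim=8, hid=16, n_classes=10):
        s = lambda *shape: rng.normal(0.0, 0.1, shape)
        self.Wl, self.Ul, self.bl = s(hid, glimpse_dim), s(hid, hid), np.zeros(hid)
        self.Wu, self.Uu, self.bu = s(hid, hid), s(hid, hid), np.zeros(hid)
        self.W_loc = s(2, hid)          # next (x, y) glimpse location in [-1, 1]
        self.W_cls = s(n_classes, hid)  # task read-out from the upper layer
        self.hid = hid

    def run(self, extract_glimpse, n_steps=6):
        h_lo, h_up = np.zeros(self.hid), np.zeros(self.hid)
        loc, locs = np.zeros(2), []
        for _ in range(n_steps):
            g = extract_glimpse(loc)                        # foveated patch features
            h_lo = rnn_cell(self.Wl, self.Ul, self.bl, h_lo, g)
            h_up = rnn_cell(self.Wu, self.Uu, self.bu, h_up, h_lo)
            loc = np.tanh(self.W_loc @ h_lo)                # lower layer: where next
            locs.append(loc)
        return self.W_cls @ h_up, locs                      # upper layer: the decision
```

Because the location head reads only the lower state, glimpse dynamics can settle into a fixation/saccade rhythm independently of how quickly the upper layer converges on a label.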





A Spatio-Temporal Point Process for Fine-Grained Modeling of Reading Behavior

Re, Francesco Ignazio, Opedal, Andreas, Manaiev, Glib, Giulianelli, Mario, Cotterell, Ryan

arXiv.org Artificial Intelligence

Reading is a process that unfolds across space and time, alternating between fixations where a reader focuses on a specific point in space, and saccades where a reader rapidly shifts their focus to a new point. An ansatz of psycholinguistics is that modeling a reader's fixations and saccades yields insight into their online sentence processing. However, standard approaches to such modeling rely on aggregated eye-tracking measurements and models that impose strong assumptions, ignoring much of the spatio-temporal dynamics that occur during reading. In this paper, we propose a more general probabilistic model of reading behavior, based on a marked spatio-temporal point process, that captures not only how long fixations last, but also where they land in space and when they take place in time. The saccades are modeled using a Hawkes process, which captures how each fixation excites the probability of a new fixation occurring near it in time and space. The duration time of fixation events is modeled as a function of fixation-specific predictors convolved across time, thus capturing spillover effects. Empirically, our Hawkes process model exhibits a better fit to human saccades than baselines. With respect to fixation durations, we observe that incorporating contextual surprisal as a predictor results in only a marginal improvement in the model's predictive accuracy. This finding suggests that surprisal theory struggles to explain fine-grained eye movements.
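The Hawkes process the abstract relies on has a concrete closed form for its conditional intensity, and can be simulated by Ogata's thinning algorithm. A minimal one-dimensional sketch (the paper's model is marked and spatio-temporal; parameter values below are illustrative):

```python
import math
import random

def hawkes_intensity(t, history, mu, alpha, beta):
    """Conditional intensity of a 1-D Hawkes process with exponential kernel:
        lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i))
    Each past fixation onset t_i transiently excites the rate of new events.
    """
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in history if ti < t)

def simulate_hawkes(mu, alpha, beta, t_max, seed=0):
    """Ogata's thinning algorithm: propose candidate times from an upper
    bound on the intensity, then accept with probability lambda/bound."""
    rng = random.Random(seed)
    t, events = 0.0, []
    while t < t_max:
        # Between events the intensity only decays, so its value just after
        # time t (including an event at exactly t) is a valid thinning bound.
        lam_bar = mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events)
        t += rng.expovariate(lam_bar)
        if t < t_max and rng.random() * lam_bar <= hawkes_intensity(t, events, mu, alpha, beta):
            events.append(t)
    return events
```

With alpha/beta < 1 the process is stationary; runs of closely spaced accepted events mimic the clustering of fixations that the excitation term is meant to capture.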


Foveated Instance Segmentation

Zeng, Hongyi, Liu, Wenxuan, Xia, Tianhua, Chen, Jinhui, Li, Ziyun, Zhang, Sai Qian

arXiv.org Artificial Intelligence

Instance segmentation is essential for augmented reality and virtual reality (AR/VR) as it enables precise object recognition and interaction, enhancing the integration of virtual and real-world elements for an immersive experience. However, the high computational overhead of segmentation limits its application on resource-constrained AR/VR devices, causing large processing latency and degrading user experience. In contrast to conventional scenarios, AR/VR users typically focus on only a few regions within their field of view before shifting perspective, allowing segmentation to be concentrated on gaze-specific areas. This insight drives the need for efficient segmentation methods that prioritize processing the instances of interest, reducing computational load and enhancing real-time performance. In this paper, we present a foveated instance segmentation (FovealSeg) framework that leverages real-time user gaze data to perform instance segmentation exclusively on the instances of interest, resulting in substantial computational savings. Evaluation results show that FSNet achieves an IoU of 0.56 on ADE20K and 0.54 on LVIS, notably outperforming the baseline. The code is available at https://github.com/SAI-Lab-NYU/
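The core efficiency idea — run the expensive model only where the user is looking — can be sketched as a gaze-centred crop followed by paste-back into a full-resolution mask. Here `seg_fn` is a placeholder for any patch-level segmentation model, not the paper's FSNet, and the crop size is an assumption.

```python
import numpy as np

def foveated_segment(image, gaze_xy, seg_fn, crop=128):
    """Run segmentation only on a crop centred on the gaze point, then
    paste the result back into a full-resolution mask.

    image:   (H, W, C) frame
    gaze_xy: (x, y) gaze position in pixels
    seg_fn:  callable mapping an image patch to a uint8 per-pixel mask
    """
    h, w = image.shape[:2]
    gx, gy = gaze_xy
    # Clamp the crop so it stays inside the frame near the borders.
    x0 = int(np.clip(gx - crop // 2, 0, max(w - crop, 0)))
    y0 = int(np.clip(gy - crop // 2, 0, max(h - crop, 0)))
    patch = image[y0:y0 + crop, x0:x0 + crop]
    mask = np.zeros((h, w), dtype=np.uint8)
    mask[y0:y0 + patch.shape[0], x0:x0 + patch.shape[1]] = seg_fn(patch)
    return mask
```

For a 1024×1024 frame and a 128×128 crop, the model sees about 1.5% of the pixels per frame, which is where the latency savings come from.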


Meta-Representational Predictive Coding: Biomimetic Self-Supervised Learning

Ororbia, Alexander, Friston, Karl, Rao, Rajesh P. N.

arXiv.org Artificial Intelligence

Self-supervised learning has become an increasingly important paradigm in the domain of machine intelligence. Furthermore, evidence for self-supervised adaptation, such as contrastive formulations, has emerged in recent computational neuroscience and brain-inspired research. Nevertheless, current work on self-supervised learning relies on biologically implausible credit assignment -- in the form of backpropagation of errors -- and feedforward inference, typically a forward-locked pass. Predictive coding, in its mechanistic form, offers a biologically plausible means to sidestep these backprop-specific limitations. However, unsupervised predictive coding rests on learning a generative model of raw pixel input (akin to ``generative AI'' approaches), which entails predicting a potentially high dimensional input; on the other hand, supervised predictive coding, which learns a mapping from inputs to target labels, requires human annotation, and thus incurs the drawbacks of supervised learning. In this work, we present a scheme for self-supervised learning within a neurobiologically plausible framework that appeals to the free energy principle, constructing a new form of predictive coding that we call meta-representational predictive coding (MPC). MPC sidesteps the need for learning a generative model of sensory input (e.g., pixel-level features) by learning to predict representations of sensory input across parallel streams, resulting in an encoder-only learning and inference scheme. This formulation rests on active inference (in the form of sensory glimpsing) to drive the learning of representations, i.e., the representational dynamics are driven by sequences of decisions made by the model to sample informative portions of its sensorium.
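The key contrast with pixel-level generation — predicting in representation space across parallel streams — can be illustrated with a toy update rule. This is a minimal sketch under strong assumptions (fixed random encoders, one learned linear predictor, plain gradient descent), not the paper's free-energy formulation.

```python
import numpy as np

def encoder(W, x):
    """Fixed nonlinear encoder standing in for one sensory stream."""
    return np.tanh(W @ x)

def mpc_step(Wa, Wb, P, xa, xb, lr=0.01):
    """One update of representation-level prediction: stream A's code is
    used to predict stream B's code, so no pixel-level target is ever
    reconstructed. Only the predictor P is updated."""
    za, zb = encoder(Wa, xa), encoder(Wb, xb)
    err = P @ za - zb                 # prediction error in representation space
    P = P - lr * np.outer(err, za)    # gradient step on 0.5 * ||P za - zb||^2
    return P, float(err @ err)
```

The target `zb` lives in a low-dimensional code space rather than pixel space, which is exactly the dimensionality saving the abstract claims for the encoder-only scheme.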


Consumer-grade EEG-based Eye Tracking

Afonso, Tiago Vasconcelos, Heinrichs, Florian

arXiv.org Artificial Intelligence

EEG-based eye tracking (ET) is emerging as a promising application of brain-computer interfaces (BCIs) (Dietrich et al., 2017; Fuhl et al., 2023; Kastrati et al., 2021; Sun et al., 2023). While EEG is typically used to record the electrical activity of the brain, it also captures eye movement artifacts due to the inherent electrical charge of the eyes. Although these signals are usually considered noise in other BCI applications and are often removed (Croft and Barry, 2000), they can be effectively used to track eye movements. These signals are also easier to decode than brain activity, as they are not complicated by the complexity and noise associated with brain signal interpretation. In addition, achieving reliable and accurate eye tracking using EEG technology could significantly enhance existing consumer BCIs, opening up a wide range of new applications. Apart from the potential for BCI applications, EEG-based eye tracking is an interesting alternative to eye tracking in its own right, offering several advantages over camera-based eye tracking, which is the predominant method used for eye tracking today.
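Because the ocular artifact in EEG scales roughly linearly with gaze angle (the corneo-retinal dipole rotating with the eye), a simple calibration can map channel-difference signals to screen coordinates. A hedged sketch — the channel pairing, feature choice, and linear model are illustrative assumptions, not the method of any particular system:

```python
import numpy as np

def fit_eog_gaze(eeg_lr, eeg_ud, gaze_xy):
    """Calibrate a linear map from ocular EEG artifacts to gaze position.

    eeg_lr:  (N,) left-minus-right frontal channel difference (horizontal proxy)
    eeg_ud:  (N,) above-minus-below-eye channel difference (vertical proxy)
    gaze_xy: (N, 2) simultaneously recorded ground-truth gaze coordinates
    """
    X = np.column_stack([eeg_lr, eeg_ud, np.ones(len(eeg_lr))])
    W, *_ = np.linalg.lstsq(X, gaze_xy, rcond=None)
    return W

def predict_gaze(W, eeg_lr, eeg_ud):
    """Apply the calibrated map to new artifact samples."""
    X = np.column_stack([np.atleast_1d(eeg_lr), np.atleast_1d(eeg_ud),
                         np.ones(np.size(eeg_lr))])
    return X @ W
```

In practice drift, blinks, and electrode noise make the mapping less clean than this, which is why the consumer-grade setting in the paper is a genuine challenge.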


Per Subject Complexity in Eye Movement Prediction

Melnyk, Kateryna, Katrychuk, Dmytro, Friedman, Lee, Komogortsev, Oleg

arXiv.org Artificial Intelligence

Eye movement prediction is a promising area of research to compensate for the latency introduced by eye-tracking systems in virtual reality devices. In this study, we comprehensively analyze the complexity of the eye movement prediction task associated with subjects. We use three fundamentally different models within the analysis: the lightweight Long Short-Term Memory network (LSTM), the transformer-based network for multivariate time series representation learning (TST), and the Oculomotor Plant Mathematical Model wrapped in the Kalman Filter framework (OPKF). Each solution is assessed following a sample-to-event evaluation strategy and employing the new event-to-subject metrics. Our results show that the different models exhibit similar per-subject trends in prediction performance. We refer to these outcomes as per-subject complexity, since some subjects' data pose a more significant challenge for models. Along with the detailed correlation analysis, this report investigates the source of the per-subject complexity and discusses potential solutions to overcome it.
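The step from sample-level errors to subject-level scores can be sketched as a simple aggregation. This is an illustrative stand-in for the paper's event-to-subject metrics, not their exact definition.

```python
import numpy as np
from collections import defaultdict

def per_subject_error(pred, target, subject_ids):
    """Aggregate sample-level prediction error into per-subject scores.

    pred, target: (N, 2) predicted and true gaze positions
    subject_ids:  length-N sequence of subject labels
    Returns {subject: mean Euclidean error}.
    """
    errs = np.linalg.norm(np.asarray(pred, float) - np.asarray(target, float), axis=1)
    acc = defaultdict(list)
    for sid, e in zip(subject_ids, errs):
        acc[sid].append(e)
    return {sid: float(np.mean(v)) for sid, v in acc.items()}
```

Ranking subjects by such a score across several models is one way to check whether "hard" subjects are hard for every architecture, which is the per-subject complexity the study reports.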