SpikeGrasp: A Benchmark for 6-DoF Grasp Pose Detection from Stereo Spike Streams

Gao, Zhuoheng; Zhang, Jiyao; Xie, Zhiyong; Dong, Hao; Yu, Zhaofei; Chen, Rongmei; Chen, Guozhang; Huang, Tiejun

arXiv.org Artificial Intelligence 

Most robotic grasping systems rely on converting sensor data into explicit 3D point clouds, a computational step not found in biological intelligence. This paper explores a fundamentally different, neuro-inspired paradigm for 6-DoF grasp detection. We introduce SpikeGrasp, a framework that mimics the biological visuomotor pathway: it processes raw, asynchronous events from stereo spike cameras, which play a role similar to retinas, and infers grasp poses directly. Our model fuses these stereo spike streams and uses a recurrent spiking neural network, analogous to high-level visual processing, to iteratively refine grasp hypotheses without ever reconstructing a point cloud. To validate this approach, we built a large-scale synthetic benchmark dataset. Experiments show that SpikeGrasp surpasses traditional point-cloud-based baselines, especially in cluttered and textureless scenes, and demonstrates remarkable data efficiency. By establishing the viability of this end-to-end, neuro-inspired approach, SpikeGrasp paves the way for future systems capable of the fluid and efficient manipulation seen in nature, particularly for dynamic objects.

The ability to pick up an arbitrary object is a fundamental measure of intelligence for an autonomous robot. The prevailing approach to this grasp detection problem follows a distinct geometry-first pipeline: capture a scene with sensors, reconstruct a 3D geometric model (typically a point cloud), and then analyze this model for a viable grasp (Fang et al., 2020; Gui et al., 2025). This paradigm is logical from a computer graphics perspective, but it departs significantly from how biological systems operate. The brain does not compute or store explicit point clouds to decide how to grasp a coffee cup (Cao et al., 2025); instead, it leverages a continuous stream of sensory information processed through a highly efficient neural architecture.