Hybrid Vision Servoing with Deep Alignment and GRU-Based Occlusion Recovery

Jee Won Lee, Hansol Lim, Sooyeun Yang, Jongseong Brad Choi

arXiv.org Artificial Intelligence 

Traditional robotic controllers have long relied on proprioceptive sensors such as joint encoders, inertial measurement units, and force-torque sensors to estimate position and motion, but these often suffer from drift, calibration errors, and limited environmental awareness [1]. Image-based visual servoing (IBVS) has therefore been widely adopted for high-precision robotic assembly, aerial vehicle stabilization, and minimally invasive surgery, where direct visual feedback can compensate for model uncertainties and encoder inaccuracies [2][3]. In these closed-loop systems, perception must deliver sub-pixel localization accuracy at control rates above 30 Hz while tolerating partial or full occlusions, illumination shifts, and motion blur to maintain loop stability and precision [4]. Even millimeter-level tracking errors can accumulate into significant actuation drift, undermining safety and performance in sub-millimeter surgical targeting and centimeter-scale drone landing [5][6].

Early IBVS methods emerged in the early 1990s to simplify robot control by directly mapping image features to velocity commands, establishing the foundation for image-space loop closure [2]. Handcrafted detectors such as SIFT [7], which identifies scale-invariant keypoints, SURF [8], which accelerates detection using integral images, and ORB [9], which offers an efficient binary alternative, were paired with RANSAC [10] to filter out mismatches. However, these sparse approaches struggled when keypoints were lost to occlusion or blur. To achieve denser alignment, the Lucas-Kanade algorithm was introduced to iteratively minimize photometric error over image patches and enable smooth sub-pixel registration [11].
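To make the image-space loop closure concrete, the following is a minimal NumPy sketch of the classic IBVS point-feature control law, v_c = -lambda * L^+ (s - s*), where L is the interaction matrix and L^+ its Moore-Penrose pseudoinverse. The function names, gain, and known-depth assumption are illustrative and do not reproduce the controller proposed in this work.

```python
import numpy as np

def interaction_matrix(x, y, Z):
    """Interaction matrix for one normalized image point (x, y) at depth Z
    (classical IBVS formulation; depth is assumed known here)."""
    return np.array([
        [-1.0 / Z, 0.0,       x / Z, x * y,           -(1.0 + x * x), y],
        [0.0,      -1.0 / Z,  y / Z, 1.0 + y * y,     -x * y,        -x],
    ])

def ibvs_velocity(features, desired, depths, lam=0.5):
    """Camera twist v_c = -lam * L^+ (s - s*) for stacked point features."""
    L = np.vstack([interaction_matrix(x, y, Z)
                   for (x, y), Z in zip(features, depths)])
    error = (np.asarray(features) - np.asarray(desired)).ravel()
    return -lam * np.linalg.pinv(L) @ error  # 6-DoF velocity command
```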
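As an illustration of the handcrafted-feature pipeline described above, the OpenCV sketch below detects ORB keypoints, matches their binary descriptors, and uses RANSAC to reject mismatches while fitting a homography. File paths and parameter values are placeholders.

```python
import cv2
import numpy as np

# Load a reference template and the current camera frame (placeholder paths).
ref = cv2.imread("reference.png", cv2.IMREAD_GRAYSCALE)
cur = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

# Detect ORB keypoints and compute binary descriptors in both images.
orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(ref, None)
kp2, des2 = orb.detectAndCompute(cur, None)

# Match binary descriptors with Hamming distance, best matches first.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

# Fit a homography with RANSAC; the mask flags geometrically consistent inliers.
src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC,
                                    ransacReprojThreshold=3.0)
```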
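For the dense-alignment alternative, OpenCV's pyramidal Lucas-Kanade routine illustrates the iterative photometric minimization that yields sub-pixel registration. Window size, pyramid depth, and termination criteria here are illustrative defaults.

```python
import cv2

prev = cv2.imread("frame_t0.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)

# Select corners to track in the first frame.
pts0 = cv2.goodFeaturesToTrack(prev, maxCorners=200,
                               qualityLevel=0.01, minDistance=7)

# Pyramidal Lucas-Kanade: iteratively minimizes photometric error over a
# window around each point, producing sub-pixel displacements.
pts1, status, err = cv2.calcOpticalFlowPyrLK(
    prev, curr, pts0, None,
    winSize=(21, 21), maxLevel=3,
    criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01),
)
tracked = pts1[status.ravel() == 1]  # keep only successfully tracked points
```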