Zero-Splat TeleAssist: A Zero-Shot Pose Estimation Framework for Semantic Teleoperation

Dokania, Srijan, Raghavan, Dharini

arXiv.org Artificial Intelligence 

Abstract--We introduce Zero-Splat TeleAssist, a zero-shot sensor-fusion pipeline that transforms commodity CCTV streams into a shared, 6-DoF world model for multilateral teleoperation. By integrating vision-language segmentation, monocular depth estimation, weighted-PCA pose extraction, and 3-D Gaussian Splatting (3DGS), TeleAssist provides every operator with real-time global positions and orientations of multiple robots, without fiducials or depth sensors, in an interaction-centric teleoperation setting. Teleoperating robots in complex or remote environments is challenging due to limited on-board perception, occlusions, and operator cognitive load. Traditional teleoperation relies on the robot's own sensors (cameras, LiDAR, IMU), which often suffer from narrow fields of view, occlusions, and cumulative drift; together these increase the cognitive load on human operators, who must maintain situational awareness. Meanwhile, external camera infrastructure (e.g., CCTV) has the potential to provide complementary visual coverage and global context, but conventional solutions rely heavily on visual fiducials, such as AprilTags or ArUco markers [5], or on motion-capture systems that require controlled lighting and calibration.
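
The abstract names weighted-PCA pose extraction without giving its details. As a rough illustration only, and not the authors' implementation, the sketch below assumes that a robot's segmented pixels have already been back-projected to 3-D points using the monocular depth estimate, and that each point carries a confidence weight. The function name `weighted_pca_pose`, the weighting scheme, and the use of power iteration are all assumptions made for this example. It recovers a position (the weighted centroid) and a dominant body axis only; a full 6-DoF orientation would also need the remaining principal axes.

```python
import math

def weighted_pca_pose(points, weights, iters=100):
    """Hypothetical helper: weighted centroid and dominant axis of 3-D points."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    # Weighted centroid, used here as the position estimate.
    c = [sum(w * p[i] for p, w in zip(points, weights)) / total
         for i in range(3)]

    # Weighted 3x3 covariance of the points about the centroid.
    cov = [[sum(w * (p[i] - c[i]) * (p[j] - c[j])
                for p, w in zip(points, weights)) / total
            for j in range(3)]
           for i in range(3)]

    # Power iteration for the dominant eigenvector (first principal axis).
    # Assumption: the start vector is not orthogonal to that axis.
    v = [1.0, 0.7, 0.3]
    for _ in range(iters):
        u = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in u))
        if norm < 1e-12:
            raise ValueError("degenerate point set: no dominant axis")
        v = [x / norm for x in u]

    # A principal axis has no inherent sign; make its largest component positive.
    if max(v, key=abs) < 0:
        v = [-x for x in v]
    return c, v

# Toy input: points spread along the x = y diagonal of the ground plane.
pts = [(float(t), float(t), 0.0) for t in range(-5, 6)]
centroid, axis = weighted_pca_pose(pts, [1.0] * len(pts))
yaw_deg = math.degrees(math.atan2(axis[1], axis[0]))  # heading in the x-y plane
```

Because PCA leaves the sign of each axis undetermined, a real system would need an extra cue (for example motion direction or a semantic front/back label) to resolve which way the robot is facing; the sign convention above is only a placeholder.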