On Unifying Video Generation and Camera Pose Estimation