skoltech
VLH: Vision-Language-Haptics Foundation Model
Fuentes, Luis Francisco Moreno, Khan, Muhammad Haris, Cabrera, Miguel Altamirano, Serpiva, Valerii, Iarchuk, Dmitri, Mahmoud, Yara, Tokmurziyev, Issatay, Tsetserukou, Dzmitry
We present VLH, a novel Visual-Language-Haptic Foundation Model that unifies perception, language, and tactile feedback in aerial robotics and virtual reality. Unlike prior work that treats haptics as a secondary, reactive channel, VLH synthesizes mid-air force and vibration cues as a direct consequence of contextual visual understanding and natural language commands. Our platform comprises an 8-inch quadcopter equipped with dual inverse five-bar linkage arrays for localized haptic actuation, an egocentric VR camera, and an exocentric top-down view. Visual inputs and language instructions are processed by a fine-tuned OpenVLA backbone - adapted via LoRA on a bespoke dataset of 450 multimodal scenarios - to output a 7-dimensional action vector (Vx, Vy, Vz, Hx, Hy, Hz, Hv). INT8 quantization and a high-performance server ensure real-time operation at 4-5 Hz. In human-robot interaction experiments (90 flights), VLH achieved a 56.7% success rate for target acquisition (mean reach time 21.3 s, pose error 0.24 m) and 100% accuracy in texture discrimination. Generalization tests yielded 70.0% (visual), 54.4% (motion), 40.0% (physical), and 35.0% (semantic) performance on novel tasks. These results demonstrate VLH's ability to co-evolve haptic feedback with perceptual reasoning and intent, advancing expressive, immersive human-robot interactions.
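The 7-dimensional action vector named in the abstract can be pictured as a small typed record. The sketch below is illustrative only: the class name `VLHAction`, its helper methods, and the reading of the first three components as drone velocity and the last four as haptic cues are assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class VLHAction:
    """Hypothetical container for one (Vx, Vy, Vz, Hx, Hy, Hz, Hv) output.

    Assumption: V* are velocity commands and H* are haptic cues;
    the abstract lists only the component names.
    """
    vx: float
    vy: float
    vz: float
    hx: float
    hy: float
    hz: float
    hv: float

    @classmethod
    def from_vector(cls, values: Sequence[float]) -> "VLHAction":
        # Reject anything that is not exactly seven numbers.
        if len(values) != 7:
            raise ValueError(f"expected 7 values, got {len(values)}")
        return cls(*(float(v) for v in values))

    def velocity(self) -> Tuple[float, float, float]:
        return (self.vx, self.vy, self.vz)

    def haptics(self) -> Tuple[float, float, float, float]:
        return (self.hx, self.hy, self.hz, self.hv)
```

Splitting the vector this way would let a flight controller and a haptic driver each consume only their own components.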
Quadrupedal Robot Skateboard Mounting via Reverse Curriculum Learning
Belov, Danil, Erkhov, Artem, Pestova, Elizaveta, Osokin, Ilya, Tsetserukou, Dzmitry, Osinenko, Pavel
The aim of this work is to enable quadrupedal robots to mount skateboards using Reverse Curriculum Reinforcement Learning. Although prior work has demonstrated skateboarding for quadrupeds that are already positioned on the board, the initial mounting phase still poses a significant challenge. A goal-oriented methodology was adopted, beginning with the terminal phases of the task and progressively increasing the complexity of the problem definition to approximate the desired objective. The learning process was initiated with the skateboard rigidly fixed within the global coordinate frame and the robot positioned directly above it. Through gradual relaxation of these initial conditions, the learned policy demonstrated robustness to variations in skateboard position and orientation, ultimately exhibiting a successful transfer to scenarios involving a mobile skateboard. Legged robot locomotion has a number of advantages over the other motion types.
UAV-VLPA*: A Vision-Language-Path-Action System for Optimal Route Generation on a Large Scale
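The reverse-curriculum idea described above (start at the easiest initial condition, then relax it) can be sketched as a staged schedule. Everything concrete here is an assumption made for illustration: the `Stage` fields, the numeric ranges, the 0.8 promotion threshold, and the `next_stage` helper are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    max_offset_m: float   # hypothetical: max initial robot-to-board distance
    max_yaw_rad: float    # hypothetical: max initial board yaw offset
    board_fixed: bool     # True while the board is pinned in the world frame


# Easiest stage first: robot directly above a fixed board.
# The last stage releases the board so it can move.
STAGES = [
    Stage(0.0, 0.0, True),
    Stage(0.3, 0.5, True),
    Stage(1.0, 3.14, True),
    Stage(1.0, 3.14, False),
]


def next_stage(index: int, success_rate: float, threshold: float = 0.8) -> int:
    """Advance to a harder stage once the policy succeeds often enough."""
    if success_rate >= threshold and index < len(STAGES) - 1:
        return index + 1
    return index
```

A training loop would sample initial conditions from `STAGES[index]`, measure the success rate over a batch of episodes, and call `next_stage` to decide whether to relax the conditions further.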
Sautenkov, Oleg, Akhmetkazy, Aibek, Yaqoot, Yasheerah, Mustafa, Muhammad Ahsan, Tadevosyan, Grik, Lykov, Artem, Tsetserukou, Dzmitry
The UAV-VLPA* (Visual-Language-Planning-and-Action) system represents a cutting-edge advancement in aerial robotics, designed to enhance communication and operational efficiency for unmanned aerial vehicles (UAVs). By integrating advanced planning capabilities, the system addresses the Traveling Salesman Problem (TSP) to optimize flight paths, reducing the total trajectory length by 18.5% compared to traditional methods. Additionally, the incorporation of the A* algorithm enables robust obstacle avoidance, ensuring safe and efficient navigation in complex environments. The system leverages satellite imagery processing combined with the Visual Language Model (VLM) and GPT's natural language processing capabilities, allowing users to generate detailed flight plans through simple text commands. This seamless fusion of visual and linguistic analysis empowers precise decision-making and mission planning, making UAV-VLPA* a transformative tool for modern aerial operations. With its unmatched operational efficiency, navigational safety, and user-friendly functionality, UAV-VLPA* sets a new standard in autonomous aerial robotics, paving the way for future innovations in the field.
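For readers unfamiliar with the A* algorithm mentioned above, the following is a minimal textbook sketch on a 4-connected occupancy grid. It is not the system's implementation: the grid representation, unit step cost, and Manhattan heuristic are assumptions chosen to keep the example short.

```python
import heapq


def astar(grid, start, goal):
    """Shortest 4-connected path on a grid of 0 (free) / 1 (obstacle) cells.

    Returns the list of (row, col) cells from start to goal, or None
    if the goal cannot be reached.
    """
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance: admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]  # entries are (f = g + h, g, cell)
    came_from = {}
    best_cost = {start: 0}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            # Walk the parent links back to the start.
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale heap entry, a cheaper route was found already
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    came_from[nxt] = cell
                    heapq.heappush(open_heap, (new_cost + h(nxt), new_cost, nxt))
    return None
```

In a route planner, a search like this would typically run between consecutive waypoints of the TSP ordering, so that each leg of the tour detours around obstacles.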
GazeGrasp: DNN-Driven Robotic Grasping with Wearable Eye-Gaze Interface
Tokmurziyev, Issatay, Cabrera, Miguel Altamirano, Moreno, Luis, Khan, Muhammad Haris, Tsetserukou, Dzmitry
We present GazeGrasp, a gaze-based manipulation system enabling individuals with motor impairments to control collaborative robots using eye-gaze. The system employs an ESP32 CAM for eye tracking, MediaPipe for gaze detection, and YOLOv8 for object localization, integrated with a Universal Robots UR10 for manipulation tasks. After user-specific calibration, the system allows intuitive object selection with a magnetic snapping effect and robot control via eye gestures. Experimental evaluation involving 13 participants demonstrated that the magnetic snapping effect significantly reduced gaze alignment time, improving task efficiency by 31%. GazeGrasp provides a robust, hands-free interface for assistive robotics, enhancing accessibility and autonomy for users.
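One plausible reading of the "magnetic snapping effect" is that the gaze cursor jumps to the nearest detected object when it comes close enough. The sketch below illustrates that reading only; the function name `snap_gaze`, the pixel-coordinate inputs, and the fixed snapping radius are assumptions, and the abstract does not describe the actual rule.

```python
import math


def snap_gaze(gaze, centers, radius):
    """Snap a gaze point to the nearest object center within `radius`.

    gaze:    (x, y) gaze estimate in image pixels
    centers: list of (x, y) object centers, e.g. from a detector
    radius:  snapping distance in pixels (hypothetical parameter)
    Returns the nearest center if it is within radius, else the raw gaze.
    """
    if not centers:
        return gaze
    nearest = min(centers, key=lambda c: math.dist(gaze, c))
    return nearest if math.dist(gaze, nearest) <= radius else gaze
```

With a rule like this, small gaze jitter around an object no longer moves the selection, which is one way alignment time could drop.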
SafeSwarm: Decentralized Safe RL for the Swarm of Drones Landing in Dense Crowds
Tadevosyan, Grik, Osipenko, Maksim, Aschu, Demetros, Fedoseev, Aleksey, Serpiva, Valerii, Sautenkov, Oleg, Karaf, Sausar, Tsetserukou, Dzmitry
This paper introduces a safe swarm of drones capable of performing landings in crowded environments robustly by relying on Reinforcement Learning techniques combined with Safe Learning. The developed system allows us to teach the swarm of drones with different dynamics to land on moving landing pads in an environment while avoiding collisions with obstacles and between agents. The safe barrier net algorithm was developed and evaluated using a swarm of Crazyflie 2.1 micro quadrotors, which were tested indoors with the Vicon motion capture system to ensure precise localization and control. Experimental results show that our system achieves landing accuracy of 2.25 cm with a mean time of 17 s and collision-free landings, underscoring its effectiveness and robustness in real-world scenarios. This work offers a promising foundation for applications in environments where safety and precision are paramount.
ViewVR: Visual Feedback Modes to Achieve Quality of VR-based Telemanipulation
Erkhov, A., Bazhenov, A., Satsevich, S., Belov, D., Khabibullin, F., Egorov, S., Gromakov, M., Cabrera, M. Altamirano, Tsetserukou, D.
The paper focuses on an immersive teleoperation system that enhances the operator's ability to actively perceive the robot's surroundings. A consumer-grade HTC Vive VR system was used to synchronize the operator's hand and head movements with a UR3 robot and a custom-built robotic head with two degrees of freedom (2-DoF). The system's usability, manipulation efficiency, and intuitiveness of control were evaluated in comparison with static head camera positioning across three distinct tasks. Teleoperation plays a pivotal role in robotics by enabling efficient data collection for learning from demonstrations. The quality of collected data heavily depends on the operator's ability to intuitively control the system and receive adaptive visual feedback.
Shake-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Manipulations and Liquid Mixing
Khan, Muhammad Haris, Asfaw, Selamawit, Iarchuk, Dmitrii, Cabrera, Miguel Altamirano, Moreno, Luis, Tokmurziyev, Issatay, Tsetserukou, Dzmitry
This paper introduces Shake-VLA, a Vision-Language-Action (VLA) model-based system designed to enable bimanual robotic manipulation for automated cocktail preparation. The system integrates a vision module for detecting ingredient bottles and reading labels, a speech-to-text module for interpreting user commands, and a language model to generate task-specific robotic instructions. Force-torque (FT) sensors are employed to precisely measure the quantity of liquid poured, ensuring accuracy in ingredient proportions during the mixing process. The system architecture includes a Retrieval-Augmented Generation (RAG) module for accessing and adapting recipes, an anomaly detection mechanism to address ingredient availability issues, and bimanual robotic arms for dexterous manipulation. Experimental evaluations demonstrated a high success rate across system components: the speech-to-text module achieved a 93% success rate in noisy environments, the vision module attained a 91% success rate in object and label detection in cluttered environments, the anomaly module identified 95% of discrepancies between detected ingredients and recipe requirements, and the system achieved an overall success rate of 100% in preparing cocktails, from recipe formulation to action generation.
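The abstract does not say how the force-torque readings are turned into a poured quantity. One simple possibility, sketched here purely as an assumption, is to convert the drop in measured bottle weight into volume using the liquid's density. The function name, the default density of water, and the use of standard gravity are all illustrative choices.

```python
STANDARD_GRAVITY = 9.80665  # m/s^2


def poured_volume_ml(force_drop_n: float, density_kg_m3: float = 1000.0) -> float:
    """Estimate poured volume from the decrease in measured weight.

    force_drop_n:  reduction in vertical force on the sensor, in newtons
    density_kg_m3: liquid density (assumed; 1000 kg/m^3 is water)
    """
    if force_drop_n < 0 or density_kg_m3 <= 0:
        raise ValueError("force drop must be >= 0 and density must be > 0")
    mass_kg = force_drop_n / STANDARD_GRAVITY
    volume_m3 = mass_kg / density_kg_m3
    return volume_m3 * 1e6  # 1 m^3 = 1e6 mL
```

For example, a weight drop of about 0.98 N on a water-like liquid corresponds to roughly 100 mL poured.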
DogSurf: Quadruped Robot Capable of GRU-based Surface Recognition for Blind Person Navigation
Bazhenov, Artem, Berman, Vladimir, Satsevich, Sergei, Shalopanova, Olga, Cabrera, Miguel Altamirano, Lykov, Artem, Tsetserukou, Dzmitry
This paper introduces DogSurf, a new approach to using quadruped robots to help visually impaired people navigate in the real world. The presented method allows the quadruped robot to detect slippery surfaces and to use audio and haptic feedback to inform the user when to stop. A state-of-the-art GRU-based neural network architecture with a mean accuracy of 99.925% was proposed for the task of multiclass surface classification for quadruped robots. A dataset was collected on a Unitree Go1 Edu robot. The dataset and code have been posted to the public domain.
CognitiveDog: Large Multimodal Model Based System to Translate Vision and Language into Action of Quadruped Robot
Lykov, Artem, Litvinov, Mikhail, Konenkov, Mikhail, Prochii, Rinat, Burtsev, Nikita, Abdulkarim, Ali Alridha, Bazhenov, Artem, Berman, Vladimir, Tsetserukou, Dzmitry
This paper introduces CognitiveDog, a pioneering quadruped robot with a Large Multi-modal Model (LMM) that is capable of not only communicating with humans verbally but also physically interacting with the environment through object manipulation. The system was realized on a Unitree Go1 robot dog equipped with a custom gripper and demonstrated autonomous decision-making capabilities, independently determining the most appropriate actions and interactions with various objects to fulfill user-defined tasks. These tasks do not necessarily include direct instructions, challenging the robot to comprehend and execute them based on natural language input and environmental cues. The paper delves into the intricacies of this system, dataset characteristics, and the software architecture. Key to this development is the robot's proficiency in navigating space using Visual-SLAM, effectively manipulating and transporting objects, and providing insightful natural language commentary during task execution. Experimental results highlight the robot's advanced task comprehension and adaptability, underscoring its potential in real-world applications. The dataset used to fine-tune the robot-dog behavior generation model is provided at the following link: huggingface.co/datasets/ArtemLykov/CognitiveDog_dataset
HyperDog: An Open-Source Quadruped Robot Platform Based on ROS2 and micro-ROS
Mudalige, Nipun Dhananjaya Weerakkodi, Zhura, Iana, Babataev, Ildar, Nazarova, Elena, Fedoseev, Aleksey, Tsetserukou, Dzmitry
Nowadays, the design and development of legged quadruped robots is an active area of scientific research. Legged robots have become popular because they adapt to harsh terrains and diverse environmental conditions better than other mobile robots. With the higher demand for legged robot experiments, more researchers and engineers need an affordable and quick way to develop locomotion algorithms. In this paper, we present a new open-source quadruped robot platform, HyperDog, which features 12 RC servo motors, an onboard NVIDIA Jetson Nano computer, and an STM32F4 Discovery board. HyperDog is an open-source platform for quadruped robotic software development based on Robot Operating System 2 (ROS2) and micro-ROS. Moreover, HyperDog is a quadrupedal robotic dog built entirely from 3D-printed parts and carbon fiber, which makes the robot lightweight and strong. The idea of this work is to demonstrate an affordable and customizable way of robot development and to provide researchers and engineers with a legged robot platform on which different algorithms can be tested and validated in simulation and in a real environment. The developed project with code is available on GitHub (https://github.com/NDHANA94/hyperdog_ros2).