Cabrera, Miguel Altamirano
Evolution 6.0: Evolving Robotic Capabilities Through Generative Design
Khan, Muhammad Haris, Myshlyaev, Artyom, Lykov, Artem, Cabrera, Miguel Altamirano, Tsetserukou, Dzmitry
We propose a new concept, Evolution 6.0, which represents the evolution of robotics driven by Generative AI. When a robot lacks the necessary tools to accomplish a task requested by a human, it autonomously designs the required instruments and learns how to use them to achieve the goal. Evolution 6.0 is an autonomous robotic system powered by Vision-Language Models (VLMs), Vision-Language Action (VLA) models, and Text-to-3D generative models for tool design and task execution. The system comprises two key modules: the Tool Generation Module, which fabricates task-specific tools from visual and textual data, and the Action Generation Module, which converts natural language instructions into robotic actions. It integrates QwenVLM for environmental understanding, OpenVLA for task execution, and Llama-Mesh for 3D tool generation. Evaluation results demonstrate a 90% success rate for tool generation with a 10-second inference time, while action generation achieves 83.5% in physical and visual generalization, 70% in motion generalization, and 37% in semantic generalization. Future improvements will focus on bimanual manipulation, expanded task capabilities, and enhanced environmental interpretation to improve real-world adaptability.
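A minimal sketch of the two-module flow described in the abstract, with hypothetical wrapper functions standing in for the QwenVLM, Llama-Mesh, and OpenVLA calls (the actual interfaces differ):

from dataclasses import dataclass

@dataclass
class ToolSpec:
    name: str
    mesh_path: str  # path to the generated 3D mesh for fabrication

def describe_scene(image_path: str, request: str) -> str:
    """Stand-in for the VLM call that checks whether a suitable tool exists."""
    return f"no suitable tool found for: {request}"

def generate_tool(scene_description: str) -> ToolSpec:
    """Stand-in for the Text-to-3D step (Llama-Mesh in the paper)."""
    return ToolSpec(name="custom_scraper", mesh_path="tool.obj")

def generate_actions(instruction: str, tool: ToolSpec) -> list[str]:
    """Stand-in for the VLA policy (OpenVLA in the paper)."""
    return [f"grasp {tool.name}", f"use {tool.name} to {instruction}"]

if __name__ == "__main__":
    request = "scrape the residue off the pan"
    scene = describe_scene("workspace.jpg", request)
    tool = generate_tool(scene)                # Tool Generation Module
    actions = generate_actions(request, tool)  # Action Generation Module
    print(actions)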
GazeGrasp: DNN-Driven Robotic Grasping with Wearable Eye-Gaze Interface
Tokmurziyev, Issatay, Cabrera, Miguel Altamirano, Moreno, Luis, Khan, Muhammad Haris, Tsetserukou, Dzmitry
We present GazeGrasp, a gaze-based manipulation system enabling individuals with motor impairments to control collaborative robots using eye-gaze. The system employs an ESP32 CAM for eye tracking, MediaPipe for gaze detection, and YOLOv8 for object localization, integrated with a Universal Robot UR10 for manipulation tasks. After user-specific calibration, the system allows intuitive object selection with a magnetic snapping effect and robot control via eye gestures. Experimental evaluation involving 13 participants demonstrated that the magnetic snapping effect significantly reduced gaze alignment time, improving task efficiency by 31%. GazeGrasp provides a robust, hands-free interface for assistive robotics, enhancing accessibility and autonomy for users.
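An illustrative sketch of one way a "magnetic snapping" step could work: the raw gaze point is snapped to the nearest detected object centre within a pixel radius. The radius and function names are assumptions, not the paper's implementation:

import math

def snap_gaze(gaze_xy, object_centres, radius_px=60.0):
    """Return the object centre closest to the gaze point, or the raw gaze
    point itself if nothing lies within the snapping radius."""
    best, best_d = None, radius_px
    for cx, cy in object_centres:
        d = math.hypot(gaze_xy[0] - cx, gaze_xy[1] - cy)
        if d < best_d:
            best, best_d = (cx, cy), d
    return best if best is not None else gaze_xy

if __name__ == "__main__":
    detections = [(320, 240), (500, 260)]      # e.g. YOLOv8 box centres
    print(snap_gaze((335, 252), detections))   # snaps to (320, 240)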
Shake-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Manipulations and Liquid Mixing
Khan, Muhammad Haris, Asfaw, Selamawit, Iarchuk, Dmitrii, Cabrera, Miguel Altamirano, Moreno, Luis, Tokmurziyev, Issatay, Tsetserukou, Dzmitry
This paper introduces Shake-VLA, a Vision-Language-Action (VLA) model-based system designed to enable bimanual robotic manipulation for automated cocktail preparation. The system integrates a vision module for detecting ingredient bottles and reading labels, a speech-to-text module for interpreting user commands, and a language model to generate task-specific robotic instructions. Force Torque (FT) sensors are employed to precisely measure the quantity of liquid poured, ensuring accuracy in ingredient proportions during the mixing process. The system architecture includes a Retrieval-Augmented Generation (RAG) module for accessing and adapting recipes, an anomaly detection mechanism to address ingredient availability issues, and bimanual robotic arms for dexterous manipulation. Experimental evaluations demonstrated a high success rate across system components, with the speech-to-text module achieving a 93% success rate in noisy environments, the vision module attaining a 91% success rate in object and label detection in a cluttered environment, the anomaly module identifying 95% of discrepancies between detected ingredients and recipe requirements, and the system achieving an overall success rate of 100% in preparing cocktails, from recipe formulation to action generation.
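A minimal sketch of the anomaly check described above: ingredients detected by the vision module are compared against the recipe retrieved by the RAG module and anything missing is flagged. The function and variable names are illustrative, not Shake-VLA's interface:

def find_missing(recipe_ingredients, detected_labels):
    """Return recipe ingredients that were not found among detected labels."""
    recipe = {i.lower() for i in recipe_ingredients}
    detected = {d.lower() for d in detected_labels}
    return sorted(recipe - detected)

if __name__ == "__main__":
    recipe = ["white rum", "lime juice", "sugar syrup", "soda water"]
    seen = ["Lime Juice", "Soda Water", "White Rum"]
    print(find_missing(recipe, seen))   # ['sugar syrup'] -> report an anomaly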
UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission Generation
Sautenkov, Oleg, Yaqoot, Yasheerah, Lykov, Artem, Mustafa, Muhammad Ahsan, Tadevosyan, Grik, Akhmetkazy, Aibek, Cabrera, Miguel Altamirano, Martynov, Mikhail, Karaf, Sausar, Tsetserukou, Dzmitry
The UAV-VLA (Visual-Language-Action) system is a tool designed to facilitate communication with aerial robots. By integrating satellite imagery processing with a Visual Language Model (VLM) and the powerful capabilities of GPT, UAV-VLA enables users to generate general flight-path-and-action plans through simple text requests. This system leverages the rich contextual information provided by satellite images, allowing for enhanced decision-making and mission planning. The combination of visual analysis by the VLM and natural language processing by GPT provides the user with a path-and-action set, making aerial operations more efficient and accessible. The newly developed method showed a 22% difference in the length of the created trajectories and a mean error of 34.22 m (Euclidean distance) in locating objects of interest on the map, evaluated with a K-Nearest Neighbors (KNN) approach.
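A sketch of the error metric reported above: for each generated point of interest, take the Euclidean distance to its nearest ground-truth object (K = 1 nearest neighbour) and average over all points. The coordinates are illustrative, not data from the paper:

import math

def mean_nn_error(predicted, ground_truth):
    """Mean Euclidean distance from each predicted point to its nearest
    ground-truth point."""
    def nearest(p):
        return min(math.dist(p, g) for g in ground_truth)
    return sum(nearest(p) for p in predicted) / len(predicted)

if __name__ == "__main__":
    pred = [(10.0, 5.0), (42.0, 18.0)]   # generated object locations (m)
    gt = [(12.0, 7.0), (40.0, 15.0)]     # ground-truth object locations (m)
    print(f"mean error: {mean_nn_error(pred, gt):.2f} m")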
TiltXter: CNN-based Electro-tactile Rendering of Tilt Angle for Telemanipulation of Pasteur Pipettes
Cabrera, Miguel Altamirano, Tirado, Jonathan, Fedoseev, Aleksey, Sautenkov, Oleg, Poliakov, Vladimir, Kopanev, Pavel, Tsetserukou, Dzmitry
The shape of deformable objects can change drastically during grasping by robotic grippers, causing an ambiguous perception of their alignment and hence resulting in errors in robot positioning and telemanipulation. Rendering clear tactile patterns is fundamental to increasing users' precision and dexterity through tactile haptic feedback during telemanipulation. Therefore, different methods have to be studied to decode the sensors' data into haptic stimuli. This work presents a telemanipulation system for plastic pipettes that consists of a Force Dimension Omega.7 haptic interface endowed with two electro-stimulation arrays and two tactile sensor arrays embedded in the 2-finger Robotiq gripper. We propose a novel approach based on convolutional neural networks (CNN) to detect the tilt of deformable objects. The CNN generates a tactile pattern based on recognized tilt data to render further electro-tactile stimuli provided to the user during the telemanipulation. The study has shown that using the CNN algorithm, tilt recognition by users increased from 23.13% with the downsized data to 57.9%, and the success rate during teleoperation increased from 53.12% using the downsized data to 92.18% using the tactile patterns generated by the CNN.
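An illustrative PyTorch sketch of a small CNN that maps a pair of tactile sensor frames to a tilt-angle class; the layer sizes, input resolution, and number of classes are assumptions, not the architecture used in the paper:

import torch
import torch.nn as nn

class TiltCNN(nn.Module):
    def __init__(self, n_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1),  # 2 tactile arrays
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

if __name__ == "__main__":
    frames = torch.randn(1, 2, 8, 16)   # one batch of paired tactile frames
    logits = TiltCNN()(frames)
    print(logits.argmax(dim=1))          # predicted tilt class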
GazeRace: Revolutionizing Remote Piloting with Eye-Gaze Control
Tokmurziyev, Issatay, Serpiva, Valerii, Fedoseev, Alexey, Cabrera, Miguel Altamirano, Tsetserukou, Dzmitry
This paper introduces the GazeRace method for drone navigation, employing a computer vision interface facilitated by eye-tracking technology. This interface is designed to be compatible with a single camera and uses a convolutional neural network to convert eye movements into control commands for the drone. Experimental validation demonstrates that users equipped with the eye-tracking interface achieve performance comparable to a traditional remote control interface when completing a drone racing task. Ten participants completed flight tests in which they navigated a drone through a racing track in a Gazebo simulation environment. Using the eye-tracking interface to navigate the racing gates, users reduced the drone trajectory length by 18% (73.44 m vs. 89.29 m). The time taken to complete the route with the eye-tracking method (average of 70.01 seconds) was only 3.5% slower than with the remote control method, indicating comparable efficiency of the interface. It is also worth noting that four of the participants completed the race with an average time 25.9% faster than that of the other participants. In addition, users reported lower (better) scores for performance (M = 34.0, SD = 14.2) and frustration (M = 30.5, SD = 9.2) with the eye-tracking interface than with the baseline remote controller (performance: M = 63.0, SD = 10.1; frustration: M = 49.0, SD = 11.7). The hedonic quality was also rated highly by the users in the UEQ questionnaire (M = 1.65, SD = 0.45).
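A hypothetical sketch of the final mapping step: once the network has produced a normalised gaze position, the offset from the screen centre is turned into drone velocity commands with a small dead zone. The gains, dead-zone size, and command names are illustrative assumptions:

def gaze_to_velocity(gaze_x, gaze_y, gain=2.0, dead_zone=0.1):
    """gaze_x, gaze_y in [-1, 1] relative to the screen centre."""
    def axis(v):
        return 0.0 if abs(v) < dead_zone else gain * v
    return {"yaw_rate": axis(gaze_x), "climb_rate": axis(-gaze_y)}

if __name__ == "__main__":
    print(gaze_to_velocity(0.4, -0.05))  # {'yaw_rate': 0.8, 'climb_rate': 0.0}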
MoveTouch: Robotic Motion Capturing System with Wearable Tactile Display to Achieve Safe HRI
Alabbas, Ali, Cabrera, Miguel Altamirano, Sayed, Mohamed, Alyounes, Oussama, Liu, Qian, Tsetserukou, Dzmitry
The collaborative robot market is flourishing as there is a trend towards simplification, modularity, and increased flexibility on the production line. However, when humans and robots collaborate in a shared environment, human safety must be a priority. We introduce a novel wearable robotic system to enhance safety during Human-Robot Interaction (HRI). The proposed wearable robot is designed to hold a fiducial marker and maintain its visibility to a motion capture system, which, in turn, localizes the user's hand with good accuracy and low latency and provides vibrotactile feedback to the user's wrist. The vibrotactile feedback guides the user's hand movement during collaborative tasks in order to increase safety and enhance collaboration efficiency. A user study was conducted to assess the recognition and discriminability of ten designed vibration patterns applied to the dorsal (upper) and volar (lower) sides of the user's wrist. The results show that the pattern recognition rate on the volar side was higher, with an average of 75.64% among all users. Four patterns with a high recognition rate were chosen to be incorporated into our system. A second experiment was carried out to evaluate users' responses to the chosen patterns in real-world collaborative tasks. Results show that all participants responded to the patterns correctly, and the average response time for the patterns was between 0.24 and 2.41 seconds.
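An illustrative sketch of how wrist vibration patterns could be encoded and played back; the motor layout, timings, and pattern names are assumptions for the example, not the patterns evaluated in the study:

import time

# Each pattern is a list of (active_motor_indices, duration_s) steps.
PATTERNS = {
    "move_left": [((0,), 0.2), ((1,), 0.2), ((2,), 0.2)],  # sequential sweep
    "stop":      [((0, 1, 2, 3), 0.4)],                    # all motors at once
}

def play(pattern_name, set_motors):
    """set_motors(indices) switches on the given motors and off all others."""
    for motors, duration in PATTERNS[pattern_name]:
        set_motors(motors)
        time.sleep(duration)
    set_motors(())  # all off at the end of the pattern

if __name__ == "__main__":
    play("stop", lambda m: print("active motors:", m))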
Robots Can Feel: LLM-based Framework for Robot Ethical Reasoning
Lykov, Artem, Cabrera, Miguel Altamirano, Gbagbe, Koffivi Fidèle, Tsetserukou, Dzmitry
This paper presents the development of a novel ethical reasoning framework for robots. "Robots Can Feel" is the first system for robots that utilizes a combination of logic and human-like emotion simulation to make decisions in morally complex situations akin to humans. The key feature of the approach is the management of the Emotion Weight Coefficient, a customizable parameter that assigns the role of emotions in robot decision-making. The system aims to serve as a tool that can equip robots of any form and purpose with ethical behavior close to human standards. The system is independent of both the robot platform and the choice of the base model. During the evaluation, the system was tested on eight state-of-the-art Large Language Models (LLMs), including both commercial and open-source models developed by various companies and countries. The research demonstrated that, regardless of the model choice, the Emotion Weight Coefficient influences the robot's decision similarly. According to ANOVA analysis, the use of different Emotion Weight Coefficients influenced the final decision in a range of situations, such as a request for a dietary violation (F(4, 35) = 11.2, p = 0.0001) and an animal compassion situation (F(4, 35) = 8.5441, p = 0.0001). A demonstration code repository is provided at: https://github.com/TemaLykov/robots_can_feel
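A toy illustration of the role of an Emotion Weight Coefficient: a logic-derived score and an emotion-derived score for each candidate action are blended with the weight and the highest-scoring action is chosen. The hand-set numbers are purely illustrative; in the paper the reasoning is carried out by an LLM, not by fixed scores:

def decide(actions, emotion_weight):
    """actions: {name: (logic_score, emotion_score)}, scores in [0, 1]."""
    def blended(scores):
        logic, emotion = scores
        return (1.0 - emotion_weight) * logic + emotion_weight * emotion
    return max(actions, key=lambda a: blended(actions[a]))

if __name__ == "__main__":
    actions = {"serve dessert": (0.3, 0.9), "refuse politely": (0.8, 0.4)}
    print(decide(actions, emotion_weight=0.0))   # refuse politely
    print(decide(actions, emotion_weight=0.9))   # serve dessert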
Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations
Gbagbe, Koffivi Fidèle, Cabrera, Miguel Altamirano, Alabbas, Ali, Alyounes, Oussama, Lykov, Artem, Tsetserukou, Dzmitry
This research introduces the Bi-VLA (Vision-Language-Action) model, a novel system designed for bimanual robotic dexterous manipulation that seamlessly integrates vision for scene understanding, language comprehension for translating human instructions into executable code, and physical action generation. We evaluated the system's functionality through a series of household tasks, including the preparation of a desired salad upon human request. Bi-VLA demonstrates the ability to interpret complex human instructions, perceive and understand the visual context of ingredients, and execute precise bimanual actions to prepare the requested salad. We assessed the system's performance in terms of accuracy, efficiency, and adaptability to different salad recipes and human preferences through a series of experiments. Our results show a 100% success rate in generating the correct executable code by the Language Module, a 96.06% success rate in detecting specific ingredients by the Vision Module, and an overall success rate of 83.4%.
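A sketch of one way the language module's generated code could be handled before execution: the plan is parsed as a list of calls and checked against a whitelist of robot skills. The skill names and plan format are assumptions for illustration, not the Bi-VLA interface:

ALLOWED_SKILLS = {"pick", "place", "cut", "pour"}

def validate_plan(plan_lines):
    """plan_lines: e.g. ['pick(left, cucumber)', 'cut(right, cucumber)']."""
    for line in plan_lines:
        skill = line.split("(", 1)[0].strip()
        if skill not in ALLOWED_SKILLS:
            raise ValueError(f"unknown skill in generated code: {skill!r}")
    return plan_lines

if __name__ == "__main__":
    plan = ["pick(left, cucumber)", "cut(right, cucumber)", "place(left, bowl)"]
    print(validate_plan(plan))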
DogSurf: Quadruped Robot Capable of GRU-based Surface Recognition for Blind Person Navigation
Bazhenov, Artem, Berman, Vladimir, Satsevich, Sergei, Shalopanova, Olga, Cabrera, Miguel Altamirano, Lykov, Artem, Tsetserukou, Dzmitry
This paper introduces DogSurf, a new approach to using quadruped robots to help visually impaired people navigate in the real world. The presented method allows the quadruped robot to detect slippery surfaces and to use audio and haptic feedback to inform the user when to stop. A state-of-the-art GRU-based neural network architecture with a mean accuracy of 99.925% was proposed for the task of multiclass surface classification for quadruped robots. A dataset was collected on a Unitree Go1 Edu robot. The dataset and code have been posted to the public domain.
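An illustrative PyTorch sketch of a GRU classifier over a window of robot sensor readings (e.g. IMU and joint signals) producing a surface class; the feature count, hidden size, and number of classes are assumptions, not the architecture from the paper:

import torch
import torch.nn as nn

class SurfaceGRU(nn.Module):
    def __init__(self, n_features=12, hidden=64, n_classes=4):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                  # x: (batch, time, features)
        _, h = self.gru(x)                 # h: (num_layers, batch, hidden)
        return self.head(h[-1])

if __name__ == "__main__":
    window = torch.randn(1, 50, 12)        # 50 time steps of 12 sensor channels
    print(SurfaceGRU()(window).argmax(dim=1))   # predicted surface class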