mediapipe


Automatic Sign Language Recognition: A Hybrid CNN-LSTM Approach Based on Mediapipe

Takouchouang, Fraisse Sacré, Vinh, Ho Tuong

arXiv.org Artificial Intelligence

Sign languages play a crucial role in the communication of deaf communities, but they are often marginalized, limiting access to essential services such as healthcare and education. This study proposes an automatic sign language recognition system based on a hybrid CNN-LSTM architecture, using Mediapipe for gesture keypoint extraction. Developed with Python, TensorFlow and Streamlit, the system provides real-time gesture translation. The results show an average accuracy of 92%, with very good performance for distinct gestures such as "Hello" and "Thank you". However, some confusions remain for visually similar gestures, such as "Call" and "Yes". This work opens up interesting perspectives for applications in various fields such as healthcare, education and public services.
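The paper's implementation is not reproduced here, but the pipeline it describes (Mediapipe keypoint extraction feeding a hybrid CNN-LSTM built with TensorFlow) can be sketched roughly as follows. The clip length, feature size, and class count are illustrative assumptions, not the authors' configuration.

```python
# Illustrative sketch (not the authors' code): MediaPipe hand keypoints
# feeding a hybrid CNN-LSTM gesture classifier in Keras.
import cv2
import numpy as np
import mediapipe as mp
import tensorflow as tf

hands = mp.solutions.hands.Hands(max_num_hands=1)

def extract_keypoints(frame_bgr):
    """Flat (21 * 3,) vector of one hand's landmarks, or zeros if none found."""
    results = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        lm = results.multi_hand_landmarks[0].landmark
        return np.array([[p.x, p.y, p.z] for p in lm]).flatten()
    return np.zeros(21 * 3)

# Assumed shapes: 30-frame clips, 63 features per frame, 10 gesture classes.
model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(64, 3, activation="relu", input_shape=(30, 63)),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```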


Evaluation of facial landmark localization performance in a surgical setting

Frajtag, Ines, Švaco, Marko, Šuligoj, Filip

arXiv.org Artificial Intelligence

Robotics and computer vision are finding increasingly widespread application in many fields, including medicine. Many face detection algorithms have found applications in neurosurgery, ophthalmology, and plastic surgery. A common challenge in using these algorithms is coping with variable lighting conditions and with the range of positions from which patients must be identified and precisely localized. The proposed experiment tests the MediaPipe algorithm for detecting facial landmarks in a controlled setting, using a robotic arm that automatically adjusts positions while the surgical light and the phantom remain fixed. The results of this study demonstrate that improved facial landmark detection accuracy under surgical lighting significantly enhances detection performance at larger yaw and pitch angles. The increase in standard deviation/dispersion stems from imprecise detection of selected facial landmarks. This analysis supports a discussion of the potential integration of the MediaPipe algorithm into medical procedures.
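For readers unfamiliar with the library, a minimal sketch of MediaPipe Face Mesh landmark localization of the kind evaluated in the study might look like this; the image source and the pixel conversion are illustrative, not the experiment's actual setup.

```python
# Minimal sketch of MediaPipe facial landmark localization on a still image.
import cv2
import mediapipe as mp

mp_face_mesh = mp.solutions.face_mesh

with mp_face_mesh.FaceMesh(static_image_mode=True,
                           refine_landmarks=True,
                           max_num_faces=1) as face_mesh:
    image = cv2.imread("phantom_view.png")            # hypothetical capture
    results = face_mesh.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    if results.multi_face_landmarks:
        h, w = image.shape[:2]
        for lm in results.multi_face_landmarks[0].landmark:
            x_px, y_px = lm.x * w, lm.y * h           # normalized -> pixels
```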


TSLFormer: A Lightweight Transformer Model for Turkish Sign Language Recognition Using Skeletal Landmarks

Ertürk, Kutay, Altınışık, Furkan, Sarıaltın, İrem, Gerek, Ömer Nezih

arXiv.org Artificial Intelligence

This study presents TSLFormer, a light and robust word-level Turkish Sign Language (TID) recognition model that treats sign gestures as an ordered, string-like language. Instead of working with raw RGB or depth videos, our method uses only the 3D joint positions (articulation points) extracted with Google's Mediapipe library, focusing on hand and torso skeletal locations. This yields an efficient reduction in input dimensionality while preserving the important semantic information of the gesture. Our approach recasts sign language recognition as sequence-to-sequence translation, drawing inspiration from the linguistic nature of sign languages and the success of transformers at natural language translation. Because TSLFormer adapts the transformer's self-attention mechanism, it effectively represents the temporal co-occurrence within a sign sequence, emphasizing significant movement patterns over time, much as words are referenced in a sentence. Experimented and validated on the AUTSL dataset, which holds over 36,000 sign samples spanning 226 different words, TSLFormer achieves competitive performance with minimal computational demands. The experiments evidence a rich spatiotemporal understanding of signs and show that, using only joint landmarks, recognition is feasible within real-time, mobile, and assistive technologies facilitating communication for hearing-impaired individuals. Sign language is an essential communication method for the hearing impaired, expressing ideas and sentiments through hand gestures, facial expressions, and body movement. Unlike spoken languages, which employ auditory and verbal modalities, sign language uses visual and spatial modalities to convey meaning. However, because relatively few people are proficient in sign language, communication gaps persist that hinder inclusion, particularly in daily social interaction and in employment, educational, and healthcare settings.
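A lightweight transformer encoder over per-frame joint vectors, in the spirit of TSLFormer, could be sketched as below. The 226-class output matches the AUTSL vocabulary mentioned in the abstract; the sequence length, feature size, and layer widths are assumptions.

```python
# Rough sketch of a single-block transformer encoder over joint-position
# sequences; dimensions are assumed, not the paper's configuration.
import tensorflow as tf

SEQ_LEN, N_FEATURES, N_CLASSES = 60, 126, 226   # e.g. 2 hands x 21 joints x 3

inputs = tf.keras.Input(shape=(SEQ_LEN, N_FEATURES))
x = tf.keras.layers.Dense(128)(inputs)               # per-frame embedding
attn = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=32)(x, x)
x = tf.keras.layers.LayerNormalization()(x + attn)   # residual self-attention
ff = tf.keras.layers.Dense(256, activation="relu")(x)
ff = tf.keras.layers.Dense(128)(ff)
x = tf.keras.layers.LayerNormalization()(x + ff)     # residual feed-forward
x = tf.keras.layers.GlobalAveragePooling1D()(x)      # pool over time
outputs = tf.keras.layers.Dense(N_CLASSES, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
```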


A Real-Time Gesture-Based Control Framework

Khazaei, Mahya, Bahrani, Ali, Tzanetakis, George

arXiv.org Artificial Intelligence

We introduce a real-time, human-in-the-loop gesture control framework that can dynamically adapt audio and music based on human movement by analyzing live video input. By creating a responsive connection between visual and auditory stimuli, this system enables dancers and performers not only to respond to music but also to influence it through their movements. Designed for live performances, interactive installations, and personal use, it offers an immersive experience where users can shape the music in real time. The framework integrates computer vision and machine learning techniques to track and interpret motion, allowing users to manipulate audio elements such as tempo, pitch, effects, and playback sequence. With ongoing training, it achieves user-independent functionality, requiring as few as 50 to 80 labeled samples per simple gesture. This framework combines gesture training, cue mapping, and audio manipulation to create a dynamic, interactive experience: gestures are interpreted as input signals, mapped to sound control commands, and used to naturally adjust music elements, showcasing the seamless interplay between human interaction and machine response.
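The paper's classifier is not published here; one plausible reading of the "few labeled samples per gesture" claim is a simple nearest-neighbour model over landmark feature vectors mapped to audio commands, as in this hedged sketch. The feature files, gesture labels, and command table are hypothetical.

```python
# Hedged sketch: k-NN over landmark features, mapped to audio cue commands.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# X: (n_samples, n_features) landmark vectors; y: gesture labels.
# Per the paper, 50-80 labeled samples per simple gesture can suffice.
X_train = np.load("gesture_features.npy")   # hypothetical training data
y_train = np.load("gesture_labels.npy")

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

COMMANDS = {"raise_arm": "tempo_up", "lower_arm": "tempo_down",
            "spin": "next_track"}           # assumed cue mapping

def gesture_to_command(feature_vec):
    """Classify one landmark vector and look up its sound control command."""
    label = clf.predict(feature_vec.reshape(1, -1))[0]
    return COMMANDS.get(label)
```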


Real-Time Imitation of Human Head Motions, Blinks and Emotions by Nao Robot: A Closed-Loop Approach

Rayati, Keyhan, Feizi, Amirhossein, Beigy, Alireza, Shahverdi, Pourya, Masouleh, Mehdi Tale, Kalhor, Ahmad

arXiv.org Artificial Intelligence

This paper introduces a novel approach for enabling real-time imitation of human head motion by a Nao robot, with a primary focus on elevating human-robot interaction. Using the robust capabilities of MediaPipe as a computer vision library and DeepFace as an emotion recognition library, this research captures the subtleties of human head motion, including blink actions and emotional expressions, and seamlessly incorporates these indicators into the robot's responses. The result is a comprehensive framework that facilitates precise head imitation within human-robot interactions, using a closed-loop approach that gathers real-time feedback on the robot's imitation performance. This feedback loop ensures a high degree of accuracy in modeling head motion, as evidenced by an R² score of 96.3 for pitch and 98.9 for yaw. Notably, the proposed approach holds promise for improving communication for children with autism, offering them a valuable tool for more effective interaction. In essence, the proposed work explores the integration of real-time head imitation and real-time emotion recognition to enhance human-robot interaction, with potential benefits for individuals with unique communication needs. The field of robotics has come a long way in recent years, with significant advancements in the development of humanoid robots.
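A rough sketch of the sensing side (not the authors' release): MediaPipe Face Mesh for a blink signal plus DeepFace for per-frame emotion. The eye-landmark indices and the blink threshold are common choices assumed here for illustration.

```python
# Illustrative per-frame blink and emotion analysis with MediaPipe + DeepFace.
import cv2
import mediapipe as mp
from deepface import DeepFace

face_mesh = mp.solutions.face_mesh.FaceMesh(refine_landmarks=True)

def eye_openness(lm):
    """Vertical/horizontal ratio of the left eye (common mesh indices)."""
    v = abs(lm[159].y - lm[145].y)   # upper vs. lower eyelid
    h = abs(lm[33].x - lm[133].x)    # eye corners
    return v / h

def analyze_frame(frame_bgr):
    res = face_mesh.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    blink = None
    if res.multi_face_landmarks:
        blink = eye_openness(res.multi_face_landmarks[0].landmark) < 0.15
    # Recent DeepFace versions return a list of per-face result dicts.
    emotion = DeepFace.analyze(frame_bgr, actions=["emotion"],
                               enforce_detection=False)[0]["dominant_emotion"]
    return blink, emotion
```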


Pose-Based Fall Detection System: Efficient Monitoring on Standard CPUs

Mali, Vinayak, Jaiswal, Saurabh

arXiv.org Artificial Intelligence

Falls among elderly residents in assisted living homes pose significant health risks, often leading to injuries and a decreased quality of life. Current fall detection solutions typically rely on sensor-based systems that require dedicated hardware, or on video-based models that demand high computational resources and GPUs for real-time processing. In contrast, this paper presents a robust fall detection system that does not require any additional sensors or high-powered hardware. The system uses pose estimation techniques, combined with threshold-based analysis and a voting mechanism, to effectively distinguish between fall and non-fall activities. For pose detection, we leverage MediaPipe, a lightweight and efficient framework that enables real-time processing on standard CPUs with minimal computational overhead. By analyzing motion, body position, and key pose points, the system processes pose features with a 20-frame buffer, minimizing false positives and maintaining high accuracy even in real-world settings. This unobtrusive, resource-efficient approach provides a practical solution for enhancing resident safety in old age homes, without the need for expensive sensors or high-end computational resources.
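The threshold-plus-voting scheme over a 20-frame buffer could be sketched as follows on MediaPipe Pose output; the hip-drop feature and threshold value are illustrative assumptions rather than the paper's exact features.

```python
# Sketch of threshold-based fall voting over a 20-frame buffer.
from collections import deque
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
pose = mp_pose.Pose()
votes = deque(maxlen=20)          # 20-frame buffer, as in the paper
prev_hip_y = None

def process_frame(frame_bgr, drop_thresh=0.05):
    """Vote 'fall' when the hip keypoint drops fast; majority over the buffer."""
    global prev_hip_y
    res = pose.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if res.pose_landmarks:
        hip_y = res.pose_landmarks.landmark[mp_pose.PoseLandmark.LEFT_HIP].y
        votes.append(prev_hip_y is not None and hip_y - prev_hip_y > drop_thresh)
        prev_hip_y = hip_y        # image y grows downward, so a fall increases y
    return sum(votes) > len(votes) // 2 if votes else False
```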


Emotion estimation from video footage with LSTM

Attrah, Samer

arXiv.org Artificial Intelligence

Emotion estimation is a field that has been studied for a long time, and several machine learning approaches exist. In this paper, we present an LSTM model that processes the blendshapes produced by the MediaPipe library for a face detected in a live camera stream, in order to estimate the main emotion from facial expressions. The model is trained on the FER2013 dataset and delivers 71% accuracy and a 62% F1-score, which meets the accuracy benchmark of the FER2013 dataset with significantly reduced computation costs. https://github.com/Samir-atra/Emotion_estimation_from_video_footage_with_LSTM_ML_algorithm
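A hedged sketch of the described pipeline: MediaPipe's FaceLandmarker (Tasks API) emits 52 blendshape scores per frame, which a small Keras LSTM classifies into the seven FER2013 emotion classes. The window length and layer sizes are assumptions.

```python
# Sketch: per-frame blendshape extraction plus a small LSTM classifier.
import mediapipe as mp
from mediapipe.tasks import python as mp_python
from mediapipe.tasks.python import vision
import tensorflow as tf

options = vision.FaceLandmarkerOptions(
    base_options=mp_python.BaseOptions(model_asset_path="face_landmarker.task"),
    output_face_blendshapes=True)
landmarker = vision.FaceLandmarker.create_from_options(options)

def blendshape_vector(rgb_frame):
    """52 blendshape scores for the first detected face (or None)."""
    result = landmarker.detect(
        mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb_frame))
    if result.face_blendshapes:
        return [c.score for c in result.face_blendshapes[0]]
    return None

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(30, 52)),  # 30-frame window assumed
    tf.keras.layers.Dense(7, activation="softmax"),  # FER2013 emotion classes
])
```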


R2I-rPPG: A Robust Region of Interest Selection Method for Remote Photoplethysmography to Extract Heart Rate

Nagar, Sandeep, Hasegawa-Johnson, Mark, Beiser, David G., Ahuja, Narendra

arXiv.org Artificial Intelligence

The COVID-19 pandemic has underscored the need for low-cost, scalable approaches to measuring contactless vital signs, either during initial triage at a healthcare facility or virtual telemedicine visits. Remote photoplethysmography (rPPG) can accurately estimate heart rate (HR) when applied to close-up videos of healthy volunteers in well-lit laboratory settings. However, results from such highly optimized laboratory studies may not be readily translated to healthcare settings. One significant barrier to the practical application of rPPG in health care is the accurate localization of the region of interest (ROI). Clinical or telemedicine visits may involve sub-optimal lighting, movement artifacts, variable camera angle, and subject distance. This paper presents an rPPG ROI selection method based on 3D facial landmarks and patient head yaw angle. We then demonstrate the robustness of this ROI selection method when coupled to the Plane-Orthogonal-to-Skin (POS) rPPG method when applied to videos of patients presenting to an Emergency Department for respiratory complaints. Our results demonstrate the effectiveness of our proposed approach in improving the accuracy and robustness of rPPG in a challenging clinical environment.
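For context, the Plane-Orthogonal-to-Skin (POS) method the paper couples with its ROI selection can be sketched in NumPy as below (after Wang et al., 2017); the window length is an assumption, and `rgb` stands for per-frame mean RGB values of the selected ROI.

```python
# Sketch of POS pulse extraction from ROI-mean RGB traces, shape (N, 3).
import numpy as np

def pos_pulse(rgb, win=48):
    """Overlap-added POS pulse signal from per-frame mean RGB of the ROI."""
    P = np.array([[0.0, 1.0, -1.0],
                  [-2.0, 1.0, 1.0]])
    n = rgb.shape[0]
    h = np.zeros(n)
    for t in range(n - win + 1):
        C = rgb[t:t + win].T                     # (3, win) window
        Cn = C / C.mean(axis=1, keepdims=True)   # temporal normalization
        S = P @ Cn                               # project onto the POS plane
        p = S[0] + (S[0].std() / (S[1].std() + 1e-9)) * S[1]
        h[t:t + win] += p - p.mean()             # overlap-add
    return h
```

Heart rate would then be read off the dominant frequency of the returned pulse signal, for example via an FFT over a physiologically plausible band.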


Allo-AVA: A Large-Scale Multimodal Conversational AI Dataset for Allocentric Avatar Gesture Animation

Punjwani, Saif, Heck, Larry

arXiv.org Artificial Intelligence

The scarcity of high-quality, multimodal training data severely hinders the creation of lifelike avatar animations for conversational AI in virtual environments. Existing datasets often lack the intricate synchronization between speech, facial expressions, and body movements that characterize natural human communication. To address this critical gap, we introduce Allo-AVA, a large-scale dataset specifically designed for text and audio-driven avatar gesture animation in an allocentric (third person point-of-view) context. Allo-AVA consists of ~1,250 hours of diverse video content, complete with audio, transcripts, and extracted keypoints. Allo-AVA uniquely maps these keypoints to precise timestamps, enabling accurate replication of human movements (body and facial gestures) in synchronization with speech. This comprehensive resource enables the development and evaluation of more natural, context-aware avatar animation models, potentially transforming applications ranging from virtual reality to digital assistants.
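To make the description concrete, the following is a hypothetical illustration of the kind of time-aligned record implied (keypoints mapped to timestamps alongside audio and transcript); the field names and landmark counts are assumptions, not the dataset's published schema.

```python
# Hypothetical record layout; not the actual Allo-AVA schema.
sample_record = {
    "clip_id": "allo_ava_000001",
    "transcript": [
        {"word": "hello", "start_s": 0.42, "end_s": 0.71},
    ],
    "keypoints": [
        {"timestamp_s": 0.433,                   # one entry per video frame
         "body": [[0.51, 0.22, -0.03]] * 33,     # e.g. 33 pose points (x, y, z)
         "face": [[0.50, 0.18, -0.01]] * 468},   # e.g. 468 face mesh points
    ],
    "audio_path": "audio/allo_ava_000001.wav",
}
```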


POSE: Pose estimation Of virtual Sync Exhibit system

Tsui, Hao-Tang, Tuan, Yu-Rou, Chen, Jia-You

arXiv.org Artificial Intelligence

Our project is a portable MetaVerse implementation that uses 3D pose estimation with AI to make virtual avatars perform synchronized actions and interact with the environment. The motivation is that we find it inconvenient to use joysticks and sensors when playing with fitness rings. To replace joysticks and reduce costs, we develop a platform that controls virtual avatars through pose estimation, identifying the movements of real people, and we implement multi-processing to achieve modularization and reduce overall latency. As the Wii swept the world and opened the era of home game consoles, the technology for detecting player movements has become increasingly essential. As Figure 1 shows, Figure 1(a) is the keypoint diagram of different human poses, and Figure 1(b) is what the model obtains after pose estimation. As a pioneer, Wii used
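The multi-process modularization mentioned in the abstract could be laid out roughly as follows: one process captures frames while another runs pose estimation, so capture never blocks on inference. The queue sizes and the `run_pose_model` call are hypothetical stand-ins.

```python
# Sketch of a two-process capture/inference pipeline with bounded queues.
import multiprocessing as mp_proc
import cv2

def capture(frames: mp_proc.Queue):
    cap = cv2.VideoCapture(0)
    while True:
        ok, frame = cap.read()
        if ok and not frames.full():
            frames.put(frame)             # drop frames when inference lags

def estimate(frames: mp_proc.Queue, poses: mp_proc.Queue):
    while True:
        frame = frames.get()
        keypoints = run_pose_model(frame)  # hypothetical estimator call
        poses.put(keypoints)               # consumed by the avatar renderer

if __name__ == "__main__":
    frames, poses = mp_proc.Queue(maxsize=2), mp_proc.Queue(maxsize=2)
    mp_proc.Process(target=capture, args=(frames,), daemon=True).start()
    mp_proc.Process(target=estimate, args=(frames, poses), daemon=True).start()
```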