hand gesture
SignMouth: Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion
Wu, Wenfang, Yuan, Tingting, Li, Yupeng, Wang, Daling, Fu, Xiaoming
Sign language translation (SLT) aims to translate natural language from sign language videos, serving as a vital bridge for inclusive communication. While recent advances leverage powerful visual backbones and large language models, most approaches mainly focus on manual signals (hand gestures) and tend to overlook non-manual cues like mouthing. In fact, mouthing conveys essential linguistic information in sign languages and plays a crucial role in disambiguating visually similar signs. In this paper, we propose SignClip, a novel framework to improve the accuracy of sign language translation. It fuses manual and non-manual cues, specifically spatial gesture and lip movement features. Besides, SignClip introduces a hierarchical contrastive learning framework with multi-level alignment objectives, ensuring semantic consistency across sign-lip and visual-text modalities. Extensive experiments on two benchmark datasets, PHOENIX14T and How2Sign, demonstrate the superiority of our approach. For example, on PHOENIX14T, in the Gloss-free setting, SignClip surpasses the previous state-of-the-art model SpaMo, improving BLEU-4 from 24.32 to 24.71, and ROUGE from 46.57 to 48.38.
- Europe > North Sea (0.04)
- Atlantic Ocean > North Atlantic Ocean > North Sea (0.04)
- Asia > China > Hong Kong (0.04)
- (5 more...)
Robust Understanding of Human-Robot Social Interactions through Multimodal Distillation
Bian, Tongfei, Chollet, Mathieu, Guha, Tanaya
There is a growing need for social robots and intelligent agents that can effectively interact with and support users. For the interactions to be seamless, the agents need to analyse social scenes and behavioural cues from their (robot's) perspective. Works that model human-agent interactions in social situations are few; and even those existing ones are computationally too intensive to be deployed in real time or perform poorly in real-world scenarios when only limited information is available. We propose a knowledge distillation framework that models social interactions through various multimodal cues, and yet is robust against incomplete and noisy information during inference. We train a teacher model with multimodal input (body, face and hand gestures, gaze, raw images) that transfers knowledge to a student model which relies solely on body pose. Extensive experiments on two publicly available human-robot interaction datasets demonstrate that our student model achieves an average accuracy gain of 14.75% over competitive baselines on multiple downstream social understanding tasks, even with up to 51% of its input being corrupted. The student model is also highly efficient - less than 1% in size of the teacher model in terms of parameters and its latency is 11.9% of the teacher model. Our code and related data are available at github.com/biantongfei/SocialEgoMobile.
- Europe > Ireland > Leinster > County Dublin > Dublin (0.05)
- Europe > United Kingdom > Scotland > City of Glasgow > Glasgow (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > New York > New York County > New York City (0.04)
Human Interaction for Collaborative Semantic SLAM using Extended Reality
Ribeiro, Laura, Shaheer, Muhammad, Fernandez-Cortizas, Miguel, Tourani, Ali, Voos, Holger, Sanchez-Lopez, Jose Luis
Abstract-- Semantic SLAM (Simultaneous Localization and Mapping) systems enrich robot maps with structural and semantic information, enabling robots to operate more effectively in complex environments. However, these systems struggle in real-world scenarios with occlusions, incomplete data, or ambiguous geometries, as they cannot fully leverage the higher-level spatial and semantic knowledge humans naturally apply. We introduce HICS-SLAM, a Human-in-the-Loop semantic SLAM framework that uses a shared extended reality environment for real-time collaboration. The system allows human operators to directly interact with and visualize the robot's 3D scene graph, and add high-level semantic concepts (e.g., rooms or structural entities) into the mapping process. We propose a graph-based semantic fusion methodology that integrates these human interventions with robot perception, enabling scalable collaboration for enhanced situational awareness. Experimental evaluations on real-world construction site datasets demonstrate improvements in room detection accuracy, map precision, and semantic completeness compared to automated baselines, demonstrating both the effectiveness of the approach and its potential for future extensions.
- Asia > Japan (0.04)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
- (2 more...)
X-UniMotion: Animating Human Images with Expressive, Unified and Identity-Agnostic Motion Latents
Song, Guoxian, Xu, Hongyi, Zhao, Xiaochen, Xie, You, Gu, Tianpei, Li, Zenan, Zhang, Chenxu, Luo, Linjie
We present X-UniMotion, a unified and expressive implicit latent representation for whole-body human motion, encompassing facial expressions, body poses, and hand gestures. Unlike prior motion transfer methods that rely on explicit skeletal poses and heuristic cross-identity adjustments, our approach encodes multi-granular motion directly from a single image into a compact set of four disentangled latent tokens -- one for facial expression, one for body pose, and one for each hand. These motion latents are both highly expressive and identity-agnostic, enabling high-fidelity, detailed cross-identity motion transfer across subjects with diverse identities, poses, and spatial configurations. To achieve this, we introduce a self-supervised, end-to-end framework that jointly learns the motion encoder and latent representation alongside a DiT-based video generative model, trained on large-scale, diverse human motion datasets. Motion-identity disentanglement is enforced via 2D spatial and color augmentations, as well as synthetic 3D renderings of cross-identity subject pairs under shared poses. Furthermore, we guide motion token learning with auxiliary decoders that promote fine-grained, semantically aligned, and depth-aware motion embeddings. Extensive experiments show that X-UniMotion outperforms state-of-the-art methods, producing highly expressive animations with superior motion fidelity and identity preservation.
Meta's new wearable lets you control screens hands-free
The glasses' sensor technology opens up new possibilities for research and development in augmented reality applications. Meta's new gesture control wristband might just be the most seamless way to control a computer yet. And no, it doesn't require surgery, a camera, or even a touchscreen. All it needs is your wrist. This futuristic device uses electrical signals from your muscles to understand what your hand wants to do, even if it never actually moves.
Six-DoF Hand-Based Teleoperation for Omnidirectional Aerial Robots
Li, Jinjie, Li, Jiaxuan, Kaneko, Kotaro, Liu, Haokun, Shu, Liming, Zhao, Moju
Omnidirectional aerial robots offer full 6-DoF independent control over position and orientation, making them popular for aerial manipulation. Although advancements in robotic autonomy, human operation remains essential in complex aerial environments. Existing teleoperation approaches for multirotors fail to fully leverage the additional DoFs provided by omnidirectional rotation. Additionally, the dexterity of human fingers should be exploited for more engaged interaction. In this work, we propose an aerial teleoperation system that brings the rotational flexibility of human hands into the unbounded aerial workspace. Our system includes two motion-tracking marker sets--one on the shoulder and one on the hand--along with a data glove to capture hand gestures. Using these inputs, we design four interaction modes for different tasks, including Spherical Mode and Cartesian Mode for long-range moving, Operation Mode for precise manipulation, as well as Locking Mode for temporary pauses, where the hand gestures are utilized for seamless mode switching. We evaluate our system on a vertically mounted valve-turning task in the real world, demonstrating how each mode contributes to effective aerial manipulation. This interaction framework bridges human dexterity with aerial robotics, paving the way for enhanced aerial teleoperation in unstructured environments.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.05)
- Asia > China > Liaoning Province > Dalian (0.04)
Beyond the Monitor: Mixed Reality Visualization and AI for Enhanced Digital Pathology Workflow
Veerla, Jai Prakash, Guttikonda, Partha Sai, Shang, Helen H., Nasr, Mohammad Sadegh, Torres, Cesar, Luber, Jacob M.
Pathologists rely on gigapixel whole-slide images (WSIs) to diagnose diseases like cancer, yet current digital pathology tools hinder diagnosis. The immense scale of WSIs, often exceeding 100,000 X 100,000 pixels, clashes with the limited views traditional monitors offer. This mismatch forces constant panning and zooming, increasing pathologist cognitive load, causing diagnostic fatigue, and slowing pathologists' adoption of digital methods. PathVis, our mixed-reality visualization platform for Apple Vision Pro, addresses these challenges. It transforms the pathologist's interaction with data, replacing cumbersome mouse-and-monitor navigation with intuitive exploration using natural hand gestures, eye gaze, and voice commands in an immersive workspace. PathVis integrates AI to enhance diagnosis. An AI-driven search function instantly retrieves and displays the top five similar patient cases side-by-side, improving diagnostic precision and efficiency through rapid comparison. Additionally, a multimodal conversational AI assistant offers real-time image interpretation support and aids collaboration among pathologists across multiple Apple devices. By merging the directness of traditional pathology with advanced mixed-reality visualization and AI, PathVis improves diagnostic workflows, reduces cognitive strain, and makes pathology practice more effective and engaging. The PathVis source code and a demo video are publicly available at: https://github.com/jaiprakash1824/Path_Vis
- North America > United States > Texas (0.04)
- Europe > Monaco (0.04)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Diagnostic Medicine (1.00)
GAT-Grasp: Gesture-Driven Affordance Transfer for Task-Aware Robotic Grasping
Wang, Ruixiang, Zhou, Huayi, Yao, Xinyue, Liu, Guiliang, Jia, Kui
Achieving precise and generalizable grasping across diverse objects and environments is essential for intelligent and collaborative robotic systems. However, existing approaches often struggle with ambiguous affordance reasoning and limited adaptability to unseen objects, leading to suboptimal grasp execution. In this work, we propose GAT-Grasp, a gesture-driven grasping framework that directly utilizes human hand gestures to guide the generation of task-specific grasp poses with appropriate positioning and orientation. Specifically, we introduce a retrieval-based affordance transfer paradigm, leveraging the implicit correlation between hand gestures and object affordances to extract grasping knowledge from large-scale human-object interaction videos. By eliminating the reliance on pre-given object priors, GAT-Grasp enables zero-shot generalization to novel objects and cluttered environments. Real-world evaluations confirm its robustness across diverse and unseen scenarios, demonstrating reliable grasp execution in complex task settings.
FORS-EMG: A Novel sEMG Dataset for Hand Gesture Recognition Across Multiple Forearm Orientations
Rumman, Umme, Ferdousi, Arifa, Hossain, Md. Sazzad, Islam, Md. Johirul, Ahmad, Shamim, Reaz, Mamun Bin Ibne, Islam, Md. Rezaul
Surface electromyography (sEMG) signal holds great potential in the research fields of gesture recognition and the development of robust prosthetic hands. However, the sEMG signal is compromised with physiological or dynamic factors such as forearm orientations, electrode displacement, limb position, etc. The existing dataset of sEMG is limited as they often ignore these dynamic factors during recording. In this paper, we have proposed a dataset of multichannel sEMG signals to evaluate common daily living hand gestures performed with three forearm orientations. The dataset is collected from nineteen intact-limed subjects, performing twelve hand gestures with three forearm orientations: supination, rest, and pronation.Additionally, two electrode placement positions (elbow and forearm) are considered while recording the sEMG signal. The dataset is open for public access in MATLAB file format. The key purpose of the dataset is to offer an extensive resource for developing a robust machine learning classification algorithm and hand gesture recognition applications. We validated the high quality of the dataset by assessing the signal quality matrices and classification performance, utilizing popular machine learning algorithms, various feature extraction methods, and variable window size. The obtained result highlighted the significant potential of this novel sEMG dataset that can be used as a benchmark for developing hand gesture recognition systems, conducting clinical research on sEMG, and developing human-computer interaction applications. Dataset:https://www.kaggle.com/datasets/ummerummanchaity/fors-emg-a-novel-semg-dataset/data
- Asia > Malaysia (0.05)
- North America > United States (0.04)
- Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.04)
- (6 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
Non-verbal Interaction and Interface with a Quadruped Robot using Body and Hand Gestures: Design and User Experience Evaluation
Shin, Soohyun, Evetts, Trevor, Saylor, Hunter, Kim, Hyunji, Woo, Soojin, Rhee, Wonhwha, Kim, Seong-Woo
In recent years, quadruped robots have attracted significant attention due to their practical advantages in maneuverability, particularly when navigating rough terrain and climbing stairs. As these robots become more integrated into various industries, including construction and healthcare, researchers have increasingly focused on developing intuitive interaction methods such as speech and gestures that do not require separate devices such as keyboards or joysticks. This paper aims at investigating a comfortable and efficient interaction method with quadruped robots that possess a familiar form factor. To this end, we conducted two preliminary studies to observe how individuals naturally interact with a quadruped robot in natural and controlled settings, followed by a prototype experiment to examine human preferences for body-based and hand-based gesture controls using a Unitree Go1 Pro quadruped robot. We assessed the user experience of 13 participants using the User Experience Questionnaire and measured the time taken to complete specific tasks. The findings of our preliminary results indicate that humans have a natural preference for communicating with robots through hand and body gestures rather than speech. In addition, participants reported higher satisfaction and completed tasks more quickly when using body gestures to interact with the robot. This contradicts the fact that most gesture-based control technologies for quadruped robots are hand-based. The video is available at https://youtu.be/rysv1p1zvp4.
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > Florida > Alachua County > Gainesville (0.14)
- Asia > South Korea > Seoul > Seoul (0.04)
- North America > United States > Maryland > Baltimore (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)