A multitask transformer to sign language translation using motion gesture primitives
López, Fredy Alejandro Mendoza, Rodriguez, Jefferson, Martínez, Fabio
–arXiv.org Artificial Intelligence
The absence of effective communication with the deaf population represents the main social gap affecting this community. Furthermore, sign language, the main communication tool of deaf people, is unlettered, i.e., there is no formal written representation. Consequently, the main challenge today is automatic translation between spatiotemporal sign representations and natural text language. Recent approaches are based on encoder-decoder architectures, where the most relevant strategies integrate attention modules to enhance non-linear correspondences. Many of these approaches require complex training and architectural schemes to achieve reasonable predictions because of the absence of intermediate text projections. However, they are still limited by the redundant background information of the video sequences. This work introduces a multitask transformer architecture that includes a gloss learning representation to achieve a more suitable translation. The proposed approach also includes a dense motion representation that enhances gestures and includes kinematic information, a key component in sign language. This representation makes it possible to avoid background information and exploit the geometry of the signs. In addition, it includes spatiotemporal representations that facilitate the alignment between gestures and glosses as an intermediate textual representation.

Keywords: Sign language translation, gloss, transformer, deep learning representations

2010 MSC: 00-01, 99-00

1. Introduction

Approximately 1.5 billion people worldwide have some degree of hearing loss. Sign languages (SLs) are composed of visuo-spatial gestural movements and expressions, together with complex manual and non-manual interactions. Today there are more than 150 official SLs, with multiple variations in each country. Like any language, SLs have an intrinsic grammatical richness, with multiple gestural and expressive variations.
These aspects make the modeling of SLs a very challenging task, even for the most advanced computer vision and representation learning methodologies. In fact, signs do not have a direct written representation, which makes the language harder to structure and poses major challenges in finding correspondences with textual languages.
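
The multitask idea described in the abstract, learning glosses as an intermediate representation alongside the translation itself, can be illustrated with a combined training objective. The sketch below is only an assumption about how such an objective might look and is not the authors' implementation: it uses plain per-token cross-entropy for both tasks (gloss recognition in practice often relies on an alignment-free loss such as CTC), and the names `multitask_loss`, `w_gloss`, and `w_text` are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of index `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def sequence_loss(logit_seq, target_seq):
    """Mean per-token cross-entropy over an already aligned, non-empty sequence."""
    assert len(logit_seq) == len(target_seq) and len(target_seq) > 0
    total = sum(cross_entropy(l, t) for l, t in zip(logit_seq, target_seq))
    return total / len(target_seq)


def multitask_loss(gloss_logits, gloss_targets, text_logits, text_targets,
                   w_gloss=1.0, w_text=1.0):
    """Weighted sum of a gloss loss and a translation loss.

    The weights are hypothetical hyperparameters, not values from the paper.
    Returns (total, gloss_loss, text_loss).
    """
    gloss = sequence_loss(gloss_logits, gloss_targets)
    text = sequence_loss(text_logits, text_targets)
    return w_gloss * gloss + w_text * text, gloss, text


if __name__ == "__main__":
    # Toy example: one gloss step over 2 classes, one text step over 4 classes.
    total, gloss, text = multitask_loss([[0.0, 0.0]], [0],
                                        [[0.0, 0.0, 0.0, 0.0]], [1])
    print(total, gloss, text)
```

In this kind of setup both heads share the same encoder, so lowering `w_gloss` shifts the shared representation toward the translation task, while raising it emphasizes the intermediate gloss supervision.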
Mar-25-2025
- Genre:
- Research Report (1.00)
- Industry:
- Health & Medicine > Therapeutic Area > Otolaryngology (0.34)
- Technology: