SignAvatars: A Large-scale 3D Sign Language Holistic Motion Dataset and Benchmark

2310.20436

Published 4/4/2024 by Zhengdi Yu, Shaoli Huang, Yongkang Cheng, Tolga Birdal

Abstract

We present SignAvatars, the first large-scale, multi-prompt 3D sign language (SL) motion dataset designed to bridge the communication gap for Deaf and hard-of-hearing individuals. While there has been an exponentially growing number of research regarding digital communication, the majority of existing communication technologies primarily cater to spoken or written languages, instead of SL, the essential communication method for Deaf and hard-of-hearing communities. Existing SL datasets, dictionaries, and sign language production (SLP) methods are typically limited to 2D as annotating 3D models and avatars for SL is usually an entirely manual and labor-intensive process conducted by SL experts, often resulting in unnatural avatars. In response to these challenges, we compile and curate the SignAvatars dataset, which comprises 70,000 videos from 153 signers, totaling 8.34 million frames, covering both isolated signs and continuous, co-articulated signs, with multiple prompts including HamNoSys, spoken language, and words. To yield 3D holistic annotations, including meshes and biomechanically-valid poses of body, hands, and face, as well as 2D and 3D keypoints, we introduce an automated annotation pipeline operating on our large corpus of SL videos. SignAvatars facilitates various tasks such as 3D sign language recognition (SLR) and the novel 3D SL production (SLP) from diverse inputs like text scripts, individual words, and HamNoSys notation. Hence, to evaluate the potential of SignAvatars, we further propose a unified benchmark of 3D SL holistic motion production. We believe that this work is a significant step forward towards bringing the digital world to the Deaf and hard-of-hearing communities as well as people interacting with them.

Overview

Presents SignAvatars, a large-scale 3D sign language motion dataset designed to improve digital communication for Deaf and hard-of-hearing individuals
Addresses the need for better digital tools and technologies that cater to sign language, the primary communication method for Deaf communities
Introduces an automated annotation pipeline to generate 3D holistic annotations, including body, hand, and facial movements, from a corpus of 70,000 sign language videos

Plain English Explanation

SignAvatars is a new dataset that aims to help bridge the communication gap for Deaf and hard-of-hearing people. While digital technologies have advanced rapidly, most of these tools still primarily support spoken or written languages, rather than sign language, which is the main way Deaf communities communicate.

Existing sign language datasets and tools are often limited to 2D, as creating 3D models and animations of sign language is a complex, labor-intensive process that typically requires specialized experts. SignAvatars addresses this challenge by using an automated system to generate rich 3D annotations, including detailed information about the movements of the body, hands, and face, from a large collection of sign language videos.

This dataset can be used to develop new digital technologies, such as sign language recognition and production systems, that allow Deaf and hearing people to communicate more effectively. By providing a comprehensive, high-quality dataset of 3D sign language content, SignAvatars represents a significant step towards making the digital world more accessible and inclusive for Deaf communities.

Technical Explanation

SignAvatars is a large-scale dataset of 70,000 sign language videos from 153 signers, totaling 8.34 million frames. The dataset includes both isolated signs and continuous, co-articulated sign language sequences, with multiple prompts such as spoken language, individual words, and HamNoSys notation.
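To give a sense of scale, the headline figures can be turned into rough per-video and per-signer averages. The sketch below uses only the counts quoted in the abstract (70,000 videos, 153 signers, 8.34 million frames); the function name `dataset_averages` is made up for illustration, and the results are plain averages that say nothing about how clips are actually distributed across signers or subsets.

```python
# Counts as quoted in the abstract; everything derived below is a simple average.
TOTAL_VIDEOS = 70_000
TOTAL_SIGNERS = 153
TOTAL_FRAMES = 8_340_000


def dataset_averages(videos: int, signers: int, frames: int) -> dict:
    """Return rough averages implied by the headline dataset counts."""
    return {
        "frames_per_video": frames / videos,
        "videos_per_signer": videos / signers,
    }


stats = dataset_averages(TOTAL_VIDEOS, TOTAL_SIGNERS, TOTAL_FRAMES)
print(f"average frames per video: {stats['frames_per_video']:.1f}")
print(f"average videos per signer: {stats['videos_per_signer']:.1f}")
```

On average a clip is therefore a little over a hundred frames long, which is consistent with a mix of short isolated signs and longer continuous sequences.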

To generate 3D holistic annotations, the researchers introduce an automated pipeline that can extract detailed information about the body, hand, and facial movements from the sign language videos. This includes 3D meshes, biomechanically-valid poses, and 2D and 3D keypoints. This approach allows for the creation of high-quality 3D sign language avatars and animations, overcoming the limitations of previous manual and labor-intensive methods.
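Because each frame carries both 2D and 3D keypoints, one simple sanity check a dataset user might run is to project the 3D keypoints into the image and compare them with the 2D ones. The sketch below is an illustration under stated assumptions, not the authors' pipeline: the `FrameAnnotation` record, its field names, the pinhole camera model, and the focal length and image center are all hypothetical, and the summary does not describe the real file format.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


@dataclass
class FrameAnnotation:
    """Hypothetical per-frame record holding matching 2D and 3D keypoints."""
    keypoints_2d: List[Point2D]  # pixel coordinates
    keypoints_3d: List[Point3D]  # camera-space coordinates, z > 0


def project(point: Point3D, focal: float, center: Point2D) -> Point2D:
    """Pinhole projection of a camera-space point onto the image plane."""
    x, y, z = point
    return (focal * x / z + center[0], focal * y / z + center[1])


def mean_reprojection_error(frame: FrameAnnotation, focal: float, center: Point2D) -> float:
    """Average pixel distance between projected 3D keypoints and the 2D keypoints."""
    errors = []
    for p3, p2 in zip(frame.keypoints_3d, frame.keypoints_2d):
        u, v = project(p3, focal, center)
        errors.append(hypot(u - p2[0], v - p2[1]))
    return sum(errors) / len(errors)


# Toy frame with two keypoints; the second 2D keypoint is deliberately offset.
frame = FrameAnnotation(
    keypoints_2d=[(320.0, 240.0), (448.0, 244.0)],
    keypoints_3d=[(0.0, 0.0, 2.0), (0.25, 0.0, 2.0)],
)
error = mean_reprojection_error(frame, focal=1000.0, center=(320.0, 240.0))
print(f"mean reprojection error: {error:.2f} px")
```

A large error on real data would suggest that the 2D and 3D annotations for that frame disagree, or that the assumed camera parameters are wrong.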

The researchers further propose a unified benchmark for 3D sign language holistic motion production, which evaluates how well models generate 3D signing motion from text scripts, individual words, and HamNoSys notation. The dataset itself also supports other tasks, such as 3D sign language recognition.
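This summary does not say which metrics the benchmark reports. As an illustration only, the sketch below computes mean per-joint position error (MPJPE), a metric commonly used to compare predicted and reference 3D motion. Using it here is an assumption, and the function name `mpjpe` and the toy sequences are hypothetical.

```python
from math import dist
from typing import List, Sequence, Tuple

Joint = Tuple[float, float, float]
Frame = Sequence[Joint]


def mpjpe(predicted: List[Frame], reference: List[Frame]) -> float:
    """Mean Euclidean distance between matching joints over all frames."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same number of frames")
    total = 0.0
    count = 0
    for pred_frame, ref_frame in zip(predicted, reference):
        if len(pred_frame) != len(ref_frame):
            raise ValueError("frames must have the same number of joints")
        for pred_joint, ref_joint in zip(pred_frame, ref_frame):
            total += dist(pred_joint, ref_joint)
            count += 1
    return total / count


# Toy motion: two frames, two joints each, coordinates in millimetres.
reference = [
    [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)],
    [(0.0, 10.0, 0.0), (100.0, 10.0, 0.0)],
]
predicted = [
    [(3.0, 4.0, 0.0), (100.0, 0.0, 0.0)],  # first joint is off by 5 mm
    [(0.0, 10.0, 0.0), (100.0, 10.0, 0.0)],
]
score = mpjpe(predicted, reference)
print(f"MPJPE: {score:.2f} mm")
```

Position error alone does not capture whether a generated sign is intelligible, so a production benchmark would normally pair a geometric measure like this with metrics that assess semantic correctness.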

Critical Analysis

The SignAvatars dataset and associated research represent a significant advancement in the field of sign language technology. By providing a large-scale, high-quality 3D dataset and an automated annotation pipeline, the researchers have addressed a major challenge in the development of sign language-based digital tools.

However, the paper does not discuss the potential biases or limitations of the dataset, such as the diversity of the signers, the geographical or linguistic coverage of the sign language, or the accuracy of the automated annotations. Additionally, the researchers do not address the ethical considerations of using such a dataset, such as privacy concerns or the potential for misuse.

Further research is needed to thoroughly evaluate the dataset's quality and explore ways to make it more inclusive and representative of the broader Deaf community. Collaboration with Deaf organizations and sign language experts would be valuable to ensure that the dataset and associated technologies truly meet the needs of Deaf users.

Conclusion

The SignAvatars dataset and research represent an important step towards making digital technologies more accessible and inclusive for Deaf and hard-of-hearing individuals. By providing a large-scale, high-quality 3D dataset of sign language content and an automated annotation pipeline, the researchers have laid the groundwork for the development of advanced sign language recognition, production, and translation systems.

While there are still areas for improvement and further research, this work could have a significant impact on the Deaf community and the broader field of digital accessibility. By helping to bridge the communication gap, SignAvatars supports technologies that let Deaf and hard-of-hearing users engage with the digital world on their own terms.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

SignAvatar: Sign Language 3D Motion Reconstruction and Generation

Lu Dong, Lipisha Chaudhary, Fei Xu, Xiao Wang, Mason Lary, Ifeoma Nwogu

Achieving expressive 3D motion reconstruction and automatic generation for isolated sign words can be challenging, due to the lack of real-world 3D sign-word data, the complex nuances of signing motions, and the cross-modal understanding of sign language semantics. To address these challenges, we introduce SignAvatar, a framework capable of both word-level sign language reconstruction and generation. SignAvatar employs a transformer-based conditional variational autoencoder architecture, effectively establishing relationships across different semantic modalities. Additionally, this approach incorporates a curriculum learning strategy to enhance the model's robustness and generalization, resulting in more realistic motions. Furthermore, we contribute the ASL3DWord dataset, composed of 3D joint rotation data for the body, hands, and face, for unique sign words. We demonstrate the effectiveness of SignAvatar through extensive experiments, showcasing its superior reconstruction and automatic generation capabilities. The code and dataset are available on the project page.

5/14/2024

cs.CV

Neural Sign Actors: A diffusion model for 3D sign language production from text

Vasileios Baltatzis, Rolandos Alexandros Potamias, Evangelos Ververas, Guanxiong Sun, Jiankang Deng, Stefanos Zafeiriou

Sign Languages (SL) serve as the primary mode of communication for the Deaf and Hard of Hearing communities. Deep learning methods for SL recognition and translation have achieved promising results. However, Sign Language Production (SLP) poses a challenge as the generated motions must be realistic and have precise semantic meaning. Most SLP methods rely on 2D data, which hinders their realism. In this work, a diffusion-based SLP model is trained on a curated large-scale dataset of 4D signing avatars and their corresponding text transcripts. The proposed method can generate dynamic sequences of 3D avatars from an unconstrained domain of discourse using a diffusion process formed on a novel and anatomically informed graph neural network defined on the SMPL-X body skeleton. Through quantitative and qualitative experiments, we show that the proposed method considerably outperforms previous methods of SLP. This work makes an important step towards realistic neural sign avatars, bridging the communication gap between Deaf and hearing communities.

4/8/2024

cs.CV

SignLLM: Sign Languages Production Large Language Models

Sen Fang, Lei Wang, Ce Zheng, Yapeng Tian, Chen Chen

In this paper, we introduce the first comprehensive multilingual sign language dataset named Prompt2Sign, which builds from public data including American Sign Language (ASL) and seven others. Our dataset transforms a vast array of videos into a streamlined, model-friendly format, optimized for training with translation models like seq2seq and text2text. Building on this new dataset, we propose SignLLM, the first multilingual Sign Language Production (SLP) model, which includes two novel multilingual SLP modes that allow for the generation of sign language gestures from input text or prompt. Both of the modes can use a new loss and a module based on reinforcement learning, which accelerates the training by enhancing the model's capability to autonomously sample high-quality data. We present benchmark results of SignLLM, which demonstrate that our model achieves state-of-the-art performance on SLP tasks across eight sign languages.

5/20/2024

cs.CV cs.CL

SkelCap: Automated Generation of Descriptive Text from Skeleton Keypoint Sequences

Ali Emre Keskin, Hacer Yalim Keles

Numerous sign language datasets exist, yet they typically cover only a limited selection of the thousands of signs used globally. Moreover, creating diverse sign language datasets is an expensive and challenging task due to the costs associated with gathering a varied group of signers. Motivated by these challenges, we aimed to develop a solution that addresses these limitations. In this context, we focused on textually describing body movements from skeleton keypoint sequences, leading to the creation of a new dataset. We structured this dataset around AUTSL, a comprehensive isolated Turkish sign language dataset. We also developed a baseline model, SkelCap, which can generate textual descriptions of body movements. This model processes the skeleton keypoints data as a vector, applies a fully connected layer for embedding, and utilizes a transformer neural network for sequence-to-sequence modeling. We conducted extensive evaluations of our model, including signer-agnostic and sign-agnostic assessments. The model achieved promising results, with a ROUGE-L score of 0.98 and a BLEU-4 score of 0.94 in the signer-agnostic evaluation. The dataset we have prepared, namely the AUTSL-SkelCap, will be made publicly available soon.

5/7/2024

cs.CV cs.LG