YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus

Read original: arXiv:2407.11144 - Published 7/17/2024 by Garrett Tanzer, Biao Zhang


Overview

The paper presents the YouTube-SL-25 corpus, a large-scale, open-domain multilingual sign language parallel corpus.
The corpus contains more than 3,000 hours of sign language video across more than 25 sign languages, paired with captions drawn from YouTube.
The authors aim to advance sign language processing and translation by providing this large and diverse dataset.

Plain English Explanation

The researchers have created a new dataset called the YouTube-SL-25 corpus that could help improve how computers understand and translate sign languages. Sign languages are the primary means of communication for many deaf and hard-of-hearing individuals around the world.

The YouTube-SL-25 corpus contains more than 3,000 hours of sign language video, covering more than 25 sign languages. The videos are paired with captions, which are written text in a spoken language. This pairing lets machine learning models learn to translate from sign language into written text.

Having a large, diverse dataset like this is important for developing advanced sign language processing and translation technologies. Previous datasets have been smaller and covered fewer sign languages. The YouTube-SL-25 corpus provides a much richer resource for researchers and engineers working on sign language AI systems.

By making this dataset publicly available, the authors hope to accelerate progress in making technology more accessible and inclusive for deaf and hard-of-hearing communities worldwide. This could lead to improved communication tools, educational resources, and entertainment options for sign language users.

Technical Explanation

YouTube-SL-25 is a large-scale, open-domain multilingual parallel dataset: more than 3,000 hours of sign language video with captions, spanning more than 25 sign languages.

The corpus was constructed by collecting sign language videos from YouTube and filtering them for captions that appear well aligned with the signing. The authors used a combination of automatic and manual techniques to select the videos and their video-caption pairs.
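To make the idea of a parallel corpus concrete, the sketch below turns timed caption cues into (video span, text) pairs and drops cues that are empty or implausibly short or long. It is only an illustration: the class names, the `cues_to_pairs` function, and the duration thresholds are assumptions made for this example, not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CaptionCue:
    """One timed caption line (hypothetical structure)."""
    start_s: float
    end_s: float
    text: str


@dataclass(frozen=True)
class ClipPair:
    """A video time span paired with its caption text."""
    video_id: str
    start_s: float
    end_s: float
    text: str


def cues_to_pairs(video_id: str, cues: List[CaptionCue],
                  min_len_s: float = 0.5,
                  max_len_s: float = 30.0) -> List[ClipPair]:
    """Keep cues with non-empty text and a plausible duration.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    pairs = []
    for cue in cues:
        text = " ".join(cue.text.split())  # collapse stray whitespace
        duration = cue.end_s - cue.start_s
        if not text or duration < min_len_s or duration > max_len_s:
            continue
        pairs.append(ClipPair(video_id, cue.start_s, cue.end_s, text))
    return pairs


cues = [
    CaptionCue(0.0, 2.5, " Hello   world "),
    CaptionCue(2.5, 2.6, "x"),    # too short to be a usable clip
    CaptionCue(3.0, 5.0, "   "),  # empty after whitespace cleanup
]
pairs = cues_to_pairs("example_video", cues)
```

In this toy input only the first cue survives. A real pipeline would also need to check that the captions actually correspond to the signing, which a timing filter alone cannot establish.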

Key features of the YouTube-SL-25 corpus include:

Breadth: Covers more than 25 sign languages, and is the first or largest parallel dataset for many of them.
Scale: More than 3,000 hours of video, over 3x the size of YouTube-ASL and the largest parallel sign language dataset to date.
Diversity: Open-domain, so it covers a wide range of topics rather than a single narrow domain.
Alignment: The videos come with captions that the authors describe as "seemingly well-aligned", a hedge that signals some residual noise.

The authors provide baselines for sign-to-text tasks using a unified multilingual multitask model based on T5, and report scores on benchmarks across 4 sign languages. The results indicate that multilingual transfer benefits both higher- and lower-resource sign languages within YouTube-SL-25.
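One common way to train a single text-to-text model on several languages and tasks is to prefix each example with a short tag naming the task and the languages involved. The sketch below illustrates that general idea only; the tag format, the task name, the language codes, and the `build_prompt` function are assumptions for this example and are not taken from the paper.

```python
def build_prompt(task: str, sign_language: str, target_language: str) -> str:
    """Build a text prefix telling a multitask model what to produce.

    The tag layout is a hypothetical convention chosen for illustration.
    """
    return f"<task:{task}> <sign:{sign_language}> <target:{target_language}>"


# Hypothetical examples: the same task routed to different language pairs.
prompts = [
    build_prompt("translate", "ase", "en"),  # American Sign Language to English
    build_prompt("translate", "bfi", "en"),  # British Sign Language to English
]
```

With a prefix like this, one model can share parameters across languages, which is the setting in which transfer between higher- and lower-resource sign languages can occur.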

Critical Analysis

The YouTube-SL-25 corpus represents a significant advancement in sign language dataset creation and a valuable contribution to the field of sign language processing. By providing a much larger and more diverse dataset than previously available, the authors have enabled new opportunities for developing more robust and generalizable sign language technologies.

However, the paper does acknowledge several limitations and areas for future work:

Annotation Quality: While the authors used a combination of automatic and manual techniques to annotate the dataset, there may still be some errors or inconsistencies in the annotations that could impact model performance.
Representativeness: The corpus is limited to sign language videos found on YouTube, which may not fully represent the diversity of sign language usage and styles across different communities and contexts.
Ethical Concerns: The use of YouTube videos raises potential privacy and consent issues that should be carefully considered when using the dataset for research and development.

Additionally, while the authors demonstrate the corpus's utility for sign-to-text tasks, other applications, such as sign language production, could be explored in future work.

Overall, the YouTube-SL-25 corpus is a significant step forward in providing the research community with a large-scale, multilingual sign language dataset. However, continued efforts to address the dataset's limitations and explore a wider range of applications will be necessary to fully realize its potential.

Conclusion

The YouTube-SL-25 corpus represents a major advancement in the field of sign language processing by providing a large-scale, open-domain multilingual dataset of captioned sign language video. This resource has the potential to enable significant progress in developing more accurate and inclusive sign language translation systems, ultimately improving communication and accessibility for deaf and hard-of-hearing communities worldwide.

While the dataset has some limitations, the authors have taken an important step in scaling up sign language data collection. Continued research and development in this area, building on datasets like the Hong Kong Sign Language Corpus and the iSign benchmark, will be crucial for realizing the full potential of sign language technology and making it more accessible and useful for end-users.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus

Garrett Tanzer, Biao Zhang

Even for better-studied sign languages like American Sign Language (ASL), data is the bottleneck for machine learning research. The situation is worse yet for the many other sign languages used by Deaf/Hard of Hearing communities around the world. In this paper, we present YouTube-SL-25, a large-scale, open-domain multilingual corpus of sign language videos with seemingly well-aligned captions drawn from YouTube. With >3000 hours of videos across >25 sign languages, YouTube-SL-25 is a) >3x the size of YouTube-ASL, b) the largest parallel sign language dataset to date, and c) the first or largest parallel dataset for many of its component languages. We provide baselines for sign-to-text tasks using a unified multilingual multitask model based on T5 and report scores on benchmarks across 4 sign languages. The results demonstrate that multilingual transfer benefits both higher- and lower-resource sign languages within YouTube-SL-25.

7/17/2024

Scaling Sign Language Translation

Biao Zhang, Garrett Tanzer, Orhan Firat

Sign language translation (SLT) addresses the problem of translating information from a sign language in video to a spoken language in text. Existing studies, while showing progress, are often limited to narrow domains and/or few sign languages and struggle with open-domain tasks. In this paper, we push forward the frontier of SLT by scaling pretraining data, model size, and number of translation directions. We perform large-scale SLT pretraining on different data including 1) noisy multilingual YouTube SLT data, 2) parallel text corpora, and 3) SLT data augmented by translating video captions to other languages with off-the-shelf machine translation models. We unify different pretraining tasks with task-specific prompts under the encoder-decoder architecture, and initialize the SLT model with pretrained (m/By)T5 models across model sizes. SLT pretraining results on How2Sign and FLEURS-ASL#0 (ASL to 42 spoken languages) demonstrate the significance of data/model scaling and cross-lingual cross-modal transfer, as well as the feasibility of zero-shot SLT. We finetune the pretrained SLT models on 5 downstream open-domain SLT benchmarks covering 5 sign languages. Experiments show substantial quality improvements over the vanilla baselines, surpassing the previous state-of-the-art (SOTA) by wide margins.

7/17/2024

iSign: A Benchmark for Indian Sign Language Processing

Abhinav Joshi, Romit Mohanty, Mounika Kanakanti, Andesha Mangla, Sudeep Choudhary, Monali Barbate, Ashutosh Modi

Indian Sign Language has limited resources for developing machine learning and data-driven approaches for automated language processing. Though text/audio-based language processing techniques have shown colossal research interest and tremendous improvements in the last few years, Sign Languages still need to catch up due to the need for more resources. To bridge this gap, in this work, we propose iSign: a benchmark for Indian Sign Language (ISL) Processing. We make three primary contributions to this work. First, we release one of the largest ISL-English datasets with more than 118K video-sentence/phrase pairs. To the best of our knowledge, it is the largest sign language dataset available for ISL. Second, we propose multiple NLP-specific tasks (including SignVideo2Text, SignPose2Text, Text2Pose, Word Prediction, and Sign Semantics) and benchmark them with the baseline models for easier access to the research community. Third, we provide detailed insights into the proposed benchmarks with a few linguistic insights into the workings of ISL. We streamline the evaluation of Sign Language processing, addressing the gaps in the NLP research community for Sign Languages. We release the dataset, tasks, and models via the following website: https://exploration-lab.github.io/iSign/

7/9/2024

SignLLM: Sign Languages Production Large Language Models

Sen Fang, Lei Wang, Ce Zheng, Yapeng Tian, Chen Chen

In this paper, we introduce the first comprehensive multilingual sign language dataset named Prompt2Sign, which builds from public data including American Sign Language (ASL) and seven others. Our dataset transforms a vast array of videos into a streamlined, model-friendly format, optimized for training with translation models like seq2seq and text2text. Building on this new dataset, we propose SignLLM, the first multilingual Sign Language Production (SLP) model, which includes two novel multilingual SLP modes that allow for the generation of sign language gestures from input text or prompt. Both of the modes can use a new loss and a module based on reinforcement learning, which accelerates the training by enhancing the model's capability to autonomously sample high-quality data. We present benchmark results of SignLLM, which demonstrate that our model achieves state-of-the-art performance on SLP tasks across eight sign languages.

5/20/2024