A Transformer-Based Multi-Stream Approach for Isolated Iranian Sign Language Recognition

Read original: arXiv:2407.09544 - Published 7/16/2024 by Ali Ghadami, Alireza Taheri, Ali Meghdari

Overview

This research focuses on developing a sign language recognition system to enable better communication for the deaf and hard of hearing community.
The system aims to recognize Iranian Sign Language words using deep learning techniques like transformers.
The dataset includes 101 frequently used Iranian Sign Language words, and the model combines early fusion and late fusion transformer encoder-based networks optimized with a genetic algorithm.
The model uses hand and lip keypoints, as well as the distance and angle between hands, as features, and also employs multi-task learning with word embeddings.
The model achieved 90.2% accuracy on the test data and was also tested on sentence-level translation using a windowing technique.
A sign language training software with real-time feedback was developed using the trained model.

Plain English Explanation

Sign language is the primary means of communication for millions of people around the world who are deaf or hard of hearing. However, most communication tools are designed for spoken and written languages, which can create difficulties for the deaf community.

To help bridge this gap, researchers have developed a sign language recognition system that can interpret Iranian Sign Language words using the latest deep learning techniques, such as transformers. This system allows people who use sign language to better communicate with their surroundings and access services more effectively.

The researchers used a dataset of 101 commonly used Iranian Sign Language words, often used in academic settings like universities. They trained a specialized neural network that combines early fusion and late fusion transformer encoder-based models, optimized using a genetic algorithm.
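The summary does not say what the genetic algorithm optimizes, so the following is only a minimal sketch of the general technique (selection, crossover, mutation). It assumes a small search space of network hyperparameters. The names `SEARCH_SPACE`, `toy_fitness`, and `evolve` are hypothetical, and `toy_fitness` is a stand-in for a real validation score rather than anything from the paper.

```python
import random

# Hypothetical search space; the paper's actual search space is not given here.
SEARCH_SPACE = {
    "num_layers": [1, 2, 3, 4],
    "num_heads": [2, 4, 8],
    "hidden_size": [64, 128, 256],
}


def random_individual(rng):
    """Draw one candidate configuration at random."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}


def crossover(parent_a, parent_b, rng):
    """Build a child by taking each gene from one of the two parents."""
    return {name: rng.choice([parent_a[name], parent_b[name]]) for name in SEARCH_SPACE}


def mutate(individual, rng, rate=0.2):
    """Resample each gene with probability `rate`."""
    return {
        name: (rng.choice(SEARCH_SPACE[name]) if rng.random() < rate else value)
        for name, value in individual.items()
    }


def evolve(fitness, generations=10, population_size=8, seed=0):
    """Run a simple genetic algorithm and return the fittest individual found."""
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: population_size // 2]  # the fitter half survives
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)


def toy_fitness(individual):
    """Stand-in for validation accuracy (assumption); highest at 2 layers, 4 heads, 128 units."""
    return -(
        abs(individual["num_layers"] - 2)
        + abs(individual["num_heads"] - 4) / 2
        + abs(individual["hidden_size"] - 128) / 64
    )


best = evolve(toy_fitness)
```

In a real setting, the fitness function would train and validate a network for each candidate, which is far more expensive than this toy score.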

The network uses features extracted from the sign language videos, including the position of the hands and lips, as well as the distance and angle between the hands. Additionally, the researchers used the word embeddings (numerical representations of the words) as a secondary task during training, which helped the model learn more efficiently.
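As a rough illustration of the distance and angle features, the sketch below computes both from two sets of hand keypoints. It assumes 2D `(x, y)` keypoints and uses the mean of each hand's keypoints as its position. The summary does not specify either choice, and the function names are hypothetical.

```python
import math


def hand_center(keypoints):
    """Mean (x, y) position of one hand's keypoints (assumed definition of hand position)."""
    xs = [point[0] for point in keypoints]
    ys = [point[1] for point in keypoints]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def hand_distance_and_angle(left_hand, right_hand):
    """Return the Euclidean distance and the angle (radians) from the left to the right hand."""
    left_x, left_y = hand_center(left_hand)
    right_x, right_y = hand_center(right_hand)
    dx, dy = right_x - left_x, right_y - left_y
    distance = math.hypot(dx, dy)
    angle = math.atan2(dy, dx)  # angle of the left-to-right vector against the x-axis
    return distance, angle


# Toy keypoints: the left hand is centered at (1, 0), the right hand at (4, 3).
left = [(0.0, 0.0), (2.0, 0.0)]
right = [(3.0, 2.0), (5.0, 4.0)]
distance, angle = hand_distance_and_angle(left, right)
```

Computed per video frame, these two values would form an extra feature stream alongside the raw keypoints.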

The model achieved 90.2% accuracy on the test data. The researchers also tested it on sentences built from the words in the dataset, using a technique called windowing to translate them.

Finally, the researchers built sign language training software that uses the trained model to give users real-time feedback. They also ran a survey on the effectiveness and efficiency of this kind of learning software and on the impact of feedback.

Technical Explanation

The researchers aimed to develop a recognition system for isolated Iranian Sign Language words. They used a dataset of 101 words frequently used in academic environments, such as universities.

The neural network architecture employed a combination of early fusion and late fusion transformer encoder-based models, which were optimized using a genetic algorithm. The input features to the network included the 2D keypoints of the hands and lips, as well as the distance and angle between the hands, extracted from the sign language videos.
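The difference between the two fusion styles can be sketched without any neural network code. In this minimal illustration, early fusion concatenates the per-frame features of all streams before a shared encoder would see them, and late fusion combines the class probabilities from separate per-stream encoders. Averaging the probabilities is an assumption, as are the toy inputs. How the paper actually combines the two is not described in this summary.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def early_fusion(streams):
    """Concatenate the per-frame feature vectors of all streams into one sequence."""
    num_frames = len(streams[0])
    return [
        [value for stream in streams for value in stream[frame]]
        for frame in range(num_frames)
    ]


def late_fusion(per_stream_scores):
    """Average the class probabilities predicted independently for each stream."""
    probabilities = [softmax(scores) for scores in per_stream_scores]
    num_classes = len(probabilities[0])
    return [
        sum(stream[c] for stream in probabilities) / len(probabilities)
        for c in range(num_classes)
    ]


# Early fusion: two frames, a hand stream with 2 features and a lip stream with 1.
hands = [[0.1, 0.2], [0.3, 0.4]]
lips = [[0.9], [0.8]]
fused_sequence = early_fusion([hands, lips])

# Late fusion: hypothetical scores for 3 classes from two separate encoders.
fused_probs = late_fusion([[2.0, 0.5, 0.1], [1.5, 1.0, 0.2]])
predicted_class = max(range(len(fused_probs)), key=fused_probs.__getitem__)
```

Early fusion lets a single encoder model interactions between streams at every frame, while late fusion keeps each stream's encoder independent and merges only the final decisions.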

In addition to the primary task of classifying the signed word, the researchers used the word's embedding vector (a numerical representation of the word) as a second training target in a multi-task learning setup. According to the authors, this auxiliary task makes training smoother and more efficient.
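One common way to set up such a multi-task objective is a weighted sum of a classification loss and an embedding regression loss. The sketch below assumes cross-entropy for the word class, mean squared error for the embedding, and a hypothetical weight `embedding_weight`. The summary does not state which losses or weights the authors used.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-probability of the target class under a softmax over `logits`."""
    highest = max(logits)
    log_sum_exp = highest + math.log(sum(math.exp(x - highest) for x in logits))
    return log_sum_exp - logits[target_index]


def mean_squared_error(predicted, target):
    """Mean squared difference between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def multi_task_loss(logits, target_index, predicted_embedding, target_embedding,
                    embedding_weight=0.5):
    """Classification loss plus a weighted embedding regression loss (assumed form)."""
    classification = cross_entropy(logits, target_index)
    embedding = mean_squared_error(predicted_embedding, target_embedding)
    return classification + embedding_weight * embedding


# Toy example: two classes with equal scores, and an embedding that is off in one dimension.
loss = multi_task_loss(
    logits=[0.0, 0.0],
    target_index=0,
    predicted_embedding=[1.0, 0.0],
    target_embedding=[0.0, 0.0],
)
```

Here the classification term is ln 2 and the embedding term is 0.5, so with the default weight the combined loss is ln 2 + 0.25.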

To evaluate sentence-level translation, the researchers generated sentences from the words in the dataset and applied a windowing technique, dividing each input sentence into smaller segments and predicting the word in each segment. The reported 90.2% accuracy is the model's accuracy on the test data.
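A windowing scheme of this kind can be sketched as a sliding window over the frame sequence, with an isolated-word classifier applied to each window. The window size, stride, confidence threshold, and the merging of repeated predictions below are all assumptions, and `toy_classifier` is a hypothetical stand-in for the trained model.

```python
def sliding_windows(frames, window_size, stride):
    """Yield fixed-length, possibly overlapping windows over the frame sequence."""
    for start in range(0, len(frames) - window_size + 1, stride):
        yield frames[start:start + window_size]


def translate_sentence(frames, classify, window_size, stride, min_confidence=0.6):
    """Classify each window and merge consecutive identical predictions into one word."""
    words = []
    for window in sliding_windows(frames, window_size, stride):
        word, confidence = classify(window)
        if confidence < min_confidence:
            continue  # skip ambiguous windows, e.g. transitions between two signs
        if not words or words[-1] != word:
            words.append(word)
    return words


def toy_classifier(window):
    """Stand-in classifier: each toy frame already carries its label; return the majority."""
    label = max(set(window), key=window.count)
    return label, window.count(label) / len(window)


# Toy "video": six frames of one sign followed by six frames of another.
frames = ["hello"] * 6 + ["teacher"] * 6
sentence = translate_sentence(frames, toy_classifier, window_size=4, stride=2)
```

Merging consecutive identical predictions means a sentence that repeats the same sign twice in a row would collapse to one word, which is one limitation of this simple scheme.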

The researchers also built sign language training software on top of the trained model. It gives users real-time feedback on their signing, and a survey examined the effectiveness and efficiency of this kind of software and the impact of the feedback.

Critical Analysis

The researchers have developed a promising sign language recognition system that can significantly improve communication and accessibility for the deaf and hard of hearing community. The use of transformers and the combination of early fusion and late fusion architectures demonstrate a thoughtful and innovative approach to the problem.

However, the researchers could have addressed some potential limitations and areas for further research. For example, the dataset size of 101 words, while useful for demonstrating the system's capabilities, may not be sufficient to cover the full breadth of sign language vocabulary used in real-world scenarios. Expanding the dataset and testing the system's performance on a larger, more diverse set of sign language words could provide valuable insights.

Additionally, although the model was tested on sentences through windowing, it is trained on isolated signs. The researchers could have explored continuous sign language recognition more directly. That task is more challenging because it involves recognizing sequences of signs rather than isolated ones, and it better reflects how people naturally communicate in sign language.

Another avenue for further research could be investigating the system's cross-dataset performance and its ability to generalize to different sign language dialects or domains. This would help assess the system's robustness and potential for broader deployment.

Overall, the researchers have made a valuable contribution to the field of sign language recognition by demonstrating the effectiveness of their approach. Addressing the potential limitations and exploring further research directions could lead to even more impactful solutions for the deaf and hard of hearing community.

Conclusion

This research presents a significant advancement in sign language recognition by developing a deep learning-based system that can accurately interpret Iranian Sign Language words. By combining transformer-based architectures and leveraging features like hand and lip keypoints, the researchers have created a model that achieves 90.2% accuracy on a test dataset.

The development of a sign language training software that provides real-time feedback further enhances the practical value of this work, as it can help deaf and hard of hearing individuals improve their communication skills and better integrate with their communities. This research represents an important step towards bridging the gap between the deaf and hearing worlds, promoting accessibility and inclusivity.

Future research directions could explore expanding the dataset, investigating continuous sign language recognition, and assessing the system's cross-dataset performance to further strengthen the capabilities and real-world applicability of sign language recognition systems. Overall, this work showcases the potential of advanced deep learning techniques to empower the deaf and hard of hearing community and foster more inclusive communication environments.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Transformer-Based Multi-Stream Approach for Isolated Iranian Sign Language Recognition

Ali Ghadami, Alireza Taheri, Ali Meghdari

Sign language is an essential means of communication for millions of people around the world and serves as their primary language. However, most communication tools are developed for spoken and written languages which can cause problems and difficulties for the deaf and hard of hearing community. By developing a sign language recognition system, we can bridge this communication gap and enable people who use sign language as their main form of expression to better communicate with people and their surroundings. This recognition system increases the quality of health services, improves public services, and creates equal opportunities for the deaf community. This research aims to recognize Iranian Sign Language words with the help of the latest deep learning tools such as transformers. The dataset used includes 101 Iranian Sign Language words frequently used in academic environments such as universities. The network used is a combination of early fusion and late fusion transformer encoder-based networks optimized with the help of genetic algorithm. The selected features to train this network include hands and lips key points, and the distance and angle between hands extracted from the sign videos. Also, in addition to the training model for the classes, the embedding vectors of words are used as multi-task learning to have smoother and more efficient training. This model was also tested on sentences generated from our word dataset using a windowing technique for sentence translation. Finally, the sign language training software that provides real-time feedback to users with the help of the developed model, which has 90.2% accuracy on test data, was introduced, and in a survey, the effectiveness and efficiency of this type of sign language learning software and the impact of feedback were investigated.

7/16/2024

SignMusketeers: An Efficient Multi-Stream Approach for Sign Language Translation at Scale

Shester Gueuwou, Xiaodan Du, Greg Shakhnarovich, Karen Livescu

A persistent challenge in sign language video processing, including the task of sign language to written language translation, is how we learn representations of sign language in an effective and efficient way that can preserve the important attributes of these languages, while remaining invariant to irrelevant visual differences. Informed by the nature and linguistics of signed languages, our proposed method focuses on just the most relevant parts in a signing video: the face, hands and body posture of the signer. However, instead of using pose estimation coordinates from off-the-shelf pose tracking models, which have inconsistent performance for hands and faces, we propose to learn the complex handshapes and rich facial expressions of sign languages in a self-supervised fashion. Our approach is based on learning from individual frames (rather than video sequences) and is therefore much more efficient than prior work on sign language pre-training. Compared to a recent model that established a new state of the art in sign language translation on the How2Sign dataset, our approach yields similar translation performance, using less than 3% of the compute.

6/12/2024

PenSLR: Persian end-to-end Sign Language Recognition Using Ensembling

Amirparsa Salmankhah, Amirreza Rajabi, Negin Kheirmand, Ali Fadaeimanesh, Amirreza Tarabkhah, Amirreza Kazemzadeh, Hamed Farbeh

Sign Language Recognition (SLR) is a fast-growing field that aims to fill the communication gaps between the hearing-impaired and people without hearing loss. Existing solutions for Persian Sign Language (PSL) are limited to word-level interpretations, underscoring the need for more advanced and comprehensive solutions. Moreover, previous work on other languages mainly focuses on manipulating the neural network architectures or hardware configurations instead of benefiting from the aggregated results of multiple models. In this paper, we introduce PenSLR, a glove-based sign language system consisting of an Inertial Measurement Unit (IMU) and five flexible sensors powered by a deep learning framework capable of predicting variable-length sequences. We achieve this in an end-to-end manner by leveraging the Connectionist Temporal Classification (CTC) loss function, eliminating the need for segmentation of input signals. To further enhance its capabilities, we propose a novel ensembling technique by leveraging a multiple sequence alignment algorithm known as Star Alignment. Furthermore, we introduce a new PSL dataset, including 16 PSL signs with more than 3000 time-series samples in total. We utilize this dataset to evaluate the performance of our system based on four word-level and sentence-level metrics. Our evaluations show that PenSLR achieves a remarkable word accuracy of 94.58% and 96.70% in subject-independent and subject-dependent setups, respectively. These achievements are attributable to our ensembling algorithm, which not only boosts the word-level performance by 0.51% and 1.32% in the respective scenarios but also yields significant enhancements of 1.46% and 4.00%, respectively, in sentence-level accuracy.

6/26/2024

From Rule-Based Models to Deep Learning Transformers Architectures for Natural Language Processing and Sign Language Translation Systems: Survey, Taxonomy and Performance Evaluation

Nada Shahin, Leila Ismail

With the growing Deaf and Hard of Hearing population worldwide and the persistent shortage of certified sign language interpreters, there is a pressing need for an efficient, signs-driven, integrated end-to-end translation system, from sign to gloss to text and vice-versa. There has been a wealth of research on machine translations and related reviews. However, there are few works on sign language machine translation considering the particularity of the language being continuous and dynamic. This paper aims to address this void, providing a retrospective analysis of the temporal evolution of sign language machine translation algorithms and a taxonomy of the Transformers architectures, the most used approach in language translation. We also present the requirements of a real-time Quality-of-Service sign language machine translation system underpinned by accurate deep learning algorithms. We propose future research directions for sign language translation systems.

8/28/2024