Bengali Sign Language Recognition through Hand Pose Estimation using Multi-Branch Spatial-Temporal Attention Model

Read original: arXiv:2408.14111 - Published 8/27/2024 by Abu Saleh Musa Miah, Md. Al Mehedi Hasan, Md Hadiuzzaman, Muhammad Nazrul Islam, Jungpil Shin

Overview

This paper presents a novel multi-branch spatial-temporal attention model for recognizing Bengali sign language.
The model uses hand pose estimation to capture the spatial and temporal dynamics of sign language gestures.
The researchers evaluate their approach on a newly proposed dataset and two benchmark Bengali Sign Language (BSL) datasets.

Plain English Explanation

The paper explores a new way to recognize Bengali sign language using artificial intelligence (AI) techniques. The key idea is to focus on the position and movements of the hands when signing.

The researchers developed a multi-branch neural network that can analyze both the spatial (position) and temporal (movement) aspects of sign language gestures. This allows the model to better understand the complex dynamics of sign language compared to simpler approaches.

The network learns from sequences of images of Bengali signs. Hand joint positions (a skeleton of the hand) are first extracted from each image, and the model then uses these positions to recognize which sign is being made. Working from skeletons rather than raw images also helps preserve privacy and keeps the computational cost low.
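To make the idea of "position" and "movement" concrete, here is a minimal sketch in Python. It is not the authors' code: the function names, the choice of joint 0 as the wrist, and the toy coordinates are all assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]   # (x, y) image coordinates of one hand joint
Frame = List[Point]           # all joints of one hand in a single frame


def wrist_relative(frame: Frame) -> Frame:
    """Spatial cue: express each joint relative to joint 0 (assumed to be the wrist)."""
    wx, wy = frame[0]
    return [(x - wx, y - wy) for x, y in frame]


def displacements(sequence: List[Frame]) -> List[Frame]:
    """Temporal cue: per-joint movement between consecutive frames."""
    return [
        [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(prev, curr)]
        for prev, curr in zip(sequence, sequence[1:])
    ]


# Toy sequence: 2 frames, 3 joints (real hand skeletons have more joints).
sequence = [
    [(10.0, 10.0), (12.0, 14.0), (15.0, 18.0)],
    [(11.0, 10.0), (14.0, 14.0), (18.0, 18.0)],
]
spatial = [wrist_relative(frame) for frame in sequence]
temporal = displacements(sequence)
```

Here `spatial` describes the hand shape in each frame independently of where the hand is in the image, while `temporal` describes how each joint moved from one frame to the next.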

The results indicate that this hand pose-based approach is competitive with previous Bengali sign language recognition systems while running faster and at a much lower computational cost. This suggests the multi-branch spatial-temporal attention model is an effective way to tackle the challenge of automatically understanding sign language.

Technical Explanation

The paper introduces a Multi-Branch Spatial-Temporal Attention Model (MBSTAM) for recognizing Bengali sign language. The key innovation is the use of hand pose estimation to capture both the spatial and temporal dynamics of sign language gestures.

The MBSTAM architecture consists of:

A spatial branch that models the hand pose in each frame
A temporal branch that models the hand movement over time
An attention mechanism that learns to focus on the most important spatial and temporal features
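The three components listed above can be sketched roughly as follows. This is an illustrative simplification, not the paper's architecture: it uses single-head scaled dot-product self-attention with identity query/key/value projections (real models learn these projections), and the concatenation-based `fuse` step is an assumed way of combining the branches.

```python
import math
from typing import List

Vector = List[float]
Sequence = List[List[Vector]]   # seq[t][j] = feature vector of joint j at frame t


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with identity projections (simplified)."""
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out


def spatial_branch(seq: Sequence) -> Sequence:
    """Attend across the joints within each frame."""
    return [self_attention(frame) for frame in seq]


def temporal_branch(seq: Sequence) -> Sequence:
    """Attend across frames separately for each joint."""
    n_frames, n_joints = len(seq), len(seq[0])
    per_joint = [self_attention([seq[t][j] for t in range(n_frames)]) for j in range(n_joints)]
    # Transpose back to [frame][joint] order.
    return [[per_joint[j][t] for j in range(n_joints)] for t in range(n_frames)]


def fuse(a: Sequence, b: Sequence) -> Sequence:
    """Concatenate the two branches' features per joint (an assumed fusion rule)."""
    return [[fa + fb for fa, fb in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Toy input: 3 frames, 2 joints, 2 features per joint.
seq = [
    [[0.0, 1.0], [1.0, 0.0]],
    [[0.5, 1.0], [1.0, 0.5]],
    [[1.0, 1.0], [1.0, 1.0]],
]
fused = fuse(spatial_branch(seq), temporal_branch(seq))
```

In this sketch each joint ends up with twice as many features: the first half summarizes how it relates to the other joints in the same frame, and the second half summarizes how it relates to itself at other points in time.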

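The model operates on hand joint skeletons extracted from image sequences and, according to the paper's abstract, combines a Separable TCN with multi-head spatial-temporal attention. It is evaluated on a newly proposed dataset and two benchmark BSL datasets, in both intra- and inter-dataset settings.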

The experiments indicate that the MBSTAM achieves competitive accuracy for Bengali sign language recognition while requiring very low computational cost and running faster than existing models. The researchers attribute this to the model's ability to capture the spatial-temporal patterns in sign language gestures.
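The paper's abstract names a Separable TCN as one component of the model. As a rough illustration of why separable convolutions tend to be cheap, the sketch below assumes that "separable" refers to the usual depthwise-separable factorization (a per-channel temporal filter followed by a 1x1 channel-mixing step). The function names and toy values are hypothetical, and the layer sizes are not taken from the paper.

```python
from typing import List

Signal = List[List[float]]   # x[c][t] = value of channel c at time step t


def standard_conv_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in an ordinary 1-D convolution (biases ignored)."""
    return c_in * c_out * kernel


def separable_conv_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a depthwise (c_in * kernel) plus pointwise (c_in * c_out) pair."""
    return c_in * kernel + c_in * c_out


def depthwise_temporal_conv(x: Signal, kernels: List[List[float]]) -> Signal:
    """Filter each channel over time with its own kernel (no padding, stride 1)."""
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        out.append([
            sum(channel[t + i] * kernel[i] for i in range(k))
            for t in range(len(channel) - k + 1)
        ])
    return out


def pointwise_conv(x: Signal, weights: List[List[float]]) -> Signal:
    """Mix channels at each time step; weights[o][c] maps input channel c to output o."""
    n_steps = len(x[0])
    return [
        [sum(w * x[c][t] for c, w in enumerate(row)) for t in range(n_steps)]
        for row in weights
    ]


# Toy input: 2 channels, 4 time steps.
x = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
depthwise = depthwise_temporal_conv(x, [[1.0, 0.0, -1.0], [1.0, 1.0, 1.0]])
mixed = pointwise_conv(depthwise, [[1.0, 1.0]])

# Example sizes (assumed): 64 input channels, 64 output channels, kernel width 3.
dense_weights = standard_conv_params(64, 64, 3)
separable_weights = separable_conv_params(64, 64, 3)
```

With these example sizes the separable version needs 4,288 weights against 12,288 for the ordinary convolution, which is the kind of saving that makes such layers attractive on modest hardware.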

Critical Analysis

The paper makes a strong case for the effectiveness of the MBSTAM approach, but a few potential limitations are worth noting:

The datasets used for training and evaluation may not fully represent the diversity of Bengali sign language in the real world.
The authors do not provide much insight into the interpretability of the model's internal representations and decision-making process.
While the results are promising, further research is needed to understand how the MBSTAM approach would generalize to other sign languages or applications.

Overall, this work represents a valuable contribution to the field of sign language recognition, but as with any research, there is room for continued improvement and further investigation.

Conclusion

This paper presents a novel multi-branch neural network model that leverages hand pose estimation to recognize Bengali sign language with competitive accuracy and low computational cost.

The key innovation is the use of a spatial branch to model hand posture and a temporal branch to capture hand movements, combined with an attention mechanism to focus on the most important features. This spatial-temporal approach allows the model to better understand the complex dynamics of sign language compared to previous methods.

The results demonstrate the effectiveness of this approach on a newly proposed dataset and two benchmark BSL datasets. While the paper highlights some potential limitations, the MBSTAM represents an important step forward in the field of sign language recognition and could have significant implications for improving accessibility and communication for the deaf and hard-of-hearing community.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Bengali Sign Language Recognition through Hand Pose Estimation using Multi-Branch Spatial-Temporal Attention Model

Abu Saleh Musa Miah, Md. Al Mehedi Hasan, Md Hadiuzzaman, Muhammad Nazrul Islam, Jungpil Shin

Hand gesture-based sign language recognition (SLR) is one of the most advanced applications of machine learning, and computer vision uses hand gestures. Although, in the past few years, many researchers have widely explored and studied how to address BSL problems, specific unaddressed issues remain, such as skeleton and transformer-based BSL recognition. In addition, the lack of evaluation of the BSL model in various concealed environmental conditions can prove the generalized property of the existing model by facing daily life signs. As a consequence, existing BSL recognition systems provide a limited perspective of their generalisation ability as they are tested on datasets containing few BSL alphabets that have a wide disparity in gestures and are easy to differentiate. To overcome these limitations, we propose a spatial-temporal attention-based BSL recognition model considering hand joint skeletons extracted from the sequence of images. The main aim of utilising hand skeleton-based BSL data is to ensure the privacy and low-resolution sequence of images, which need minimum computational cost and low hardware configurations. Our model captures discriminative structural displacements and short-range dependency based on unified joint features projected onto high-dimensional feature space. Specifically, the use of Separable TCN combined with a powerful multi-head spatial-temporal attention architecture generated high-performance accuracy. The extensive experiments with a proposed dataset and two benchmark BSL datasets with a wide range of evaluations, such as intra- and inter-dataset evaluation settings, demonstrated that our proposed models achieve competitive performance with extremely low computational complexity and run faster than existing models.

8/27/2024

Enhancing Brazilian Sign Language Recognition through Skeleton Image Representation

Carlos Eduardo G. R. Alves, Francisco de Assis Boldt, Thiago M. Paixão

Effective communication is paramount for the inclusion of deaf individuals in society. However, persistent communication barriers due to limited Sign Language (SL) knowledge hinder their full participation. In this context, Sign Language Recognition (SLR) systems have been developed to improve communication between signing and non-signing individuals. In particular, there is the problem of recognizing isolated signs (Isolated Sign Language Recognition, ISLR) of great relevance in the development of vision-based SL search engines, learning tools, and translation systems. This work proposes an ISLR approach where body, hands, and facial landmarks are extracted throughout time and encoded as 2-D images. These images are processed by a convolutional neural network, which maps the visual-temporal information into a sign label. Experimental results demonstrate that our method surpassed the state-of-the-art in terms of performance metrics on two widely recognized datasets in Brazilian Sign Language (LIBRAS), the primary focus of this study. In addition to being more accurate, our method is more time-efficient and easier to train due to its reliance on a simpler network architecture and solely RGB data as input.

5/1/2024

StepNet: Spatial-temporal Part-aware Network for Isolated Sign Language Recognition

Xiaolong Shen, Zhedong Zheng, Yi Yang

The goal of sign language recognition (SLR) is to help those who are hard of hearing or deaf overcome the communication barrier. Most existing approaches can be typically divided into two lines, i.e., Skeleton-based and RGB-based methods, but both the two lines of methods have their limitations. Skeleton-based methods do not consider facial expressions, while RGB-based approaches usually ignore the fine-grained hand structure. To overcome both limitations, we propose a new framework called Spatial-temporal Part-aware network (StepNet), based on RGB parts. As its name suggests, it is made up of two modules: Part-level Spatial Modeling and Part-level Temporal Modeling. Part-level Spatial Modeling, in particular, automatically captures the appearance-based properties, such as hands and faces, in the feature space without the use of any keypoint-level annotations. On the other hand, Part-level Temporal Modeling implicitly mines the long-short term context to capture the relevant attributes over time. Extensive experiments demonstrate that our StepNet, thanks to spatial-temporal modules, achieves competitive Top-1 Per-instance accuracy on three commonly-used SLR benchmarks, i.e., 56.89% on WLASL, 77.2% on NMFs-CSL, and 77.1% on BOBSL. Additionally, the proposed method is compatible with the optical flow input and can produce superior performance if fused. For those who are hard of hearing, we hope that our work can act as a preliminary step.

4/9/2024

Deep Neural Network-Based Sign Language Recognition: A Comprehensive Approach Using Transfer Learning with Explainability

A. E. M Ridwan, Mushfiqul Islam Chowdhury, Mekhala Mariam Mary, Md Tahmid Chowdhury Abir

To promote inclusion and ensuring effective communication for those who rely on sign language as their main form of communication, sign language recognition (SLR) is crucial. Sign language recognition (SLR) seamlessly incorporates with diverse technology, enhancing accessibility for the deaf community by facilitating their use of digital platforms, video calls, and communication devices. To effectively solve this problem, we suggest a novel solution that uses a deep neural network to fully automate sign language recognition. This methodology integrates sophisticated preprocessing methodologies to optimise the overall performance. The architectures resnet, inception, xception, and vgg are utilised to selectively categorise images of sign language. We prepared a DNN architecture and merged it with the pre-processing architectures. In the post-processing phase, we utilised the SHAP deep explainer, which is based on cooperative game theory, to quantify the influence of specific features on the output of a machine learning model. Bhutanese-Sign-Language (BSL) dataset was used for training and testing the suggested technique. While training on Bhutanese-Sign-Language (BSL) dataset, overall ResNet50 with the DNN model performed better accuracy which is 98.90%. Our model's ability to provide informational clarity was assessed using the SHAP (SHapley Additive exPlanations) method. In part to its considerable robustness and reliability, the proposed methodological approach can be used to develop a fully automated system for sign language recognition.

9/12/2024