Hierarchical Windowed Graph Attention Network and a Large Scale Dataset for Isolated Indian Sign Language Recognition

Read original: arXiv:2407.14224 - Published 9/30/2024 by Suvajit Patra, Arkadip Maitra, Megha Tiwari, K. Kumaran, Swathy Prabhu, Swami Punyeshwarananda, Soumitra Samanta

Overview

This paper proposes a novel hierarchical windowed graph attention network (HW-GAT) for isolated Indian sign language recognition.
The authors also introduce a large-scale dataset for isolated Indian sign language, called iSIGN.
According to the paper's abstract, the dataset contains 40,033 videos covering 2,002 commonly used words, recorded by 20 deaf adult signers (10 male and 10 female).
Pre-trained on this dataset and then fine-tuned, the HW-GAT model outperforms existing state-of-the-art keypoint-based models on the INCLUDE, LSA64, AUTSL, and WLASL benchmarks.

Plain English Explanation

The paper presents a new machine learning model called the hierarchical windowed graph attention network (HW-GAT) for recognizing isolated signs in Indian Sign Language (ISL).

The key idea behind HW-GAT is to break down the sign language recognition task into a hierarchical process. First, the model analyzes small, local windows of the sign video to identify key hand and body movements. Then, it combines these local insights into a broader understanding of the entire sign sequence.

This hierarchical approach allows the model to capture both the fine-grained details and the broader context of a sign, leading to more accurate recognition. The model also uses a graph attention mechanism to selectively focus on the most important parts of the input video.

To train and evaluate the HW-GAT model, the authors created a large dataset of isolated Indian sign language samples called iSIGN. According to the paper's abstract, this dataset contains 40,033 videos of 2,002 commonly used words, recorded by 20 deaf adult signers (10 male and 10 female), which works out to roughly 20 videos per word. Having a diverse, high-quality dataset is crucial for developing effective sign language recognition systems.

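After pre-training on this dataset and fine-tuning on other sign language datasets, the HW-GAT model improves on the best previous keypoint-based models on four benchmarks (INCLUDE, LSA64, AUTSL, and WLASL), with gains ranging from 0.46 to 6.84 percentage points. This suggests the hierarchical and attention-based design of the model is well-suited for this task.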

Technical Explanation

The paper introduces a hierarchical windowed graph attention network (HW-GAT) for isolated Indian sign language recognition. HW-GAT is a deep neural network that processes sign language videos in a multi-scale, attention-based manner.

The model first divides the input video into small temporal windows. For each window, it represents the signer's upper-body skeleton keypoints as a graph, with joints as nodes; the abstract describes the model as keypoint-based rather than operating on raw video features. A graph attention network is then applied to these local graphs to capture the relationships between different body parts.
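The windowing and graph attention steps described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the five-joint skeleton, the window size, the toy coordinates, and every function name here are assumptions made for illustration, and a trained model would apply learned projections before computing attention scores.

```python
import math

# Hypothetical 5-joint upper-body skeleton (names and edges are illustrative only).
JOINTS = ["neck", "l_shoulder", "r_shoulder", "l_wrist", "r_wrist"]
EDGES = [(0, 1), (0, 2), (1, 3), (2, 4)]


def neighbours(num_joints, edges):
    """Adjacency sets with self-loops, so each joint also attends to itself."""
    adj = {i: {i} for i in range(num_joints)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def split_windows(frames, size):
    """Cut a frame sequence into non-overlapping windows; a short tail is kept."""
    return [frames[i:i + size] for i in range(0, len(frames), size)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def window_joint_features(window):
    """Flatten each joint's (x, y) track across the frames of one window."""
    num_joints = len(window[0])
    return [[c for frame in window for c in frame[j]] for j in range(num_joints)]


def graph_attention(features, adj):
    """One attention pass in which joint i attends only to its skeleton neighbours.

    Scores are scaled dot products of the raw features (no learned weights).
    """
    dim = len(features[0])
    out = []
    for i, query in enumerate(features):
        nbrs = sorted(adj[i])
        scores = [
            sum(q * k for q, k in zip(query, features[j])) / math.sqrt(dim)
            for j in nbrs
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * features[j][d] for w, j in zip(weights, nbrs)) for d in range(dim)]
        )
    return out


# Toy clip: 6 frames, each a list of (x, y) per joint (made-up coordinates).
clip = [[(0.1 * t + 0.05 * j, 0.2 * j) for j in range(len(JOINTS))] for t in range(6)]
adj = neighbours(len(JOINTS), EDGES)
windows = split_windows(clip, size=3)
attended = [graph_attention(window_joint_features(w), adj) for w in windows]
print(len(windows), len(attended[0]), len(attended[0][0]))
```

Restricting each joint's attention to its skeleton neighbours is what makes this a graph attention step rather than plain self-attention; the window size controls how much motion each local graph sees.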

The local window representations are then aggregated hierarchically using another graph attention network. This allows the model to understand the broader context of the entire sign sequence, in addition to the fine-grained details captured in the local windows.
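As a rough sketch of this two-level aggregation, the example below pools each window's joint features into one summary vector, lets the window summaries attend to each other, and pools the result into a single clip-level vector. The mean pooling, the unparameterised dot-product attention, and all names are assumptions chosen for brevity, not details taken from the paper.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    count = len(vectors)
    return [sum(v[d] for v in vectors) / count for d in range(len(vectors[0]))]


def attend(vectors):
    """Scaled dot-product self-attention over a fully connected set of vectors."""
    dim = len(vectors[0])
    out = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors
        ]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)])
    return out


def clip_embedding(window_joint_features):
    """Two-level aggregation: joints -> window summary -> whole-clip vector.

    window_joint_features: one entry per window, each a list of per-joint vectors
    (for example, the output of a window-level attention step).
    """
    window_summaries = [mean_pool(joints) for joints in window_joint_features]
    contextualised = attend(window_summaries)  # windows exchange information
    return mean_pool(contextualised)           # single vector for a classifier head


# Toy input: 3 windows, 2 joints each, 2-dimensional features (made-up numbers).
toy = [
    [[0.0, 1.0], [0.2, 0.8]],
    [[0.5, 0.5], [0.7, 0.3]],
    [[1.0, 0.0], [0.8, 0.2]],
]
embedding = clip_embedding(toy)
print(len(embedding))
```

In a real model the clip-level vector would feed a classification layer over the sign vocabulary; here it simply shows how local window outputs can be combined into one representation of the whole sign.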

The authors also introduce a large-scale dataset for isolated Indian sign language, called iSIGN. According to the paper's abstract, the dataset covers 2,002 commonly used words signed by 20 deaf adult signers (10 male and 10 female), for a total of 40,033 videos. Its scale and vocabulary size enable robust training and evaluation of sign language recognition models.

Experiments show that pre-training the proposed HW-GAT model on this dataset and fine-tuning it on other sign language datasets improves on existing state-of-the-art keypoint-based models by 1.10, 0.46, 0.78, and 6.84 percentage points on INCLUDE, LSA64, AUTSL, and WLASL respectively. This supports the effectiveness of the hierarchical and attention-based design of the HW-GAT model for this task.

Critical Analysis

The paper provides a comprehensive solution for isolated Indian sign language recognition, including a novel model architecture and a large-scale dataset. The hierarchical and attention-based design of HW-GAT is a promising approach that allows the model to capture both local and global features of sign language videos.

One potential limitation of the work is that it focuses only on isolated sign recognition, rather than continuous sign language recognition. In real-world scenarios, sign language users often communicate using fluid, connected sequences of signs, rather than discrete, isolated signs. Extending the HW-GAT model to handle continuous sign language recognition would be an important next step.

Additionally, the paper does not provide a detailed analysis of the model's performance on different sign classes or across different signer demographics. Understanding the model's strengths and weaknesses for specific sign types or signer groups could help guide future improvements and applications of the technology.
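The kind of breakdown suggested here is straightforward to compute once predictions are logged alongside metadata. The sketch below is a hypothetical helper, with made-up records that are not results from the paper; the field names `label`, `pred`, and `signer` are assumptions.

```python
from collections import defaultdict


def grouped_accuracy(records, key):
    """Accuracy per group. records: dicts with 'label', 'pred' and the grouping key."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec[key]] += 1
        correct[rec[key]] += int(rec["pred"] == rec["label"])
    return {group: correct[group] / total[group] for group in total}


# Made-up predictions, for illustration only (not results from the paper).
records = [
    {"label": "hello", "pred": "hello", "signer": "S01"},
    {"label": "hello", "pred": "thanks", "signer": "S02"},
    {"label": "thanks", "pred": "thanks", "signer": "S01"},
    {"label": "thanks", "pred": "thanks", "signer": "S02"},
]
per_class = grouped_accuracy(records, "label")
per_signer = grouped_accuracy(records, "signer")
print(per_class, per_signer)
```

Grouping by sign class highlights signs that are easily confused, while grouping by signer shows whether accuracy depends on who is signing.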

Further research could also explore the generalizability of the HW-GAT model and the iSIGN dataset to other sign language recognition tasks, such as continuous sign language translation or sign language generation. Investigating the model's performance on these related tasks would help establish its broader utility in sign language technology.

Conclusion

This paper presents a novel hierarchical windowed graph attention network (HW-GAT) for isolated Indian sign language recognition, along with a large-scale dataset called iSIGN. The HW-GAT model's hierarchical and attention-based design allows it to capture both local and global features of sign language videos, and pre-training on the new dataset yields gains over existing state-of-the-art keypoint-based models on INCLUDE, LSA64, AUTSL, and WLASL.

The availability of the iSIGN dataset and the promising results of the HW-GAT model suggest that this work represents an important step forward in the development of robust and accurate sign language recognition systems. Further research to extend the model's capabilities to continuous sign language recognition and other related tasks could have a significant impact on improving accessibility and communication for sign language users.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Hierarchical Windowed Graph Attention Network and a Large Scale Dataset for Isolated Indian Sign Language Recognition

Suvajit Patra, Arkadip Maitra, Megha Tiwari, K. Kumaran, Swathy Prabhu, Swami Punyeshwarananda, Soumitra Samanta

Automatic Sign Language (SL) recognition is an important task in the computer vision community. To build a robust SL recognition system, we need a considerable amount of data which is lacking particularly in Indian sign language (ISL). In this paper, we introduce a large-scale isolated ISL dataset and a novel SL recognition model based on skeleton graph structure. The dataset covers 2002 daily used common words in the deaf community recorded by 20 (10 male and 10 female) deaf adult signers (contains 40033 videos). We propose a SL recognition model namely Hierarchical Windowed Graph Attention Network (HWGAT) by utilizing the human upper body skeleton graph. The HWGAT tries to capture distinctive motions by giving attention to different body parts induced by the human skeleton graph. The utility of the proposed dataset and the usefulness of our model are evaluated through extensive experiments. We pre-trained the proposed model on the presented dataset and fine-tuned it across different sign language datasets further boosting the performance of 1.10, 0.46, 0.78, and 6.84 percentage points on INCLUDE, LSA64, AUTSL and WLASL respectively compared to the existing state-of-the-art keypoints-based models.

9/30/2024

Bengali Sign Language Recognition through Hand Pose Estimation using Multi-Branch Spatial-Temporal Attention Model

Abu Saleh Musa Miah, Md. Al Mehedi Hasan, Md Hadiuzzaman, Muhammad Nazrul Islam, Jungpil Shin

Hand gesture-based sign language recognition (SLR) is one of the most advanced applications of machine learning, and computer vision uses hand gestures. Although, in the past few years, many researchers have widely explored and studied how to address BSL problems, specific unaddressed issues remain, such as skeleton and transformer-based BSL recognition. In addition, the lack of evaluation of the BSL model in various concealed environmental conditions can prove the generalized property of the existing model by facing daily life signs. As a consequence, existing BSL recognition systems provide a limited perspective of their generalisation ability as they are tested on datasets containing few BSL alphabets that have a wide disparity in gestures and are easy to differentiate. To overcome these limitations, we propose a spatial-temporal attention-based BSL recognition model considering hand joint skeletons extracted from the sequence of images. The main aim of utilising hand skeleton-based BSL data is to ensure the privacy and low-resolution sequence of images, which need minimum computational cost and low hardware configurations. Our model captures discriminative structural displacements and short-range dependency based on unified joint features projected onto high-dimensional feature space. Specifically, the use of Separable TCN combined with a powerful multi-head spatial-temporal attention architecture generated high-performance accuracy. The extensive experiments with a proposed dataset and two benchmark BSL datasets with a wide range of evaluations, such as intra- and inter-dataset evaluation settings, demonstrated that our proposed models achieve competitive performance with extremely low computational complexity and run faster than existing models.

8/27/2024

Enhancing Brazilian Sign Language Recognition through Skeleton Image Representation

Carlos Eduardo G. R. Alves, Francisco de Assis Boldt, Thiago M. Paixão

Effective communication is paramount for the inclusion of deaf individuals in society. However, persistent communication barriers due to limited Sign Language (SL) knowledge hinder their full participation. In this context, Sign Language Recognition (SLR) systems have been developed to improve communication between signing and non-signing individuals. In particular, there is the problem of recognizing isolated signs (Isolated Sign Language Recognition, ISLR) of great relevance in the development of vision-based SL search engines, learning tools, and translation systems. This work proposes an ISLR approach where body, hands, and facial landmarks are extracted throughout time and encoded as 2-D images. These images are processed by a convolutional neural network, which maps the visual-temporal information into a sign label. Experimental results demonstrate that our method surpassed the state-of-the-art in terms of performance metrics on two widely recognized datasets in Brazilian Sign Language (LIBRAS), the primary focus of this study. In addition to being more accurate, our method is more time-efficient and easier to train due to its reliance on a simpler network architecture and solely RGB data as input.

5/1/2024

Multi-Stream Keypoint Attention Network for Sign Language Recognition and Translation

Mo Guan, Yan Wang, Guangkun Ma, Jiarui Liu, Mingzu Sun

Sign language serves as a non-vocal means of communication, transmitting information and significance through gestures, facial expressions, and bodily movements. The majority of current approaches for sign language recognition (SLR) and translation rely on RGB video inputs, which are vulnerable to fluctuations in the background. Employing a keypoint-based strategy not only mitigates the effects of background alterations but also substantially diminishes the computational demands of the model. Nevertheless, contemporary keypoint-based methodologies fail to fully harness the implicit knowledge embedded in keypoint sequences. To tackle this challenge, our inspiration is derived from the human cognition mechanism, which discerns sign language by analyzing the interplay between gesture configurations and supplementary elements. We propose a multi-stream keypoint attention network to depict a sequence of keypoints produced by a readily available keypoint estimator. In order to facilitate interaction across multiple streams, we investigate diverse methodologies such as keypoint fusion strategies, head fusion, and self-distillation. The resulting framework is denoted as MSKA-SLR, which is expanded into a sign language translation (SLT) model through the straightforward addition of an extra translation network. We carry out comprehensive experiments on well-known benchmarks like Phoenix-2014, Phoenix-2014T, and CSL-Daily to showcase the efficacy of our methodology. Notably, we have attained a novel state-of-the-art performance in the sign language translation task of Phoenix-2014T. The code and models can be accessed at: https://github.com/sutwangyan/MSKA.

5/10/2024