Multi-stream deep learning for isolated sign language recognition in videos

dc.contributor.author: Alsharif, Muhammad H., author
dc.contributor.author: Anderson, Charles, advisor
dc.contributor.author: Kirby, Michael, committee member
dc.contributor.author: Blanchard, Nathaniel, committee member
dc.contributor.author: Peterson, Christopher, committee member
dc.date.accessioned: 2025-06-02T15:21:08Z
dc.date.available: 2026-05-28
dc.date.issued: 2025
dc.description.abstract: Isolated sign language recognition is the task of identifying signs performed in isolation across multiple frames of a video. Advances in this field have significant implications, such as improving visual communication between humans and machines and bridging the communication gap between deaf and hearing individuals. However, practical applications in this domain have been limited by two key challenges: the computational complexity of current models and the limited availability of training data for many signs in sign language vocabularies. This dissertation addresses these challenges with the aim of improving both recognition accuracy and computational efficiency. 3D convolutional models with RGB and optical flow inputs have been widely used in state-of-the-art methods for action recognition, yet despite their significant computational costs, systematic evaluation of their contribution to sign recognition has been limited. We first evaluate the effectiveness of 3D convolutional networks, showing that they significantly outperform their 2D counterparts on several sign recognition datasets, even when compared to a deeper 2D architecture. Additionally, this research challenges conventional assumptions about optical flow, demonstrating through ablation studies that its primary value lies in masking irrelevant (static) regions rather than in improving the learning of motion patterns for sign recognition. In addition to RGB and optical flow, this work investigates skeleton-based sign language recognition using recurrent, transformer, and spatiotemporal convolutional graph networks. Our experimental results demonstrate the importance of a spatiotemporal sparse graph representation of skeleton data (coordinates of body and hand joints) for improving accuracy and for interpretability through edge importance weighting. To address the limited amount of training data available for many signs, we propose a coarse-to-fine transfer learning approach that adapts spatiotemporal features learned from large action recognition and Turkish Sign Language datasets to American Sign Language (ASL) datasets. This approach yields significant improvements across multiple modalities and benchmarks. To combine different models in a multi-stream network, we propose several methods for fusing the stream outputs before and after classification. To find the best combination of models using RGB, optical flow, or skeleton data as input modalities, we train and evaluate all possible combinations of two- and three-stream networks on three sign recognition datasets. Our findings show that combining RGB and skeleton-based streams provides the largest gain over the RGB baseline, owing to greater diversity in stream predictions. In contrast, combining RGB and optical flow-based streams significantly increases computational cost, due to optical flow extraction, without improving accuracy over two RGB streams. Our two- and three-stream networks, using only RGB and skeleton data as input modalities, achieve new state-of-the-art accuracy on the two largest ASL video datasets, which include 1,000 and 2,000 signs. Our approach achieves over 90% top-5 recognition accuracy on all benchmarks while significantly reducing computational costs compared to state-of-the-art methods. These findings facilitate real-time applications on mobile devices aimed at improving convenience in the daily lives of deaf individuals and helping to overcome communication barriers.
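
Note: the abstract describes fusing stream outputs after classification (score-level fusion). As an illustrative sketch only, not the dissertation's actual method, the snippet below averages class probabilities from an RGB stream and a skeleton stream in PyTorch; the function name late_fuse, the equal fusion weights, and the 1,000-sign vocabulary size are assumptions chosen for the example.

    import torch
    import torch.nn.functional as F

    def late_fuse(rgb_logits: torch.Tensor,
                  skeleton_logits: torch.Tensor,
                  weights=(0.5, 0.5)) -> torch.Tensor:
        """Score-level (late) fusion: weighted average of the per-class
        probabilities of two streams. Both inputs have shape
        (batch, num_classes)."""
        return (weights[0] * F.softmax(rgb_logits, dim=-1)
                + weights[1] * F.softmax(skeleton_logits, dim=-1))

    # Hypothetical example: fuse predictions for 4 clips over 1,000 signs.
    rgb_logits = torch.randn(4, 1000)       # output of an RGB video stream
    skeleton_logits = torch.randn(4, 1000)  # output of a skeleton-based stream
    fused = late_fuse(rgb_logits, skeleton_logits)
    top5 = fused.topk(5, dim=-1).indices    # top-5 predicted sign classes

Weighted averaging of softmax scores is a common score-level fusion baseline; in practice the per-stream weights would be tuned on a validation set.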
dc.format.medium: born digital
dc.format.medium: doctoral dissertations
dc.identifier: Alsharif_colostate_0053A_18807.pdf
dc.identifier.uri: https://hdl.handle.net/10217/241022
dc.language: English
dc.language.iso: eng
dc.publisher: Colorado State University. Libraries
dc.relation.ispartof: 2020-
dc.rights: Copyright and other restrictions may apply. User is responsible for compliance with all applicable laws. For information about copyright law, please see https://libguides.colostate.edu/copyright.
dc.rights.access: Embargo expires: 05/28/2026.
dc.subject: deep learning
dc.subject: multi modal
dc.subject: sign language recognition
dc.subject: machine learning
dc.subject: artificial neural networks
dc.subject: multi stream
dc.title: Multi-stream deep learning for isolated sign language recognition in videos
dc.type: Text
dcterms.embargo.expires: 2026-05-28
dcterms.embargo.terms: 2026-05-28
dcterms.rights.dpla: This Item is protected by copyright and/or related rights (https://rightsstatements.org/vocab/InC/1.0/). You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Colorado State University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy (Ph.D.)

Files

Original bundle

Name: Alsharif_colostate_0053A_18807.pdf
Size: 10.53 MB
Format: Adobe Portable Document Format