Multi-stream deep learning for isolated sign language recognition in videos

dc.contributor.author: Alsharif, Muhammad H., author
dc.contributor.author: Anderson, Charles, advisor
dc.contributor.author: Kirby, Michael, committee member
dc.contributor.author: Blanchard, Nathaniel, committee member
dc.contributor.author: Peterson, Christopher, committee member
dc.date.accessioned: 2025-06-02T15:21:08Z
dc.date.available: 2026-05-28
dc.date.issued: 2025
dc.description.abstract: Isolated sign language recognition is the task of identifying signs performed in isolation across multiple frames of a video. Advances in this field have significant implications, such as improving visual communication between humans and machines and bridging the communication gap between deaf and hearing individuals. However, practical applications in this domain have been limited by two key challenges: the computational complexity of current models and the limited availability of training data for many signs in sign language vocabularies. This dissertation addresses these challenges with the aim of improving both recognition accuracy and computational efficiency. 3D convolutional models with RGB and optical flow inputs have been widely used in state-of-the-art methods for action recognition, yet despite their significant computational costs, systematic evaluation of their contribution to sign recognition has been limited. We first evaluate the effectiveness of 3D convolutional networks, showing that they significantly outperform their 2D counterparts on several sign recognition datasets, even when compared to a deeper 2D architecture. Additionally, this research challenges conventional assumptions about optical flow, demonstrating through ablation studies that its primary value lies in masking irrelevant (static) regions rather than in improving the learning of motion patterns for sign recognition. In addition to RGB and optical flow, this work investigates skeleton-based sign language recognition using recurrent, transformer, and spatiotemporal convolutional graph networks. Our experimental results demonstrate the importance of a spatiotemporal sparse graph representation of skeleton data (coordinates of body and hand joints) for improving accuracy and for interpretability through edge importance weighting. To address the limited amount of training data available for many signs, we propose a coarse-to-fine transfer learning approach that adapts spatiotemporal features learned from large action recognition and Turkish Sign Language datasets to American Sign Language (ASL) datasets. This approach yields significant improvements across multiple modalities and benchmarks. To combine different models in a multi-stream network, we propose several methods for fusing the stream outputs before and after classification. To find the best combination of models using RGB, optical flow, or skeleton data as input modalities, we train and evaluate all possible combinations of two- and three-stream networks on three sign recognition datasets. Our findings show that combining RGB and skeleton-based streams provides the largest gain over the RGB baseline, owing to greater diversity in stream predictions. In contrast, combining RGB and optical flow-based streams significantly increases computational cost, due to optical flow extraction, without improving accuracy over two RGB streams. Our two- and three-stream networks, using only RGB and skeleton data as input modalities, achieve new state-of-the-art accuracy on the two largest ASL video datasets, which include 1,000 and 2,000 signs. Our approach achieves over 90% top-5 recognition accuracy on all benchmarks while significantly reducing computational costs compared to state-of-the-art methods. These findings facilitate real-time applications on mobile devices aimed at improving convenience in the daily lives of deaf individuals and helping to overcome communication barriers.
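
Note: the abstract describes fusing stream outputs after classification (score-level fusion). As an illustrative sketch only, not the dissertation's actual method, the snippet below averages class probabilities from an RGB stream and a skeleton stream in PyTorch; the function name late_fuse, the equal fusion weights, and the 1,000-sign vocabulary size are assumptions chosen for the example.

    import torch
    import torch.nn.functional as F

    def late_fuse(rgb_logits: torch.Tensor,
                  skeleton_logits: torch.Tensor,
                  weights=(0.5, 0.5)) -> torch.Tensor:
        """Score-level (late) fusion: weighted average of the per-class
        probabilities of two streams. Both inputs have shape
        (batch, num_classes)."""
        return (weights[0] * F.softmax(rgb_logits, dim=-1)
                + weights[1] * F.softmax(skeleton_logits, dim=-1))

    # Hypothetical example: fuse predictions for 4 clips over 1,000 signs.
    rgb_logits = torch.randn(4, 1000)       # output of an RGB video stream
    skeleton_logits = torch.randn(4, 1000)  # output of a skeleton-based stream
    fused = late_fuse(rgb_logits, skeleton_logits)
    top5 = fused.topk(5, dim=-1).indices    # top-5 predicted sign classes

Weighted averaging of softmax scores is a common score-level fusion baseline; in practice the per-stream weights would be tuned on a validation set.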
dc.format.medium: born digital
dc.format.medium: doctoral dissertations
dc.identifier: Alsharif_colostate_0053A_18807.pdf
dc.identifier.uri: https://hdl.handle.net/10217/241022
dc.language: English
dc.language.iso: eng
dc.publisher: Colorado State University. Libraries
dc.relation.ispartof: 2020-
dc.rights: Copyright and other restrictions may apply. User is responsible for compliance with all applicable laws. For information about copyright law, please see https://libguides.colostate.edu/copyright.
dc.rights.access: Embargo expires: 05/28/2026.
dc.subject: deep learning
dc.subject: multi modal
dc.subject: sign language recognition
dc.subject: machine learning
dc.subject: artificial neural networks
dc.subject: multi stream
dc.title: Multi-stream deep learning for isolated sign language recognition in videos
dc.type: Text
dcterms.embargo.expires: 2026-05-28
dcterms.embargo.terms: 2026-05-28
dcterms.rights.dpla: This Item is protected by copyright and/or related rights (https://rightsstatements.org/vocab/InC/1.0/). You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Colorado State University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy (Ph.D.)

Files

Original bundle

Name: Alsharif_colostate_0053A_18807.pdf
Size: 10.53 MB
Format: Adobe Portable Document Format