Exploring the effects of multimodal features on a machine learning knowledge tracker

Khebour, Ibrahim, author; Krishnaswamy, Nikhil, advisor; Blanchard, Nathaniel, advisor; Peterson, Christopher, committee member

Files

khebour_colostate_0053N_19339.pdf (1.75 MB)

Date

2026

Abstract

Conversations involve multiple channels of information exchange. Spoken language is the most common, but non-verbal cues such as gestures, body pose, and movements also play a role. These channels carry semantic information but are subtle and harder for machines to detect. Recent advances in multimodal Large Language Models (LLMs) show that incorporating additional modalities can improve performance, which raises two questions: how much do extra modalities contribute, and what are the limits of continually stacking them? Modeling the flow of conversation remains challenging for AI, particularly in natural, collaborative settings where non-verbal channels are prominent. To address this, TRACE was developed: a multimodal system that monitors shared knowledge in group tasks by tracking utterances, gestures, and actions. The live system runs in real time using speech-only features, while an offline version integrates broader modalities, including problem-solving cues from speech, actions, and gestures. This thesis extends the live system by incorporating additional features, some of which require training new models to process visual inputs in real time. Because the components of the extended live system may differ from those of the offline version, I will conduct a comparative analysis of both systems. The evaluation will highlight cases where the live version underperforms, as some loss is expected. A comparison with the current live tracker will also measure the impact of the new modalities. The Weights Task Dataset will be used for training, testing, and evaluation of action and gesture classification. Automating this classification reduces the need for manual annotation and links gestures to broader semantic context, offering substantial value for future work.

Subject

human-computer interactions

multimodality

human-human interactions

common ground

URI

https://hdl.handle.net/10217/244745
https://doi.org/10.25675/3.027105

Collections

2020-
Theses and Dissertations
