Browsing by Author "Krishnaswamy, Nikhil, advisor"
Now showing 1 - 7 of 7
Item Open Access: Applications of topological data analysis to natural language processing and computer vision (Colorado State University. Libraries, 2022)
Garcia, Jason S., author; Krishnaswamy, Nikhil, advisor; Adams, Henry, committee member; Beveridge, Ross, committee member
Topological Data Analysis (TDA) uses ideas from topology to study the "shape" of data. It provides a set of tools to extract features, such as holes, voids, and connected components, from complex high-dimensional data. This thesis presents an introductory exposition of the mathematics underlying the two main tools of TDA: Persistent Homology and the MAPPER algorithm. Persistent Homology detects topological features that persist over a range of resolutions, capturing both local and global geometric information. The MAPPER algorithm is a visualization tool that provides a type of dimensionality reduction that preserves topological properties of the data by projecting it onto lower-dimensional simplicial complexes. Furthermore, this thesis explores recent applications of these tools to natural language processing and computer vision. These applications fall into two main approaches. In the first approach, TDA is used to extract features from data that are then used as input for a variety of machine learning tasks, such as image classification or visualizing the semantic structure of text documents. The second approach applies the tools of TDA to the machine learning algorithms themselves, for example, using MAPPER to study how structure emerges in the weights of a trained neural network. Finally, the results of several experiments are presented. These include using Persistent Homology for image classification and using MAPPER to visualize the global structure of these data sets. Most notably, the MAPPER algorithm is used to visualize vector representations of contextualized word embeddings as they move through the encoding layers of the BERT-base transformer model.
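As a purely illustrative sketch (not code from the thesis above), the following Python snippet computes persistence diagrams for a noisy circle using the third-party ripser package (an assumed dependency); the single long-lived H1 feature corresponds to the circle's loop, which is the kind of structure Persistent Homology is designed to detect.

# Illustrative only: persistence diagram of a noisy circle (assumes `pip install ripser numpy`).
import numpy as np
from ripser import ripser

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
points = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(scale=0.05, size=(200, 2))

# dgms[0] holds H0 (connected components); dgms[1] holds H1 (loops/holes).
dgms = ripser(points, maxdim=1)["dgms"]
lifetimes = dgms[1][:, 1] - dgms[1][:, 0]
print("longest-lived H1 feature persists for:", lifetimes.max())

Features with long lifetimes (death minus birth) are the ones treated as genuine topological structure rather than sampling noise.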
Item Open Access: Embodied multimodal referring expressions generation (Colorado State University. Libraries, 2024)
Alalyani, Nada H., author; Krishnaswamy, Nikhil, advisor; Ortega, Francisco, committee member; Blanchard, Nathaniel, committee member; Wang, Haonan, committee member
Using both verbal and non-verbal modalities in generating definite descriptions of objects and locations is a critical human capability in collaborative interactions. Despite advancements in AI, embodied interactive virtual agents (IVAs) are not equipped to intelligently mix modalities to communicate their intents as humans do, which hamstrings naturalistic multimodal interaction. We introduce SCMRE, a situated corpus of multimodal referring expressions (MREs) intended for training generative AI systems to produce multimodal referring expressions for IVAs. Our contributions include: 1) developing an IVA platform that interprets human multimodal instructions and responds with language and gestures; 2) providing 24 participants with 10 scenes, each involving ten equally-sized blocks randomly placed on a table; these interactions generated a dataset of 10,408 samples; 3) analyzing SCMRE, revealing that the use of pointing significantly reduces the ambiguity of prompts and increases the efficiency of the IVA's execution of humans' prompts; 4) augmenting and synthesizing SCMRE, resulting in 22,159 samples, to generate more data for model training; 5) fine-tuning LLaMA 2-chat-13B to generate contextually correct and situationally fluent multimodal referring expressions; 6) integrating the fine-tuned model into the IVA to evaluate how successfully the generative-model-enabled IVA communicates with humans; and 7) establishing an evaluation process that applies to both humans and IVAs and combines quantitative and qualitative metrics.

Item Open Access: Exploring correspondences between Gibsonian and telic affordances for object grasping using 3D geometry (Colorado State University. Libraries, 2023)
Tomar, Aniket, author; Krishnaswamy, Nikhil, advisor; Blanchard, Nathaniel, committee member; Clegg, Benjamin, committee member
Object affordance understanding is an important open problem in AI and robotics. Gibsonian affordances of an object are actions afforded by its physical structure and can be directly perceived by agents. A telic affordance is an action that is conventionalized due to an object's typical use or purpose. This work explores the extent to which a 3D CNN analogue can infer grasp affordances from 3D shape information alone. The experiment was designed as a grasp classification task over 3D meshes of common kitchen objects, with labels derived from human annotations. 3D shape information was found to be insufficient for current models to learn telic affordances, even though they are successful at shape classification and Gibsonian affordance learning. This was investigated further by training a classifier to predict telic grasps directly from the human annotations; its higher accuracy indicates that the information required for successful classification existed in the dataset but was not effectively utilized. Finally, the embedding spaces of the two classifiers were compared and found to have no significant correspondence between them. This work hypothesizes that this is because the two models capture fundamentally different distributions of affordances with respect to objects: one representing Gibsonian affordances or shape information, and the other, telic affordances.
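The abstract above reports comparing the embedding spaces of two classifiers but does not specify a measure; as a hypothetical illustration only, linear centered kernel alignment (CKA) is one standard way to quantify such a correspondence, with values near 1 indicating closely related geometry and values near 0 indicating little shared structure.

# Hypothetical illustration: linear CKA between two embedding spaces (not the thesis's method).
import numpy as np

def linear_cka(X, Y):
    # X, Y: (n_samples, dim) embeddings of the same inputs from two different models.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
shape_emb = rng.normal(size=(500, 128))  # stand-in for shape/Gibsonian classifier embeddings
grasp_emb = rng.normal(size=(500, 64))   # stand-in for telic-grasp classifier embeddings
print(f"linear CKA: {linear_cka(shape_emb, grasp_emb):.3f}")  # near 0 for unrelated spaces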
Item Open Access: Intentional microgesture recognition for extended human-computer interaction (Colorado State University. Libraries, 2023)
Kandoi, Chirag, author; Blanchard, Nathaniel, advisor; Krishnaswamy, Nikhil, advisor; Soto, Hortensia, committee member
As extended reality becomes more ubiquitous, people will more frequently interact with computer systems using gestures instead of peripheral devices. However, previous works have shown that using traditional gestures (pointing, swiping, etc.) in mid-air causes fatigue, rendering them largely unsuitable for long-term use. Some of the same researchers have promoted "microgestures" (smaller gestures requiring less gross motion) as a solution, but to date there is no dataset of intentional microgestures available to train computer vision algorithms for use in downstream interactions with computer systems such as agents deployed on XR headsets. As a step toward addressing this challenge, I present a novel video dataset of microgestures and classification results from a variety of ML models that showcase the feasibility (and difficulty) of detecting these fine-grained movements, and I discuss the challenges in developing robust recognition of microgestures for human-computer interaction.

Item Open Access: Linear mappings: semantic transfer from transformer models for cognate detection and coreference resolution (Colorado State University. Libraries, 2022)
Nath, Abhijnan, author; Krishnaswamy, Nikhil, advisor; Blanchard, Nathaniel, committee member; King, Emily J., committee member
Embeddings, or vector representations of language, and their properties are useful for understanding how Natural Language Processing technology works. The usefulness of embeddings, however, depends on how contextualized or information-rich such embeddings are. In this work, I apply a novel affine (linear) mapping technique, first established in the field of computer vision, to embeddings generated from large Transformer-based language models. In particular, I study its use in two challenging linguistic tasks: cross-lingual cognate detection and cross-document coreference resolution. Cognate detection for two Low-Resource Languages (LRL), Assamese and Bengali, is framed as a binary classification problem using semantic (embedding-based), articulatory, and phonetic features. Linear maps for this task are extrinsically evaluated on the extent of transfer of semantic information between monolingual as well as multilingual models, including those specialized for low-resourced Indian languages. For cross-document coreference resolution, whole-document contextual representations are generated for event and entity mentions from cross-document language models like CDLM and other BERT variants, and are then linearly mapped to form coreferring clusters based on their cosine similarities. I evaluate my results against gold output using established coreference metrics like BCUB and MUC. My findings reveal that linearly transforming vectors from one model's embedding space to another carries certain semantic information with high fidelity, thereby revealing the existence of a canonical embedding space and its geometric properties for language models. Interestingly, even for a much more challenging task like coreference resolution, linear maps are able to transfer semantic information between "lighter" (less contextual) models and "larger" models with near-equivalent performance, or even improved results in some cases.
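As a minimal sketch of the general idea described in the linear-mappings abstract (assuming paired embeddings of the same mentions from two models; this is not the author's code), a least-squares linear map can be fit from one model's space to another and the mapped vectors compared with cosine similarity.

# Minimal sketch: least-squares linear map between two embedding spaces (illustrative only).
import numpy as np

def fit_linear_map(src, tgt):
    # Solve for W minimizing ||src @ W - tgt||_F over paired embeddings.
    W, *_ = np.linalg.lstsq(src, tgt, rcond=None)
    return W

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
src = rng.normal(size=(1000, 384))                                # toy "lighter" model vectors
tgt = src @ (rng.normal(size=(384, 768)) * 0.05) + rng.normal(scale=0.01, size=(1000, 768))

W = fit_linear_map(src[:800], tgt[:800])                          # fit on paired anchors
sims = [cosine(m, t) for m, t in zip(src[800:] @ W, tgt[800:])]   # evaluate on held-out pairs
print("mean cosine similarity after mapping:", round(float(np.mean(sims)), 3))

In a coreference setting, vectors mapped this way could then be grouped by thresholding pairwise cosine similarity, which is the spirit of the clustering step the abstract describes.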
Item Open Access: Robust gesture detection for multimodal problem solving (Colorado State University. Libraries, 2024)
VanderHoeven, Hannah G., author; Blanchard, Nathaniel, advisor; Krishnaswamy, Nikhil, advisor; Cleary, Anne M., committee member
Throughout various collaborative problem solving (CPS) tasks, multiple communicative modalities may be used by participants as they communicate with each other to work towards some goal. The ability to recognize and act on these modalities is vital for a multimodal AI agent to interact with humans effectively and meaningfully. Modalities of interest include speech, gesture, action, pose, facial expression, and object positions in three-dimensional space. As AI becomes more commonplace in various collaborative environments, there is considerable potential to use an agent to help support learning, training, and understanding of how small groups work together to complete CPS tasks. To design a well-rounded system that best understands small-group interactions, multiple modalities need to be supported. Gesture is one of many important features to consider in multimodal design. Robust gesture recognition is a key component of both multimodal language understanding and human-computer interaction. Most vision-based approaches to gesture recognition focus on static, standalone gestures that are identifiable in a single video frame. In CPS tasks, more complex gestures made up of multiple "phases" are more likely to occur; for instance, deixis, or pointing, is used to indicate objects and referents in a scene. In this thesis, I present a novel method for robust gesture detection based on gesture phase semantics. This method is competitive with many state-of-the-art computer vision approaches while being faster to train on annotated data. I also present applications of this method that use pointing detection in a real-world collaborative task, and I discuss in further depth the importance of robust gesture detection in multimodal agent design.

Item Open Access: Something is fishy! - How ambiguous language affects generalization of video action recognition networks (Colorado State University. Libraries, 2022)
Patil, Dhruva Kishor, author; Beveridge, J. Ross, advisor; Krishnaswamy, Nikhil, advisor; Ortega, Francisco R., committee member; Clegg, Benjamin, committee member
Modern neural networks designed for video action recognition are able to classify video snippets with high degrees of confidence and accuracy. The success of these models lies in the complex feature representations they learn from the training data, but the limitations of these models are rarely linked, on a deeper level, to the inconsistent quality of that training data. Although newer and better approaches pride themselves on higher evaluation metrics, this dissertation questions whether these networks are recognizing the peculiarities of dataset labels. A reason for these peculiarities lies in deviations from the standardized data collection and curation protocols that ensure quality labels. Consequently, the models may learn data properties that are irrelevant or even undesirable when trained using only a forced-choice technique. One solution for these shortcomings is to reinspect the training data and gain better insights towards designing more efficient algorithms. The Something-Something dataset, a popular dataset for video action recognition, has large semantic overlaps, both visual and linguistic, between the different labels provided for each video sample. It can be argued that there are multiple possible interpretations of the actions in videos, and that restricting each video to one label can limit or even negatively impact the network's ability to generalize, even to the dataset's own testing data. To validate this claim, this dissertation introduces a human-in-the-loop procedure to review the legacy labels and relabel the Something-Something validation data. When the new labels thus obtained are used to reassess the performance of video action recognition networks, significant gains of almost 12% and 3% in top-1 and top-5 accuracies, respectively, are reported. This hypothesis is further validated by visualizing the layer-wise internals of the networks using Grad-CAM to show that the model focuses on relevant salient regions when predicting an action in a video.
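As a hedged illustration of the relabeling reassessment (a hypothetical helper, not the dissertation's code), top-1/top-5 accuracy can be recomputed so that a prediction counts as correct if it matches any of the acceptable labels produced by the human-in-the-loop review; the 174-class setup below mirrors the Something-Something label set.

# Hypothetical helper: top-k accuracy when each video may have several acceptable labels.
import numpy as np

def topk_accuracy(logits, acceptable, k):
    # logits: (n_videos, n_classes); acceptable[i]: set of valid class ids for video i.
    topk = np.argsort(-logits, axis=1)[:, :k]
    return float(np.mean([len(set(row) & ok) > 0 for row, ok in zip(topk, acceptable)]))

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 174))               # toy scores over 174 action classes
legacy = [{3}, {10}, {42}, {7}]                  # one legacy label per video
relabeled = [{3, 5}, {10}, {40, 41, 42}, {9}]    # several acceptable labels after review
print("top-1 vs. legacy labels:   ", topk_accuracy(logits, legacy, k=1))
print("top-1 vs. relabeled labels:", topk_accuracy(logits, relabeled, k=1))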