Theses and Dissertations

Permanent URI for this collection: https://hdl.handle.net/10217/100389

Open Access
A GPU accelerated RNA-RNA interaction program
(Colorado State University. Libraries, 2021) Gildemaster, Brandon, author; Rajopadhye, Sanjay, advisor; Chitsaz, Hamidreza, committee member; Abdo, Zaid, committee member
RNA-RNA interaction (RRI) is important in processes such as gene regulation and is known to play roles in diseases including cancer and Alzheimer's. Large RRI computations run for days, weeks, or even months because the algorithms have time and space complexity of $O(N^3M^3)$ and $O(N^2M^2)$, respectively, for sequences of lengths $N$ and $M$, so there is a need for high-throughput RRI tools. GPU parallelization of such algorithms is a challenge. We first show that the most computationally expensive part of base pair maximization (BPM) algorithms comprises $O(N^3)$ instances of upper banded tropical matrix products. We develop the first GPU library for this computation, attaining close to theoretical machine peak (TMP). We next optimize the other (fifth-degree polynomial) terms in the computation and develop the first GPU implementation of the complete BPMax algorithm. We attain 12% of GPU TMP, a significant speedup over the original parallel CPU implementation, which attains less than 1% of CPU TMP. We also perform a large-scale study of three small viral RNAs hypothesized to be relevant to COVID-19.
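As an illustration of the kernel named in the abstract above, the following is a minimal CPU sketch of a tropical (max-plus) matrix product, in which addition is replaced by max and multiplication by +. It assumes square upper triangular inputs with `-inf` below the diagonal; the exact band structure used in the thesis is not given in the abstract, and the function name `tropical_matmul_upper` is hypothetical. It is not the GPU library described by the author.

```python
NEG_INF = float("-inf")  # identity element for max in the tropical semiring


def tropical_matmul_upper(a, b):
    """Max-plus product of two n x n upper triangular matrices.

    Assumption: entries below the diagonal are -inf, so for i <= j only
    the indices k with i <= k <= j can contribute to c[i][j].
    """
    n = len(a)
    c = [[NEG_INF] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            best = NEG_INF
            for k in range(i, j + 1):
                candidate = a[i][k] + b[k][j]  # tropical "multiplication"
                if candidate > best:           # tropical "addition"
                    best = candidate
            c[i][j] = best
    return c


if __name__ == "__main__":
    a = [[0, 1], [NEG_INF, 2]]
    b = [[3, 4], [NEG_INF, 5]]
    print(tropical_matmul_upper(a, b))
```

Restricting `k` to the range `i..j` is what makes the triangular case cheaper than a dense product; a GPU version would additionally tile these loops, which this sketch does not attempt.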
Open Access
Read alignment using deep neural networks
(Colorado State University. Libraries, 2019) Shrestha, Akash, author; Chitsaz, Hamidreza, advisor; Ben-Hur, Asa, committee member; Abdo, Zaid, committee member
Read alignment is the process of mapping short DNA sequences to a reference genome. With the advent of continually evolving "next generation" sequencing technologies, the need for sequence alignment tools appeared. Many scientific communities and the companies marketing the sequencing technologies developed a whole spectrum of read aligners/mappers for different error profiles and read length characteristics. Among the most recent successfully marketed sequencing technologies are Oxford Nanopore and PacBio SMRT sequencing, which are considered top players because of their extremely long reads and low cost. However, the reads may contain up to 20% errors, which are generally not uniformly distributed. To deal with that error rate and read length, proximity-preserving hashing techniques, such as Minhash and Minimizers, were utilized to quickly map a read to the target region of the reference sequence. A variant of global or local alignment dynamic programming is then used to give the final alignment. In this research work, we train a Deep Neural Network (DNN) to yield a hashing scheme for the highly erroneous long reads, which is deemed superior to Minhash for mapping the reads. We implemented that idea to build a read alignment tool: DNNAligner. We evaluated the performance of our aligner against read aligners currently popular in the bioinformatics community: minimap2, bwa-mem, and graphmap. Our results show that the performance of DNNAligner is comparable to other tools without any code optimization or integration of other advanced features. Moreover, the DNN exhibits superior performance in comparison with Minhash on neighborhood classification.
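For readers unfamiliar with the baseline mentioned in the abstract above, here is a minimal sketch of MinHash over k-mers. It is a generic textbook version, not DNNAligner's hashing scheme; the function names, the k-mer length, the number of hash functions, and the use of seeded SHA-1 as the hash family are all assumptions made for illustration. Each sequence must be at least `k` characters long.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _seeded_hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer; the seed selects the hash function."""
    digest = hashlib.sha1(f"{seed}:{kmer}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(seq, k=4, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over the k-mer set."""
    kmer_set = kmers(seq, k)
    return [min(_seeded_hash(x, seed) for x in kmer_set)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    read = "ACGTACGTTGCA"
    region = "ACGTACGATGCA"  # same as read except for one substitution
    print(estimate_jaccard(minhash_signature(read), minhash_signature(region)))
```

Sequences that share many k-mers tend to agree in many signature slots, which is why such signatures can be used to shortlist candidate reference regions before running a dynamic programming alignment.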
Open Access
Theory of graph traversal edit distance, extensions, and applications
(Colorado State University. Libraries, 2019) Ebrahimpour Boroojeny, Ali, author; Chitsaz, Hamidreza, advisor; Ben-Hur, Asa, committee member; Abdo, Zaid, committee member
Many problems in applied machine learning deal with graphs (also called networks), including social networks, security, web data mining, protein function prediction, and genome informatics. The kernel paradigm beautifully decouples the learning algorithm from the underlying geometric space, which renders graph kernels important for the aforementioned applications. In this paper, we give a new graph kernel which we call graph traversal edit distance (GTED). We introduce the GTED problem and give the first polynomial time algorithm for it. Informally, the graph traversal edit distance is the minimum edit distance between two strings formed by the edge labels of respective Eulerian traversals of the two graphs. GTED is also motivated by, and provides the first mathematical formalism for, sequence co-assembly and de novo variation detection in bioinformatics. We demonstrate that GTED admits a polynomial time algorithm using a linear program in the graph product space that is guaranteed to yield an integer solution. To the best of our knowledge, this is the first approach to this problem. We also give a linear programming relaxation algorithm for a lower bound on GTED. We use GTED as a graph kernel and evaluate it by computing the accuracy of an SVM classifier on a few datasets in the literature. Our results suggest that our kernel outperforms many of the common graph kernels on the tested datasets. As a second set of experiments, we successfully cluster viral genomes using GTED on their assembly graphs obtained from de novo assembly of next-generation sequencing reads. In this project, we also show how to solve the problems of local and semi-global alignment between two graphs. Finally, we suggest an approach for speeding up the computations by assuming in advance that a subset of nodes must be paired.
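The definition in the abstract above builds on the edit distance between two strings. The sketch below shows only that building block, the standard dynamic program with unit costs for insertion, deletion, and substitution (an assumption, since the abstract does not state the cost model). It does not implement GTED itself, which minimizes over Eulerian traversals via a linear program; the function name `edit_distance` is chosen for illustration.

```python
def edit_distance(s, t):
    """Unit-cost edit (Levenshtein) distance between strings s and t.

    Keeps only the previous row of the DP table, so memory is O(len(t)).
    """
    prev = list(range(len(t) + 1))  # distance from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # deleting the first i characters of s
        for j, ct in enumerate(t, start=1):
            substitution = prev[j - 1] + (0 if cs == ct else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

In the GTED setting the two strings would be spelled by the edge labels along Eulerian traversals of the two graphs, and the distance would be minimized over all such traversals.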

Browsing Theses and Dissertations by Author "Abdo, Zaid, committee member"