A comprehensive compendium of Arabidopsis RNA-seq data

dc.contributor.author: Halladay, Gareth A., author
dc.contributor.author: Ben-Hur, Asa, advisor
dc.contributor.author: Chitsaz, Hamidreza, committee member
dc.contributor.author: Reddy, Anireddy, committee member
dc.date.accessioned: 2020-06-22T11:52:29Z
dc.date.available: 2020-06-22T11:52:29Z
dc.date.issued: 2020
dc.description.abstract: In the last fifteen years, the amount of publicly available genomic sequencing data has doubled every few months. Analyzing large collections of RNA-seq datasets can provide insights that are not available when analyzing data from single experiments. There are barriers to such analyses: combining processed data is challenging because varying processing methods make it difficult to compare data across studies, and combining data in raw form is challenging because of the resources needed to process the data. Multiple RNA-seq compendia, which are curated sets of RNA-seq data that have been pre-processed in a uniform fashion, exist; however, there is no such resource for plants. We created a comprehensive compendium for Arabidopsis thaliana using a pipeline based on Snakemake. We downloaded over 80 Arabidopsis studies from the Sequence Read Archive. Through a strict set of criteria, we chose 35 studies containing a total of 700 biological replicates, with a focus on the response of different Arabidopsis tissues to a variety of stresses. To make the studies comparable, we hand-curated the metadata and pre-processed and analyzed each sample using our pipeline. We performed exploratory analysis on the samples in our compendium, using PCA and t-SNE, for quality control and to identify biologically distinct subgroups. We discuss the differences between these two methods and show that the data separate primarily by tissue type and, to a lesser extent, by the type of stress. We identified treatment conditions for each study and generated three lists: differentially expressed genes, differentially expressed introns, and genes that were differentially expressed under multiple conditions. We then visually analyzed these groups, looking for overarching patterns within the data, and found around a thousand genes that participate in stress response across tissues and stresses.
dc.format.medium: born digital
dc.format.medium: masters theses
dc.identifier: Halladay_colostate_0053N_15724.pdf
dc.identifier.uri: https://hdl.handle.net/10217/208412
dc.language: English
dc.language.iso: eng
dc.publisher: Colorado State University. Libraries
dc.relation.ispartof: 2020-
dc.rights: Copyright and other restrictions may apply. User is responsible for compliance with all applicable laws. For information about copyright law, please see https://libguides.colostate.edu/copyright.
dc.subject: differential expression
dc.subject: RNA-sequencing
dc.subject: workflow management
dc.subject: principal component analysis
dc.subject: t-distributed stochastic neighbor embedding
dc.subject.lcsh: Arabidopsis thaliana
dc.title: A comprehensive compendium of Arabidopsis RNA-seq data
dc.type: Text
dcterms.rights.dpla: This Item is protected by copyright and/or related rights (https://rightsstatements.org/vocab/InC/1.0/). You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Colorado State University
thesis.degree.level: Masters
thesis.degree.name: Master of Science (M.S.)

Files

Original bundle
Name: Halladay_colostate_0053N_15724.pdf
Size: 6.72 MB
Format: Adobe Portable Document Format