A comprehensive compendium of Arabidopsis RNA-seq data
dc.contributor.author | Halladay, Gareth A., author | |
dc.contributor.author | Ben-Hur, Asa, advisor | |
dc.contributor.author | Chitsaz, Hamidreza, committee member | |
dc.contributor.author | Reddy, Anireddy, committee member | |
dc.date.accessioned | 2020-06-22T11:52:29Z | |
dc.date.available | 2020-06-22T11:52:29Z | |
dc.date.issued | 2020 | |
dc.description.abstract | In the last fifteen years, the amount of publicly available genomic sequencing data has doubled every few months. Analyzing large collections of RNA-seq datasets can provide insights that are not available when analyzing data from single experiments. There are barriers to such analyses: combining processed data is challenging because varying processing methods make it difficult to compare data across studies; combining data in raw form is challenging because of the resources needed to process the data. Multiple RNA-seq compendiums, which are curated sets of RNA-seq data that have been pre-processed in a uniform fashion, exist; however, there is no such resource for plants. We created a comprehensive compendium for Arabidopsis thaliana using a pipeline based on Snakemake. We downloaded over 80 Arabidopsis studies from the Sequence Read Archive. Through a strict set of criteria, we chose 35 studies containing a total of 700 biological replicates, with a focus on the response of different Arabidopsis tissues to a variety of stresses. To make the studies comparable, we hand-curated the metadata, then pre-processed and analyzed each sample using our pipeline. We performed exploratory analysis on the samples in our compendium, using PCA and t-SNE, both for quality control and to identify biologically distinct subgroups. We discuss the differences between these two methods and show that the data separate primarily by tissue type and, to a lesser extent, by type of stress. We identified treatment conditions for each study and generated three lists: differentially expressed genes, differentially expressed introns, and genes that were differentially expressed under multiple conditions. We then visually analyzed these groups, looking for overarching patterns within the data, and found around a thousand genes that participate in stress response across tissues and stresses. | |
dc.format.medium | born digital | |
dc.format.medium | masters theses | |
dc.identifier | Halladay_colostate_0053N_15724.pdf | |
dc.identifier.uri | https://hdl.handle.net/10217/208412 | |
dc.language | English | |
dc.language.iso | eng | |
dc.publisher | Colorado State University. Libraries | |
dc.relation.ispartof | 2020- | |
dc.rights | Copyright and other restrictions may apply. User is responsible for compliance with all applicable laws. For information about copyright law, please see https://libguides.colostate.edu/copyright. | |
dc.subject | differential expression | |
dc.subject | RNA-sequencing | |
dc.subject | workflow management | |
dc.subject | principal component analysis | |
dc.subject | t-distributed stochastic neighbor embedding | |
dc.subject | Arabidopsis thaliana | |
dc.title | A comprehensive compendium of Arabidopsis RNA-seq data | |
dc.type | Text | |
dcterms.rights.dpla | This Item is protected by copyright and/or related rights (https://rightsstatements.org/vocab/InC/1.0/). You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s). | |
thesis.degree.discipline | Computer Science | |
thesis.degree.grantor | Colorado State University | |
thesis.degree.level | Masters | |
thesis.degree.name | Master of Science (M.S.) |
Files
Original bundle
- Name: Halladay_colostate_0053N_15724.pdf
- Size: 6.72 MB
- Format: Adobe Portable Document Format