Browsing by Author "Ben-Hur, Asa, advisor"
Now showing 1 - 16 of 16
Item Open Access A comparative analysis of alternative splicing in A. Thaliana and C. Reinhardtii (Colorado State University. Libraries, 2010) Labadorf, Adam, author; Ben-Hur, Asa, advisor; Rajopadhye, Sanjay Vishnu, committee member; Reddy, Anireddy S. N., committee member
The extent of alternative splicing in plants and the mechanisms that cause it are not currently well understood. A recent study in the model organism Arabidopsis thaliana estimates that approximately 42% of intron-containing genes are alternatively spliced, and it is speculated that this number may be much higher. Results from our previous studies showed that the single-celled alga Chlamydomonas reinhardtii also exhibits alternative splicing characteristic of plants. In this work we present the results of a comprehensive alternative splicing analysis using the largest Expressed Sequence Tag (EST) datasets available for both of these organisms, describe an analysis pipeline tailored to these large datasets, and conduct a cross-organism comparative analysis of aspects related to alternative splicing.

Item Open Access A comprehensive compendium of Arabidopsis RNA-seq data (Colorado State University. Libraries, 2020) Halladay, Gareth A., author; Ben-Hur, Asa, advisor; Chitsaz, Hamidreza, committee member; Reddy, Anireddy, committee member
In the last fifteen years, the amount of publicly available genomic sequencing data has doubled every few months. Analyzing large collections of RNA-seq datasets can provide insights that are not available when analyzing data from single experiments. There are barriers to such analyses: combining processed data is challenging because varying methods for processing data make it difficult to compare data across studies; combining data in raw form is challenging because of the resources needed to process the data.
Multiple RNA-seq compendiums, which are curated sets of RNA-seq data that have been pre-processed in a uniform fashion, exist; however, there is no such resource for plants. We created a comprehensive compendium for Arabidopsis thaliana using a pipeline based on Snakemake. We downloaded over 80 Arabidopsis studies from the Sequence Read Archive. Through a strict set of criteria, we chose 35 studies containing a total of 700 biological replicates, with a focus on the response of different Arabidopsis tissues to a variety of stresses. In order to make the studies comparable, we hand-curated the metadata and pre-processed and analyzed each sample using our pipeline. We performed exploratory analysis on the samples in our compendium, using PCA and t-SNE, for quality control and to identify biologically distinct subgroups. We discuss the differences between these two methods and show that the data separates primarily by tissue type and, to a lesser extent, by the type of stress. We identified treatment conditions for each study and generated three lists: differentially expressed genes, differentially expressed introns, and genes that were differentially expressed under multiple conditions. We then visually analyzed these groups, looking for overarching patterns within the data, finding around a thousand genes that participate in stress response across tissues and stresses.

Item Open Access Accurate prediction of protein function using GOstruct (Colorado State University. Libraries, 2011) Sokolov, Artem, author; Ben-Hur, Asa, advisor; Anderson, Chuck, committee member; McConnell, Ross M., committee member; Wang, Haonan, committee member
With the growing number of sequenced genomes, automatic prediction of protein function is one of the central problems in computational biology. Traditional methods employ transfer of functional annotation on the basis of sequence or structural similarity and are unable to effectively deal with today's noisy high-throughput biological data.
Most of the approaches based on machine learning, on the other hand, break the problem up into a collection of binary classification problems, effectively asking the question "does this protein perform this particular function?"; such methods often produce a set of predictions that are inconsistent with each other. In this work, we present GOstruct, a structured-output framework that answers the question "what function does this protein perform?" in the context of hierarchical multilabel classification. We show that GOstruct is able to effectively deal with a large number of disparate data sources from multiple species. Our empirical results demonstrate that the framework achieves state-of-the-art accuracy in two of the recent challenges in automatic function prediction: Mousefunc and CAFA.

Item Open Access Assessment of protein-protein interfaces using graph neural networks (Colorado State University. Libraries, 2021) Virupaksha, Yashwanth Reddy, author; Ben-Hur, Asa, advisor; Anderson, Charles W., committee member; Adams, Henry Hugh, committee member
Proteins are fundamental building blocks of cellular function. They systematically interact with other proteins to make life happen. Understanding these protein-protein interactions is important for obtaining a detailed understanding of protein function and for enabling the process of drug and vaccine design. Experimental methods for studying protein interfaces, including X-ray crystallography, NMR, and cryo-electron microscopy, are expensive, time-consuming, and sometimes unsuccessful due to the unstable nature of many protein-protein interactions. Computational docking experiments are a cheap and fast alternative. Docking algorithms produce a large number of potential solutions that are then ranked by quality. However, current scoring methods are not good enough for finding a docking solution that is close to the native structure. That has led to the development of machine learning methods for this task.
These methods typically involve extensive engineering of features to describe the protein complex, and are not very successful at identifying good-quality solutions among the top ranks. In this thesis, we propose a scoring technique that uses graph neural networks that function at the atomic level to learn the interfaces of docked proteins without the need for feature engineering. We evaluate our model and show that it performs better than commonly used docking methods and deep learning methods that use 3D CNNs.

Item Open Access Data mining techniques for temporal point processes applied to insurance claims data (Colorado State University. Libraries, 2008) Iverson, Todd Ashley, author; Ben-Hur, Asa, advisor; Iyer, Hariharan K., advisor
We explore data mining on databases consisting of insurance claims information. This dissertation focuses on two major topics we considered by way of data mining procedures. One is the development of a classification rule using kernels and support vector machines. The other is the discovery of association rules using the Apriori algorithm, its extensions, as well as a new association rules technique. With regard to the first topic, we address the question: can kernel methods using an SVM classifier be used to predict patients at risk of type 2 diabetes using three years of insurance claims data? We report the results of a study in which we tested the performance of new methods for data extracted from the MarketScan® database. We summarize the results of applying popular kernels, as well as new kernels constructed specifically for this task, for support vector machines on data derived from this database. We were able to predict patients at risk of type 2 diabetes with nearly 80% success when combining a number of specialized kernels. The specific form of the data, that of a timed sequence, led us to develop two new kernels inspired by dynamic time warping.
The Global Time Warping (GTW) and Local Time Warping (LTW) kernels build on an existing time warping kernel by including the timing coefficients present in classical time warping, while providing a solution for the diagonal dominance present in most alignment methods. We show that the LTW kernel performs significantly better than the existing time warping kernel when the times contain relevant information. With regard to the second topic, we provide a new theorem on closed rules that could help substantially improve the time to find a specific type of rule. An insurance claims database contains codes indicating associated diagnoses and the resulting procedures for each claim. The rules that we consider are of the form diagnoses imply procedures. In addition, we introduce a new class of interesting association rules in the context of medical claims databases and illustrate their potential uses by extracting example rules from the MarketScan® database.

Item Open Access Deep learning for bioinformatics sequences: RNA basecalling and protein interactions (Colorado State University. Libraries, 2024) Neumann, Don, author; Ben-Hur, Asa, advisor; Beveridge, Ross, committee member; Blanchard, Nathaniel, committee member; Reddy, Anireddy, committee member
In the interdisciplinary field of bioinformatics, sequence data for biological problems comes in many different forms. This ranges from proteins, to RNA, to the ionic current for a strand of nucleotides from an Oxford Nanopore Technologies sequencing device. This data can be used to elucidate the fundamentals of biological processes on many levels, which can help humanity with everything from drug design to curing disease. All of our research focuses on biological problems encoded as sequences. The main focus of our research involves Oxford Nanopore Technologies sequencing devices, which are capable of directly sequencing long-read RNA strands as is.
We first concentrate on improving the basecalling accuracy for RNA, and have published a paper with a novel architecture achieving state-of-the-art performance. The basecalling architecture uses convolutional blocks with progressively larger kernel sizes, which improves accuracy given the noisy nature of the data. We then describe ongoing research into the detection of post-transcriptional RNA modifications in nanopore sequencing data. Building on our basecalling research, we are able to discern modifications with read-level resolution. Our work will facilitate research into the detection of N6-methyladenosine (m6A) while also furthering progress in the detection of other post-transcriptional modifications. Finally, we recount our recently accepted paper regarding protein-protein and host-pathogen interaction prediction. We performed experiments demonstrating that faulty experimental design has plagued the field of interaction prediction, giving the false impression that the problem has been solved. We then provide reasoning and recommendations for future work.

Item Open Access From RNA-Seq to gene annotation using the splicegrapher method (Colorado State University. Libraries, 2013) Rogers, Mark F., author; Ben-Hur, Asa, advisor; Boucher, Christina, committee member; Anderson, Charles, committee member; Reddy, Anireddy S. N., committee member
Messenger RNA (mRNA) plays a central role in carrying out the instructions encoded in a gene. A gene's components may be combined in various ways to generate a diverse range of mRNA molecules, or transcripts, through a process called alternative splicing (AS). This allows each gene to produce different products under different conditions, such as different stages of development or in different tissues. Researchers can study the diverse set of transcripts a gene produces by sequencing its mRNA.
The latest sequencing technology produces millions of short sequence reads (RNA-Seq) from mRNA transcripts, providing researchers with unprecedented opportunities to assess how genetic instructions change under different conditions. It is relatively inexpensive and easy to obtain these reads, but one limitation has been the lack of versatile methods to analyze the data. Most methods attempt to predict complete mRNA transcripts from patterns of RNA-Seq reads ascribed to a particular gene, but the short length of these reads makes transcript prediction problematic. We present a method, called SpliceGrapherXT, that takes a different approach by predicting splice graphs that capture in a single structure all the ways in which a gene's components may be assembled. Whereas other methods make predictions primarily from RNA-Seq evidence, SpliceGrapherXT uses gene annotations describing known transcripts to guide its predictions. We show that this approach allows SpliceGrapherXT to make predictions that encapsulate gene architectures more accurately than other state-of-the-art methods. This accuracy is crucial for updating gene annotations, and our splice graph predictions can also contribute to more accurate transcript predictions. Finally, we demonstrate that by using SpliceGrapherXT to assess AS on a genome-wide scale, we can gain new insights into the ways that specific genes and environmental conditions may impact an organism's transcriptome. SpliceGrapherXT is available for download at http://splicegrapher.sourceforge.net.

Item Open Access Large margin kernel methods for calmodulin binding prediction (Colorado State University. Libraries, 2010) Hamilton, Michael, author; Ben-Hur, Asa, advisor; Anderson, Charles, committee member; Iyer, Hariharan, committee member
Protein-protein interactions are involved in nearly all molecular processes of organisms.
However, direct laboratory techniques for identifying binding partners remain expensive and difficult at the proteome scale. In this work, kernel methods for predicting calmodulin binding partners and calmodulin binding sites are presented. Furthermore, we compare binary and structural support vector machines with multiple kernels defined over protein sequences.

Item Open Access Large margin methods for partner specific prediction of interfaces in protein complexes (Colorado State University. Libraries, 2014) Minhas, Fayyaz ul Amir Afsar, author; Ben-Hur, Asa, advisor; Draper, Bruce, committee member; Anderson, Charles, committee member; Snow, Christopher, committee member
The study of protein interfaces and binding sites is a very important domain of research in bioinformatics. Information about the interfaces between proteins can be used not only in understanding protein function but can also be directly employed in drug design and protein engineering. However, the experimental determination of protein interfaces is cumbersome, expensive and not possible in some cases with today's technology. As a consequence, the computational prediction of protein interfaces from sequence and structure has emerged as a very active research area. A number of machine learning based techniques have been proposed for the solution to this problem. However, the prediction accuracy of most such schemes is very low. In this dissertation we present large-margin classification approaches that have been designed to directly model different aspects of protein complex formation as well as the characteristics of available data. Most existing machine learning techniques for this task are partner-independent in nature, i.e., they ignore the fact that the propensity of a protein to bind to another protein depends on characteristics of residues in both proteins.
We have developed a pairwise support vector machine classifier called PAIRpred to predict protein interfaces in a partner-specific fashion. Due to its more detailed model of the problem, PAIRpred offers state-of-the-art accuracy in predicting both binding sites at the protein level and inter-protein residue contacts at the complex level. PAIRpred uses sequence and structure conservation, local structural similarity and surface geometry, residue solvent exposure and template-based features derived from the unbound structures of proteins forming a protein complex. We have investigated the impact of explicitly modeling the inter-dependencies between residues that are imposed by the overall structure of a protein during the formation of a protein complex through transductive and semi-supervised learning models. We also present a novel multiple instance learning scheme called MI-1 that explicitly models imprecision in sequence-level annotations of binding sites in proteins that bind calmodulin to achieve state-of-the-art prediction accuracy for this task.

Item Open Access Large-scale automated protein function prediction (Colorado State University. Libraries, 2016) Kahanda, Indika, author; Ben-Hur, Asa, advisor; Anderson, Chuck, committee member; Draper, Bruce, committee member; Zhou, Wen, committee member
Proteins are the workhorses of life, and identifying their functions is a very important biological problem. The function of a protein can be loosely defined as everything it performs or that happens to it. The Gene Ontology (GO) is a structured vocabulary which captures protein function in a hierarchical manner and contains thousands of terms. Through various wet-lab experiments over the years scientists have been able to annotate a large number of proteins with GO categories which reflect their functionality. However, experimentally determining protein functions is a highly resource-intensive task, and a large fraction of proteins remain un-annotated.
Recently a plethora of automated methods have emerged, and their reasonable success in computationally determining the functions of proteins using a variety of data sources (sequence or structure similarity, or various biological network data) has led to establishing automated function prediction (AFP) as an important problem in bioinformatics. In a typical machine learning problem, cross-validation is the protocol of choice for evaluating the accuracy of a classifier. However, due to the process of accumulation of annotations over time, we identify AFP as a combination of two sub-tasks: making predictions on annotated proteins and making predictions on previously unannotated proteins. In our first project, we analyze the performance of several protein function prediction methods in these two scenarios. Our results show that GOstruct, an AFP method that our lab has previously developed, and two other popular methods, binary SVMs and guilt by association, find it hard to achieve the same level of accuracy on these two tasks compared to the performance evaluated through cross-validation, and that predicting novel annotations for previously annotated proteins is a harder problem than predicting annotations for uncharacterized proteins. We develop GOstruct 2.0 by proposing improvements that allow the model to make use of information about a protein's current annotations to better handle the task of predicting novel annotations for previously annotated proteins. Experimental results on yeast and human data show that GOstruct 2.0 outperforms the original GOstruct, demonstrating the effectiveness of the proposed improvements. Although the biomedical literature is a very informative resource for identifying protein function, most AFP methods do not take advantage of the large amount of information contained in it. In our second project, we conduct the first-ever comprehensive evaluation of the effectiveness of literature data for AFP.
Specifically, we extract co-mentions of protein-GO term pairs and bag-of-words features from the literature and explore their effectiveness in predicting protein function. Our results show that literature features are very informative of protein function but with further room for improvement. In order to improve the quality of automatically extracted co-mentions, we formulate the classification of co-mentions as a supervised learning problem and propose a novel method based on graph kernels. Experimental results indicate the feasibility of using this co-mention classifier as a complementary method that aids the bio-curators who are responsible for maintaining databases such as Gene Ontology. This is the first study of the problem of protein-function relation extraction from biomedical text. The recently developed Human Phenotype Ontology (HPO), which is very similar to GO, is a standardized vocabulary for describing the phenotype abnormalities associated with human diseases. At present, only a small fraction of human protein coding genes have HPO annotations. However, researchers believe that a large portion of currently unannotated genes are related to disease phenotypes. Therefore, it is important to predict gene-HPO term associations using accurate computational methods. In our third project, we introduce PHENOstruct, a computational method that directly predicts the set of HPO terms for a given gene. We compare PHENOstruct with several baseline methods and show that it outperforms them in every respect. Furthermore, we highlight a collection of informative data sources suitable for the problem of predicting gene-HPO associations, including large-scale literature mining data.

Item Open Access Leveraging expression and network data for protein function prediction (Colorado State University. Libraries, 2012) Graim, Kiley, author; Ben-Hur, Asa, advisor; Anderson, Chuck, committee member; Achter, Jeff, committee member
Protein function prediction is one of the prominent problems in bioinformatics today. Protein annotation is slowly falling behind as more and more genomes are being sequenced. Experimental methods are expensive and time-consuming, which leaves computational methods to fill the gap. While computational methods are still not accurate enough to be used without human supervision, that is the goal. The Gene Ontology (GO) is a collection of terms that are the standard for protein function annotations. Because of the structure of GO, protein function prediction is a hierarchical multi-label classification problem. The classification method used in this thesis is GOstruct, which performs structured predictions that take into account all GO terms. GOstruct has been shown to work well, but there are still improvements to be made. In this thesis, I work to improve predictions by building new kernels from the data that are used by GOstruct. To do this, I find key representations of the data that help define what kernels perform best on the variety of data types. I apply this methodology to function prediction in two model organisms, Saccharomyces cerevisiae and Mus musculus, and find better methods for interpreting the data.

Item Open Access Machine learning models towards elucidating the plant intron retention code (Colorado State University. Libraries, 2017) Sneham, Swapnil, author; Ben-Hur, Asa, advisor; Chitsaz, Hamidreza, committee member; Peterson, Christopher, committee member
Alternative Splicing is a process that allows a single gene to encode multiple proteins. Intron Retention (IR) is a type of alternative splicing which is mainly prevalent in plants, but has been shown to regulate gene expression in various organisms and is often involved in rare human diseases. Despite its important role, not much research has been done to understand IR.
The motivation behind this research work is to better understand IR and how it is regulated by various biological factors. We designed a combination of 137 features, forming an "intron retention code", to reveal the factors that contribute to IR. Using random forest and support vector machine classifiers, we show the usefulness of these features for the task of predicting whether an intron is subject to IR or not. An analysis of the top-ranking features for this task reveals a high level of similarity of the most predictive features across the three plant species, demonstrating the conservation of the factors that determine IR. We also found a high level of similarity to the top features contributing to IR in mammals. The task of predicting the response to drought stress proved more difficult, with lower levels of accuracy and lower levels of similarity across species, suggesting that additional features need to be considered for predicting condition-specific IR.

Item Open Access Protein interface prediction using graph convolutional networks (Colorado State University. Libraries, 2017) Fout, Alex M., author; Ben-Hur, Asa, advisor; Anderson, Chuck, committee member; Chitsaz, Hamidreza, committee member; Zhou, Wen, committee member
Proteins play a critical role in processes both within and between cells, through their interactions with each other and other molecules. Proteins interact via an interface forming a protein complex, which is difficult, expensive, and time-consuming to determine experimentally, giving rise to computational approaches. These computational approaches utilize known electrochemical properties of protein amino acid residues in order to predict if they are a part of an interface or not. Prediction can occur in a partner-independent fashion, where amino acid residues are considered independently of their neighbor, or in a partner-specific fashion, where pairs of potentially interacting residues are considered together.
Ultimately, prediction of protein interfaces can help illuminate cellular biology, improve our understanding of diseases, and aid pharmaceutical research. Interface prediction has historically been performed with a variety of methods, including docking, template matching, and more recently, machine learning approaches. The field of machine learning has undergone a revolution of sorts with the emergence of convolutional neural networks as the leading method of choice for a wide swath of tasks. Enabled by large quantities of data and the increasing power and availability of computing resources, convolutional neural networks efficiently detect patterns in grid-structured data and generate hierarchical representations that prove useful for many types of problems. This success has motivated the work presented in this thesis, which seeks to improve upon state-of-the-art interface prediction methods by incorporating concepts from convolutional neural networks. Proteins are inherently irregular, so they don't easily conform to a grid structure, whereas a graph representation is much more natural. Various convolution operations have been proposed for graph data, each geared towards a particular application. We adapted these convolutions for use in interface prediction, and proposed two new variants. Neural networks were trained on the Docking Benchmark Dataset version 4.0 complexes and tested on the new complexes added in version 5.0. Results were compared against the state-of-the-art partner-specific method, PAIRpred [1]. Results show that multiple variants of graph convolution outperform PAIRpred, with no method emerging as the clear winner. In the future, additional training data may be incorporated from other sources, unsupervised pretraining such as autoencoding may be employed, and a generalization of convolution to simplicial complexes may also be explored.
In addition, the various graph convolution approaches may be applied to other applications with graph-structured data, such as Quantitative Structure Activity Relationship (QSAR) learning, and knowledge base inference.

Item Open Access Quality assessment of docked protein interfaces using 3D convolution (Colorado State University. Libraries, 2021) Bontha, Mridula, author; Ben-Hur, Asa, advisor; Beveridge, J. Ross, committee member; King, Emily J., committee member
Proteins play a vital role in most biological processes, most of which occur through interactions between proteins. When proteins interact they form a complex, whose functionality is different from the individual proteins in the complex. Therefore understanding protein interactions and their interfaces is an important problem. Experimental methods for this task are expensive and time-consuming, which has led to the development of docking methods for predicting the structures of protein complexes. These methods produce a large number of potential solutions, and the energy functions used in these methods are not good enough to find solutions that are close to the native state of the complex. Deep learning and its ability to model complex problems has opened up the opportunity to model protein complexes and learn from scratch how to rank docking solutions. As a part of this work, we have developed a 3D convolutional network approach that uses raw atomic densities to address this problem. Our method achieves performance which is on par with state-of-the-art methods. We have evaluated our model on docked protein structures simulated from four docking tools, namely ZDOCK, HADDOCK, FRODOCK and ClusPro, on targets from Docking Benchmark Data version 5 (DBD5).

Item Open Access Quality assessment of protein structures using graph convolutional networks (Colorado State University. Libraries, 2024) Roy, Soumyadip, author; Ben-Hur, Asa, advisor; Blanchard, Nathaniel, committee member; Zhou, Wen, committee member
The prediction of protein 3D structure is essential for understanding protein function, drug discovery, and disease mechanisms; with the advent of methods like AlphaFold that are capable of producing very high quality decoys, ensuring the quality of those decoys can provide further confidence in the accuracy of their predictions. In this work we describe Qε, a graph convolutional network that utilizes a minimal set of atom and residue features as input to predict the global distance test total score (GDTTS) and local distance difference test score (lDDT) of a decoy. To improve the model's performance, we introduce a novel loss function based on the ε-insensitive loss function used for SVM-regression. This loss function is specifically designed for the characteristics of the quality assessment problem, and provides predictions with improved accuracy over standard loss functions used for this task. Despite using only a minimal set of features, it matches the performance of recent state-of-the-art methods like DeepUMQA. The code for Qε is available at https://github.com/soumyadip1997/qepsilon.

Item Open Access Uncovering the role of epigenetics in alternative splicing (Colorado State University. Libraries, 2020) Ullah, Fahad, author; Ben-Hur, Asa, advisor; Anderson, Charles, committee member; Chitsaz, Hamidreza, committee member; Reddy, Anireddy S. N., committee member
Alternative Splicing (AS) is a regulated phenomenon that enables a single gene to encode structurally and functionally different biomolecules (proteins, non-coding RNAs, etc.) that play important roles in an organism's development and growth. It has also been implicated in multiple diseases including cancer, thalassemia, and spinal muscular atrophy. Recent studies have shown that AS is widespread in both plants and animals.
Moreover, it has been reported that splicing occurs co-transcriptionally and that chromatin state is important for understanding the regulation of AS. Most of the previous efforts made to elucidate the regulation of AS used sequence information alone. However, in this study our goal is to understand AS from an epigenetic perspective: how chromatin organization, accessibility, and modifications are involved in its regulation. Intron Retention (IR) is the most frequent form of AS in plants; however, very little is known about its regulation, particularly regarding the role of chromatin state. Therefore, as a first step, we investigate the relationship between IR and chromatin accessibility in two plant species: Arabidopsis and rice. We report a strong association between chromatin accessibility and IR. Our findings suggest that chromatin is more open and accessible in IR. Furthermore, we discover motifs associated with the regulation of alternative and constitutively spliced introns, many of which match those of known transcription factors and are conserved between Arabidopsis and rice, a strong indication of their functional importance. Recent studies have suggested that IR is highly prevalent in humans as well. Using the plethora of genomic data that is available for human, we design a deep learning model for predicting IR in regions of open chromatin. Our model exhibits good accuracy in terms of Area Under the ROC Curve (AUC), with median AUC = 0.80. Moreover, we identify motifs enriched in IR events with significant hits to known human transcription factors (TFs). The zinc finger family exhibits the highest activity in IR events, a prediction that is validated using ChIP-Seq data. Experiments by our collaborators have validated our predictions in several candidate IR events. Finally, as an effort to capture the complete regulatory landscape of alternative splicing, we investigate the cooperativity and interactions between regulatory sequence features.
To that end, we design a self-attention model that combines convolutional and recurrent layers with a self-attention layer that helps us capture a global view of the landscape of interactions between regulatory elements in a sequence. We evaluate our method on several datasets and compare it to existing methodology. In each experiment, our model identifies numerous statistically significant TF interactions, many of which have been previously reported. Finally, using this model with the chromatin accessibility in IR dataset, we identify many interactions primarily involving the zinc finger family of transcription factors. Our approach not only provides a global, biologically relevant set of interactions but, unlike existing methods, it does not require a computationally expensive postprocessing step. In summary, this dissertation sheds light on the epigenetic regulation of alternative splicing by transcription factors, and also contributes methodologically by making the results of deep learning models more interpretable.
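
The self-attention layer mentioned in the last abstract above can be illustrated with a minimal sketch. This is not the dissertation's model: it assumes identity query, key, and value projections (no learned weights), uses made-up per-position feature vectors, and the function names `softmax` and `self_attention` are hypothetical. It only shows how scaled dot-product attention yields a position-by-position weight matrix that can be read as pairwise interaction strengths between sequence positions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Scaled dot-product self-attention with identity projections.

    features: one vector per sequence position (for example, hypothetical
    motif-detector activations). Returns (outputs, weights), where
    weights[i][j] is how strongly position i attends to position j.
    """
    d = len(features[0])
    scale = math.sqrt(d)
    weights = []
    for query in features:
        # Similarity of this position to every position, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(query, key)) / scale
                  for key in features]
        weights.append(softmax(scores))
    # Each output is the attention-weighted average of all position vectors.
    outputs = [[sum(w * v[j] for w, v in zip(row, features))
                for j in range(d)]
               for row in weights]
    return outputs, weights


# Toy input (an assumption for illustration): positions 0 and 2 carry the
# same feature pattern, position 1 carries a different one.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(features)
for row in weights:
    print([round(w, 3) for w in row])
```

In this toy input, positions with matching feature patterns receive larger mutual attention weights than mismatched ones, which is the sense in which an attention matrix can be inspected directly for candidate interactions.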