Ferrer-Bonsoms, J.A. (Juan A.)

  • Integration of CLIP experiments of RNA-binding proteins: a novel approach to predict context-dependent splicing factors from transcriptomic data
    (Springer Science and Business Media LLC, 2019) Carazo-Melo, F. (Fernando); Ferrer-Bonsoms, J.A. (Juan A.); Gimeno-Combarro, M. (Marian); Rubio, A. (Ángel)
    Background: Splicing is a genetic process that has important implications in several diseases including cancer. Deciphering the complex rules of splicing regulation is crucial to understand and treat splicing-related diseases. Splicing factors and other RNA-binding proteins (RBPs) play a key role in the regulation of splicing. The specific binding sites of an RBP can be measured using CLIP experiments. However, to unveil which RBPs regulate a condition, it is necessary to have a priori hypotheses, as a single CLIP experiment targets a single protein. Results: In this work, we present a novel methodology to predict context-specific splicing factors from transcriptomic data. For this, we systematically collect, integrate and analyze more than 900 CLIP experiments stored in four CLIP databases: POSTAR2, CLIPdb, DoRiNA and StarBase. The analysis of these experiments shows strong coherence between the binding sites of RBPs of similar families. Augmenting this information with expression changes, we are able to correctly predict the splicing factors that regulate splicing in two gold-standard experiments in which specific splicing factors are knocked down. Conclusions: The methodology presented in this study allows the prediction of active splicing factors in cancer or any other condition using only transcript expression information. This approach opens a wide range of possible studies to understand the splicing regulation of different conditions. A tutorial with the source code and databases is available at https://gitlab.com/fcarazo.m/sfprediction.
  • ISOGO: Functional annotation of protein-coding splice variants
    (2020) Castilla, C. (Carlos); Carazo-Melo, F. (Fernando); Ferrer-Bonsoms, J.A. (Juan A.); Fernández-Acín, P. (Pablo); Cassol, I. (Ignacio); Rubio, A. (Ángel)
    The advent of RNA-seq technologies has switched the paradigm of genetic analysis from a genome to a transcriptome-based perspective. Alternative splicing generates functional diversity in genes, but the precise functions of many individual isoforms are yet to be elucidated. Gene Ontology was developed to annotate gene products according to their biological processes, molecular functions and cellular components. Although a single gene may have several gene products, most annotations are not isoform-specific and do not distinguish the functions of the different proteins originating from a single gene. Several approaches have tried to automatically annotate ontologies at the isoform level, but this has proven to be a daunting task. We have developed ISOGO (ISOform+GO function imputation), a novel algorithm to predict the function of coding isoforms based on their protein domains and their correlation of expression across 11,373 cancer patients. Combining these two sources of information outperforms previous approaches: it provides an area under the precision-recall curve (AUPRC) five times larger than previous attempts, and the median AUROC of assigned functions to genes is 0.82. We tested ISOGO predictions on some genes with isoform-specific functions (BRCA1, MADD, VAMP7 and ITSN1) and they were coherent with the literature. In addition, we examined whether the main isoform of each gene (as predicted by APPRIS) was the most likely to have the annotated gene functions, which is the case for 99.4% of the genes. We also evaluated the predictions for isoform-specific functions provided by the CAFA3 challenge and the results were also convincing. To make these results available to the scientific community, we have deployed a web application to consult ISOGO predictions (https://biotecnun.unav.es/app/isogo). Initial data, website link, isoform-specific GO function predictions and R code are available at https://gitlab.com/icassol/isogo.
  • Statistical, functional, upstream, and downstream analysis of Alternative Splicing.
    (Servicio de Publicaciones. Universidad de Navarra, 2022-04-21) Ferrer-Bonsoms, J.A. (Juan A.); Rubio-Díaz-Cordovés, Á. (Ángel)
    Alternative splicing (AS) is a post-transcriptional process by which a single gene can lead to different mRNAs and thus to different functional proteins. AS plays a key role in different diseases including cancer. In fact, all hallmarks of cancer have been associated with different mechanisms of abnormal AS. Therefore, AS has been studied as a source of novel biomarkers for survival or even as therapeutic targets. The advent of RNA-Sequencing (RNA-Seq) technologies has improved the genetic analysis of the human transcriptome. With the recent advances, it is possible to estimate isoform concentrations with good accuracy. Further, several methodologies have been developed to statistically analyze alternative splicing events. However, some aspects of AS remain an active field of research. In this work I will focus on the reconstruction of the transcriptome, the statistical analysis, the functional impact of AS, and the regulation of AS. There are RNA-Seq methods that estimate the structure and concentration of the isoforms of a gene, but the accuracy of these methods is far from perfect. Theoretical conditions that state whether a gene can be identified do exist in the case of single-end reads. Nevertheless, no theoretical results are yet available for paired-end reads. Regarding the functional impact of AS, the precise functions of many of the individual isoforms are still unknown. Multiple algorithms predict gene functions based on ontologies such as Gene Ontology, but most of these approaches do not distinguish the different functions of the gene transcripts. Besides, the biological impact of changes in aberrant AS is still an open question. The statistical analysis of AS is a challenging problem because the abundance of different isoforms, and therefore the differential splicing between the conditions under study, is estimated with different accuracy. A statistical approach that does not take into account these different accuracies for different isoforms will not be optimal. Understanding the mechanism that regulates splicing is also a challenge. There are algorithms that predict which RNA-binding protein (RBP) is driving the splicing. In our group, we developed an algorithm that uses published data on cross-linking immunoprecipitation (CLIP) experiments and makes its predictions relying on an enrichment analysis. Some assumptions of the enrichment analysis do not fit the data and thus some biases are introduced into the prediction. In this work we faced these challenges. We present a method to select the optimal read-fragment length combination to infer the structure and concentration of the isoforms of a gene for RNA-Seq using paired-end reads. This work also provides a framework to state whether the isoforms of a gene can be inferred from paired-end reads or not. We also developed a novel method to predict isoform-specific GO functions. Moreover, we described a novel statistical analysis of alternative splicing events based on bootstrap. This statistical analysis can be augmented with a protein domain enrichment analysis that aims to explain the functional impact of alternative splicing. Finally, we developed a novel algorithm that performs an enrichment analysis based on the Poisson-Binomial distribution, which avoids the biases introduced by Fisher’s exact test and is faster than the state-of-the-art algorithms. All the outcomes of these studies are available at public repositories.
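    The classical enrichment analysis that this abstract contrasts with the Poisson-Binomial approach can be illustrated with a one-sided Fisher's exact test, which reduces to a hypergeometric tail probability. The sketch below is only a minimal illustration under assumed inputs: the function name, its parameters and the example counts are hypothetical and are not taken from the thesis or its software.

```python
from math import comb


def hypergeom_upper_tail(total, bound, selected, overlap):
    """One-sided Fisher's exact test p-value, P(X >= overlap).

    Hypothetical setting (an assumption for illustration):
      total    - number of splicing events analyzed
      bound    - events with a CLIP binding site for a given RBP
      selected - events called differentially spliced
      overlap  - differentially spliced events that are also bound
    """
    denom = comb(total, selected)
    upper = min(bound, selected)
    # Sum the hypergeometric mass over all overlaps at least as large
    # as the observed one.
    numer = sum(
        comb(bound, k) * comb(total - bound, selected - k)
        for k in range(overlap, upper + 1)
    )
    return numer / denom


# Toy numbers: 10 events, 4 bound by the RBP, 3 differentially spliced,
# 2 of which are bound.
p_value = hypergeom_upper_tail(total=10, bound=4, selected=3, overlap=2)
print(f"enrichment p-value: {p_value:.4f}")
```

    This test treats every event as equally likely to be selected, which is the kind of assumption the abstract describes as a source of bias.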
  • On the identifiability of the isoform deconvolution problem: application to select the proper fragment length in an RNA-seq library
    (Oxford, 2022) Ferrer-Bonsoms, J.A. (Juan A.); Morales-Urteaga, X. (Xabier); Afshar, P.T. (Pegah T.); Wong, W.H. (Wing H.); Rubio, A. (Ángel)
    Motivation: Isoform deconvolution is an NP-hard problem. The accuracy of the proposed solutions is far from perfect. At present, it is not known if gene structure and isoform concentration can be uniquely inferred given paired-end reads, and there is no objective method to select the fragment length to improve the number of identifiable genes. Different pieces of evidence suggest that the optimal fragment length is gene-dependent, stressing the need for a method that selects the fragment length according to a reasonable trade-off across all the genes in the whole genome. Results: A gene is considered to be identifiable if it is possible to determine both the structure and concentration of its transcripts uniquely. Here, we present a method to state the identifiability of this deconvolution problem. Assuming a given transcriptome and that the coverage is sufficient to interrogate all junction reads of the transcripts, this method states whether or not a gene is identifiable given the read length and fragment length distribution. Applying this method using different read and fragment length combinations, the optimal average fragment length for the human transcriptome is around 400–600 nt for coding genes and 150–200 nt for long non-coding RNAs. The optimal read length is the largest one that fits in the fragment length. The potential benefit of combining several libraries to reconstruct the transcriptome is also discussed. Combining two libraries of very different fragment lengths results in a significant improvement in gene identifiability.
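    One part of the identifiability notion above can be illustrated with a simple linear-algebra check. The sketch below is a deliberate simplification and not the method of the paper: it assumes the transcript structures are already known and only asks whether their concentrations are uniquely determined, which holds when the matrix relating isoforms to observable read classes has full column rank. The matrix layout and function names are hypothetical.

```python
from fractions import Fraction


def column_rank(matrix):
    """Rank of a small matrix by exact Gaussian elimination."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a non-zero entry.
        pivot = next(
            (r for r in range(rank, len(rows)) if rows[r][col] != 0), None
        )
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def concentrations_identifiable(compat):
    """compat[r][i] = 1 if read class r can originate from isoform i.

    Assumption: expected counts are linear in isoform concentrations,
    so concentrations are unique iff compat has full column rank.
    """
    return column_rank(compat) == len(compat[0])


# Toy gene with exons A and B and isoforms (A), (B), (A+B).
exon_only = [[1, 0, 1],   # reads falling in exon A
             [0, 1, 1]]   # reads falling in exon B
with_junction = exon_only + [[0, 0, 1]]  # reads spanning the A-B junction

print(concentrations_identifiable(exon_only))      # False
print(concentrations_identifiable(with_junction))  # True
```

    In this toy case the read class that spans the junction is what makes the concentrations recoverable, which gives some intuition for why the choice of read and fragment length affects identifiability.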
  • Rediscover: an R package to identify mutually exclusive mutations
    (Oxford, 2022) Ferrer-Bonsoms, J.A. (Juan A.); Jareno, L. (Laura); Rubio, A. (Ángel)
    Motivation: Discover is an algorithm developed to identify mutually exclusive genomic events. Its main contribution is a statistical analysis based on the Poisson–Binomial (PB) distribution to take into account the mutation rate of genes and samples. Discover is very effective for identifying mutually exclusive mutations at the expense of speed in large datasets: the PB is computationally costly to estimate, and checking all the potential mutually exclusive alterations requires millions of tests. Results: We have implemented a new version of the package called Rediscover that implements exact and approximate computations of the PB. Rediscover's exact implementation is slightly faster than Discover for large and medium-sized datasets. The approximation is 100–1000 times faster for these datasets, making it possible to get results in less than a minute on a standard desktop. The memory footprint is also smaller in Rediscover. The new package is available at CRAN and provides some functions to integrate its usage with other R packages such as maftools and TCGAbiolinks.
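    The trade-off between exact and approximate PB computations can be illustrated with a small sketch. It is written in Python for illustration and is not the Rediscover code or API: the function names are hypothetical, the per-sample co-mutation probabilities are assumed to be given, and the normal approximation is used here only as a generic stand-in for an approximate method.

```python
import math


def poisson_binomial_pmf(probs):
    """Exact PMF of the number of successes among independent Bernoulli
    trials with different probabilities (dynamic programming, O(n^2))."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # this sample is not co-mutated
            nxt[k + 1] += mass * p       # this sample is co-mutated
        pmf = nxt
    return pmf


def exact_lower_tail(probs, observed):
    """P(X <= observed): few co-mutations suggest mutual exclusivity."""
    return min(1.0, sum(poisson_binomial_pmf(probs)[: observed + 1]))


def normal_lower_tail(probs, observed):
    """Normal approximation with continuity correction, O(n)."""
    mean = sum(probs)
    var = sum(p * (1.0 - p) for p in probs)
    if var == 0.0:
        return 1.0 if observed >= mean else 0.0
    z = (observed + 0.5 - mean) / math.sqrt(var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical per-sample probabilities that two genes are both mutated.
co_mutation_probs = [0.5, 0.5, 0.5, 0.5]
print(exact_lower_tail(co_mutation_probs, observed=1))   # 0.3125
print(round(normal_lower_tail(co_mutation_probs, observed=1), 4))
```

    The exact recursion costs quadratic time per gene pair, while the approximation is linear, which gives a sense of why an approximate PB matters when millions of pairs are tested.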