Pages-Zamora, A. (A.)

Search Results

Now showing 1 - 2 of 2
  • Thumbnail Image
    EMVC-2: an efficient single-nucleotide variant caller based on expectation maximization
    (Oxford University Press, 2024) Xargay-Ferrer, M. (Martí); Pages-Zamora, A. (A.); Ochoa-Álvarez, I. (Idoia); Dufort-y-Álvarez, G. (Guillermo)
    Motivation Single-nucleotide variants (SNVs) are the most common type of genetic variation in the human genome. Accurate and efficient detection of SNVs from next-generation sequencing (NGS) data is essential for various applications in genomics and personalized medicine. However, SNV calling methods usually suffer from high computational complexity and limited accuracy. In this context, there is a need for new methods that overcome these limitations and provide fast reliable results. Results We present EMVC-2, a novel method for SNV calling from NGS data. EMVC-2 uses a multi-class ensemble classification approach based on the expectation–maximization algorithm that infers at each locus the most likely genotype from multiple labels provided by different learners. The inferred variants are then validated by a decision tree that filters out unlikely ones. We evaluate EMVC-2 on several publicly available real human NGS data for which the set of SNVs is available, and demonstrate that it outperforms state-of-the-art variant callers in terms of accuracy and speed, on average. Availability and implementation EMVC-2 is coded in C and Python, and is freely available for download at: https://github.com/guilledufort/EMVC-2. EMVC-2 is also available in Bioconda.
  • Thumbnail Image
    Unsupervised ensemble learning for genome sequencing
    (2022) Villalvilla-Ornat, P. (P.); Pages-Zamora, A. (A.); Ochoa-Álvarez, I. (Idoia); Ruiz-Cavero, G. (G.)
    Unsupervised ensemble learning refers to methods devised for a particular task that combine data pro-vided by decision learners taking into account their reliability, which is usually inferred from the data. Here, the variant calling step of the next generation sequencing technologies is formulated as an unsuper-vised ensemble classification problem. A variant calling algorithm based on the expectation-maximization algorithm is further proposed that estimates the maximum-a-posteriori decision among a number of classes larger than the number of different labels provided by the learners. Experimental results with real human DNA sequencing data show that the proposed algorithm is competitive compared to state-of -the-art variant callers as GATK, HTSLIB, and Platypus.(c) 2022 The Author(s). Published by Elsevier Ltd.This is an open access article under the CC BY-NC-ND license ( http://creativecommons.org/licenses/by-nc-nd/4.0/ )