The distance and median problems in the single-cut-or-join model with single-gene duplications.
Conclusion: Our results provide a simple genome rearrangement model, extending the SCJ model to account for single-gene duplications, for which we prove a mix of tractability and hardness results. For the NP-complete rooted median problem, we design a simple Integer Linear Program. Our publicly available implementation of these algorithms for the directed distance and median problems allow to solve efficiently these problems on large instances. PMID: 32391071 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - May 13, 2020 Category: Molecular Biology Authors: Mane AC, Lafond M, Feijao PC, Chauve C Tags: Algorithms Mol Biol Source Type: research

Non-parametric and semi-parametric support estimation using SEquential RESampling random walks on biomolecular sequences.
Abstract Non-parametric and semi-parametric resampling procedures are widely used to perform support estimation in computational biology and bioinformatics. Among the most widely used methods in this class is the standard bootstrap method, which consists of random sampling with replacement. While not requiring assumptions about any particular parametric model for resampling purposes, the bootstrap and related techniques assume that sites are independent and identically distributed (i.i.d.). The i.i.d. assumption can be an over-simplification for many problems in computational biology and bioinformatics. In particu...
Source: Algorithms for Molecular Biology : AMB - April 25, 2020 Category: Molecular Biology Authors: Wang W, Smith J, Hejase HA, Liu KJ Tags: Algorithms Mol Biol Source Type: research

Linear-time algorithms for phylogenetic tree completion under Robinson-Foulds distance.
Abstract Background: We consider two fundamental computational problems that arise when comparing phylogenetic trees, rooted or unrooted, with non-identical leaf sets. The first problem arises when comparing two trees where the leaf set of one tree is a proper subset of the other. The second problem arises when the two trees to be compared have only partially overlapping leaf sets. The traditional approach to handling these problems is to first restrict the two trees to their common leaf set. An alternative approach that has shown promise is to first complete the trees by adding missing leaves, so that the resulti...
Source: Algorithms for Molecular Biology : AMB - April 23, 2020 Category: Molecular Biology Authors: Bansal MS Tags: Algorithms Mol Biol Source Type: research

From pairs of most similar sequences to phylogenetic best matches.
Conclusion: Improvements of tree-free orthology assessment methods can be expected from a combination of the accurate inference of best matches reported here and recent mathematical advances in the understanding of (reciprocal) best match graphs and orthology relations. Availability: Accompanying software is available at https://github.com/david-schaller/AsymmeTree. PMID: 32308731 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - April 22, 2020 Category: Molecular Biology Authors: Stadler PF, Geiß M, Schaller D, López Sánchez A, González Laffitte M, Valdivia DI, Hellmuth M, Hernández Rosales M Tags: Algorithms Mol Biol Source Type: research

Alignment- and reference-free phylogenomics with colored de Bruijn graphs.
Conclusions: The introduced new methodology for large-scale phylogenomics shows high potential. Application to different datasets confirms robustness of the approach. A comparison to other state-of-the-art whole-genome based methods indicates comparable or higher accuracy and efficiency. PMID: 32280365 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - April 15, 2020 Category: Molecular Biology Authors: Wittler R Tags: Algorithms Mol Biol Source Type: research

GrpClassifierEC: a novel classification approach based on the ensemble clustering space.
Conclusions: Our algorithm can be integrated with many other algorithms. In this research, we use only the k-means clustering algorithm with different k values. In future research, we propose several directions: (1) checking the effect of the clustering algorithm to build an ensemble clustering space. (2) Finding poor clustering results based on the training data, (3) reducing the volume of the data by combining similar points based on the EC. Availability and implementation: The KNIME workflow, implementing GrpClassifierEC, is available at https://malikyousef.com. PMID: 32082410 [PubMed] (Source: Algorithms for ...
Source: Algorithms for Molecular Biology : AMB - February 23, 2020 Category: Molecular Biology Authors: Abdallah L, Yousef M Tags: Algorithms Mol Biol Source Type: research

Finding all maximal perfect haplotype blocks in linear time.
Abstract Recent large-scale community sequencing efforts allow at an unprecedented level of detail the identification of genomic regions that show signatures of natural selection. Traditional methods for identifying such regions from individuals' haplotype data, however, require excessive computing times and therefore are not applicable to current datasets. In 2019, Cunha et al. (Advances in bioinformatics and computational biology: 11th Brazilian symposium on bioinformatics, BSB 2018, Niterói, Brazil, October 30 - November 1, 2018, Proceedings, 2018. 10.1007/978-3-030-01722-4_3) suggested the maximal perfe...
Source: Algorithms for Molecular Biology : AMB - February 16, 2020 Category: Molecular Biology Authors: Alanko J, Bannai H, Cazaux B, Peterlongo P, Stoye J Tags: Algorithms Mol Biol Source Type: research

Non-parametric correction of estimated gene trees using TRACTION.
Abstract Motivation: Estimated gene trees are often inaccurate, due to insufficient phylogenetic signal in the single gene alignment, among other causes. Gene tree correction aims to improve the accuracy of an estimated gene tree by using computational techniques along with auxiliary information, such as a reference species tree or sequencing data. However, gene trees and species trees can differ as a result of gene duplication and loss (GDL), incomplete lineage sorting (ILS), and other biological processes. Thus gene tree correction methods need to take estimation error as well as gene tree heterogeneity into acc...
Source: Algorithms for Molecular Biology : AMB - January 10, 2020 Category: Molecular Biology Authors: Christensen S, Molloy EK, Vachaspati P, Yammanuru A, Warnow T Tags: Algorithms Mol Biol Source Type: research

Kohdista: an efficient method to index and query possible Rmap alignments.
Conclusion: we demonstrate Kohdista is the only method that is capable of finding a significant number of high quality pairwise Rmap alignments for large eukaryote organisms in reasonable time. PMID: 31867049 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - December 24, 2019 Category: Molecular Biology Authors: Muggli MD, Puglisi SJ, Boucher C Tags: Algorithms Mol Biol Source Type: research

Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics.
Abstract Background: Genomic data analyses such as Genome-Wide Association Studies (GWAS) or Hi-C studies are often faced with the problem of partitioning chromosomes into successive regions based on a similarity matrix of high-resolution, locus-level measurements. An intuitive way of doing this is to perform a modified Hierarchical Agglomerative Clustering (HAC), where only adjacent clusters (according to the ordering of positions within a chromosome) are allowed to be merged. But a major practical drawback of this method is its quadratic time and space complexity in the number of loci, which is typically of the ...
Source: Algorithms for Molecular Biology : AMB - December 7, 2019 Category: Molecular Biology Authors: Ambroise C, Dehman A, Neuvial P, Rigaill G, Vialaneix N Tags: Algorithms Mol Biol Source Type: research

Super short operations on both gene order and intergenic sizes.
Abstract Background: The evolutionary distance between two genomes can be estimated by computing a minimum length sequence of operations, called genome rearrangements, that transform one genome into another. Usually, a genome is modeled as an ordered sequence of genes, and most of the studies in the genome rearrangement literature consist in shaping biological scenarios into mathematical models. For instance, allowing different genome rearrangements operations at the same time, adding constraints to these rearrangements (e.g., each rearrangement can affect at most a given number of genes), considering that a rearr...
Source: Algorithms for Molecular Biology : AMB - November 13, 2019 Category: Molecular Biology Authors: Oliveira AR, Jean G, Fertin G, Dias U, Dias Z Tags: Algorithms Mol Biol Source Type: research

Bayesian localization of CNV candidates in WGS data within minutes.
Conclusions: Using this approach, we discover several CNV candidates in two rat populations divergently selected for tame and aggressive behavior, consistent with earlier results concerning the domestication syndrome as well as experimental observations. Computationally, we observe a 29.5-fold decrease in memory, an average 5.8-fold speedup, as well as a 191-fold decrease in minor page faults. We also observe that metrics varied greatly in the old implementation, but not the new one. We conjecture that this is due to the better compression scheme. The fully Bayesian segmentation of the entire WGS data set required 3.5 ...
Source: Algorithms for Molecular Biology : AMB - October 3, 2019 Category: Molecular Biology Authors: Wiedenhoeft J, Cagan A, Kozhemyakina R, Gulevich R, Schliep A Tags: Algorithms Mol Biol Source Type: research

Implications of non-uniqueness in phylogenetic deconvolution of bulk DNA samples of tumors.
Conclusions: Awareness of non-uniqueness of solutions to the PPM problem is key to drawing accurate conclusions in downstream analyses based on tumor phylogenies. This work provides the theoretical foundations for non-uniqueness of solutions in tumor phylogeny inference from bulk DNA samples. PMID: 31497065 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - September 11, 2019 Category: Molecular Biology Authors: Qi Y, Pradhan D, El-Kebir M Tags: Algorithms Mol Biol Source Type: research

A branching process for homology distribution-based inference of polyploidy, speciation and loss.
Abstract Background: The statistical distribution of the similarity or difference between pairs of paralogous genes, created by whole genome doubling, or between pairs of orthologous genes in two related species is an important source of information about genomic evolution, especially in plants. Methods: We derive the mixture of distributions of sequence similarity for duplicate gene pairs generated by repeated episodes of whole gene doubling. This involves integrating sequence divergence and gene pair loss through fractionation, using a branching process and a mutational model. We account not only for the ti...
Source: Algorithms for Molecular Biology : AMB - August 9, 2019 Category: Molecular Biology Authors: Zhang Y, Zheng C, Sankoff D Tags: Algorithms Mol Biol Source Type: research

A multi-labeled tree dissimilarity measure for comparing "clonal trees" of tumor progression.
A multi-labeled tree dissimilarity measure for comparing "clonal trees" of tumor progression. Algorithms Mol Biol. 2019;14:17 Authors: Karpov N, Malikic S, Rahman MK, Sahinalp SC Abstract We introduce a new dissimilarity measure between a pair of "clonal trees", each representing the progression and mutational heterogeneity of a tumor sample, constructed by the use of single cell or bulk high throughput sequencing data. In a clonal tree, each vertex represents a specific tumor clone, and is labeled with one or more mutations in a way that each mutation is assigned to the oldest clo...
Source: Algorithms for Molecular Biology : AMB - August 4, 2019 Category: Molecular Biology Authors: Karpov N, Malikic S, Rahman MK, Sahinalp SC Tags: Algorithms Mol Biol Source Type: research

Statistically consistent divide-and-conquer pipelines for phylogeny estimation using NJMerge.
Conclusions: Theoretical and empirical results suggest that NJMerge is a valuable technique for large-scale phylogeny estimation, especially when computational resources are limited. NJMerge is freely available on Github (http://github.com/ekmolloy/njmerge). PMID: 31360216 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - August 1, 2019 Category: Molecular Biology Authors: Molloy EK, Warnow T Tags: Algorithms Mol Biol Source Type: research

A general framework for genome rearrangement with biological constraints.
Abstract This paper generalizes previous studies on genome rearrangement under biological constraints, using double cut and join (DCJ). We propose a model for weighted DCJ, along with a family of optimization problems called φ -MCPS (Minimum Cost Parsimonious Scenario), that are based on labeled graphs. We show how to compute solutions to general instances of φ -MCPS, given an algorithm to compute φ -MCPS on a circular genome with exactly one occurrence of each gene. These general instances can have an arbitrary number of circular and linear chromosomes, and arbitrary gene content. The practicality of ...
Source: Algorithms for Molecular Biology : AMB - August 1, 2019 Category: Molecular Biology Authors: Simonaitis P, Chateau A, Swenson KM Tags: Algorithms Mol Biol Source Type: research

Prefix-free parsing for building big BWTs.
Abstract High-throughput sequencing technologies have led to explosive growth of genomic databases; one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly-repetitive-a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input, and in one-pass ge...
Source: Algorithms for Molecular Biology : AMB - June 3, 2019 Category: Molecular Biology Authors: Boucher C, Gagie T, Kuhnle A, Langmead B, Manzini G, Mun T Tags: Algorithms Mol Biol Source Type: research

Linear time minimum segmentation enables scalable founder reconstruction.
Conclusions:  Our improvement enables to apply the formulation on an input of thousands of complete human chromosomes. We implemented the new algorithm and give experimental evidence on its practicality. The implementation is available in https://github.com/tsnorri/founder-sequences. PMID: 31131017 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - May 29, 2019 Category: Molecular Biology Authors: Norri T, Cazaux B, Kosolobov D, Mäkinen V Tags: Algorithms Mol Biol Source Type: research

An average-case sublinear forward algorithm for the haploid Li and Stephens model.
Conclusions: We show a forward algorithm which avoids any tradeoff between runtime and model complexity. Our algorithm makes use of two general strategies which might be applicable to improving the time complexity of other future sequence analysis algorithms: sparse dynamic programming matrices and lazy evaluation. PMID: 30988694 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - April 17, 2019 Category: Molecular Biology Authors: Rosen YM, Paten BJ Tags: Algorithms Mol Biol Source Type: research

Differentially mutated subnetworks discovery.
Abstract Problem: We study the problem of identifying differentially mutated subnetworks of a large gene-gene interaction network, that is, subnetworks that display a significant difference in mutation frequency in two sets of cancer samples. We formally define the associated computational problem and show that the problem is NP-hard. Algorithm: We propose a novel and efficient algorithm, called DAMOKLE, to identify differentially mutated subnetworks given genome-wide mutation data for two sets of cancer samples. We prove that DAMOKLE identifies subnetworks with statistically significant difference in mutatio...
Source: Algorithms for Molecular Biology : AMB - April 14, 2019 Category: Molecular Biology Authors: Hajkarim MC, Upfal E, Vandin F Tags: Algorithms Mol Biol Source Type: research

Repairing Boolean logical models from time-series data using Answer Set Programming.
Conclusions: The method was validated using known biological models from different species, as well as synthetic models obtained from randomly generated networks. We discuss the method's limitations regarding each of the updating schemes and the considered minimization algorithm. PMID: 30962813 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - April 10, 2019 Category: Molecular Biology Authors: Lemos A, Lynce I, Monteiro PT Tags: Algorithms Mol Biol Source Type: research

Connectivity problems on heterogeneous graphs.
Conclusion: Our results demonstrate that in contrast to most connectivity problems studied in computational biology, accounting for multiplicity of biological conditions adds considerable complexity, which we propose to address with a new solver. Importantly, our results extend to several network connectivity problems that are commonly used in computational biology, such as Prize-Collecting Steiner Tree, and provide insight into the theoretical guarantees for their applications in a multiple condition setting. PMID: 30899321 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - March 24, 2019 Category: Molecular Biology Authors: Wu J, Khodaverdian A, Weitz B, Yosef N Tags: Algorithms Mol Biol Source Type: research

External memory BWT and LCP computation for sequence collections with applications.
Conclusions: We prove that our algorithm performs O ( n maxlcp ) sequential I/Os, where n is the total length of the collection and maxlcp is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input. PMID: 30899322 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - March 24, 2019 Category: Molecular Biology Authors: Egidi L, Louza FA, Manzini G, Telles GP Tags: Algorithms Mol Biol Source Type: research

Semi-nonparametric modeling of topological domain formation from epigenetic data.
Conclusions: Our approach, nTDP, can form the basis of a unified, explanatory model of the relationship between epigenetic marks and topological domain structures. It can be used to predict domain boundaries for cell types, species, and conditions for which no Hi-C data is available. The model may also be of use for improving Hi-C-based domain finders. PMID: 30867673 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - March 15, 2019 Category: Molecular Biology Authors: Sefer E, Kingsford C Tags: Algorithms Mol Biol Source Type: research

SNPs detection by eBWT positional clustering.
Conclusions: Based on the results of the experiments on synthetic and real data, we conclude that the positional clustering framework can be effectively used for the problem of identifying SNPs, and it appears to be a promising approach for calling other type of variants directly on raw sequencing data. Availability: The software ebwt2snp is freely available for academic use at: https://github.com/nicolaprezza/ebwt2snp. PMID: 30839919 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - March 8, 2019 Category: Molecular Biology Authors: Prezza N, Pisanti N, Sciortino M, Rosone G Tags: Algorithms Mol Biol Source Type: research

Constrained incremental tree building: new absolute fast converging phylogeny estimation methods with improved scalability and accuracy.
In this study we present a new approach to large-scale phylogeny estimation that shares some of the features of DCM NJ but bypasses the use of supertree methods. We prove that this new approach is AFC and uses polynomial time and space. Furthermore, we describe variations on this basic approach that can be used with leaf-disjoint constraint trees (computed using methods such as maximum likelihood) to produce other methods that are likely to provide even better accuracy. Thus, we present a new generalizable technique for large-scale tree estimation that is designed to improve scalability for phylogeny estimation methods to...
Source: Algorithms for Molecular Biology : AMB - March 8, 2019 Category: Molecular Biology Authors: Zhang Q, Rao S, Warnow T Tags: Algorithms Mol Biol Source Type: research

Automated partial atomic charge assignment for drug-like molecules: a fast knapsack approach.
Abstract A key factor in computational drug design is the consistency and reliability with which intermolecular interactions between a wide variety of molecules can be described. Here we present a procedure to efficiently, reliably and automatically assign partial atomic charges to atoms based on known distributions. We formally introduce the molecular charge assignment problem, where the task is to select a charge from a set of candidate charges for every atom of a given query molecule. Charges are accompanied by a score that depends on their observed frequency in similar neighbourhoods (chemical environments) in...
Source: Algorithms for Molecular Biology : AMB - March 8, 2019 Category: Molecular Biology Authors: Engler MS, Caron B, Veen L, Geerke DP, Mark AE, Klau GW Tags: Algorithms Mol Biol Source Type: research

Regmex: a statistical tool for exploring motifs in ranked sequence lists from genomics experiments.
Conclusions: Regmex is a useful and flexible tool to analyze motif hypotheses that relates to large data sets in functional genomics. The method is available as an R package (https://github.com/muhligs/regmex). PMID: 30555524 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - December 19, 2018 Category: Molecular Biology Authors: Nielsen MM, Tataru P, Madsen T, Hobolth A, Pedersen JS Tags: Algorithms Mol Biol Source Type: research

Superbubbles revisited.
We present a reference implementation of the algorithm that accepts many commonly used formats for the input graph and provides convenient access to the improved algorithm. https://github.com/Fabianexe/Superbubble. PMID: 30519278 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - December 7, 2018 Category: Molecular Biology Authors: Gärtner F, Müller L, Stadler PF Tags: Algorithms Mol Biol Source Type: research

Coordinate systems for supergenomes.
Conclusion: Benchmarks on real-life data ranging from bacterial to fly genomes demonstrate the feasibility of computing good common coordinate systems. PMID: 30258487 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - September 29, 2018 Category: Molecular Biology Authors: Gärtner F, Höner Zu Siederdissen C, Müller L, Stadler PF Tags: Algorithms Mol Biol Source Type: research

Improved de novo peptide sequencing using LC retention time information.
Conclusions: Based on an evaluation for two prediction models on experimental data from synthesized peptides we conclude that the identification rates are improved by exploiting the chromatographic information. In our evaluation, we compare our algorithms using the retention time information with algorithms using the same scoring model, but not the retention time. PMID: 30181767 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - September 7, 2018 Category: Molecular Biology Authors: Frank Y, Hruz T, Tschager T, Venzin V Tags: Algorithms Mol Biol Source Type: research

Sorting signed circular permutations by super short operations.
Abstract Background: One way to estimate the evolutionary distance between two given genomes is to determine the minimum number of large-scale mutations, or genome rearrangements, that are necessary to transform one into the other. In this context, genomes can be represented as ordered sequences of genes, each gene being represented by a signed integer. If no gene is repeated, genomes are thus modeled as signed permutations of the form π=(π1π2…πn) , and in that case we can consider without loss of generality that one of them is the identity permutation ιn=(12…n) , and that we just ne...
Source: Algorithms for Molecular Biology : AMB - August 2, 2018 Category: Molecular Biology Authors: Oliveira AR, Fertin G, Dias U, Dias Z Tags: Algorithms Mol Biol Source Type: research

Split-inducing indels in phylogenomic analysis.
Conclusions: Suitably processed gap patterns extracted from genome-wide alignment provide a surprisingly clear phylogenetic signal and an allow the inference of accurate phylogenetic trees. PMID: 30026791 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - July 22, 2018 Category: Molecular Biology Authors: Donath A, Stadler PF Tags: Algorithms Mol Biol Source Type: research

A fast and accurate enumeration-based algorithm for haplotyping a triploid individual.
Conclusion: Extensive experimental comparisons were performed between the EHTLD algorithm and the well known HapCompass and HapTree. Compared with algorithms HapCompass and HapTree, the EHTLD algorithm can reconstruct more accurate haplotypes, which were proven by a number of experiments. PMID: 29881444 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - June 9, 2018 Category: Molecular Biology Authors: Wu J, Zhang Q Tags: Algorithms Mol Biol Source Type: research

Locus-aware decomposition of gene trees with respect to polytomous species trees.
Conclusions: Our approach allows for faster comparisons of gene and species trees and improves known algorithms for duplication inference in the presence of polytomies in the species trees. We have implemented our algorithms in a software tool available at https://github.com/mciach/LocusTreeInference. PMID: 29881445 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - June 9, 2018 Category: Molecular Biology Authors: Ciach MA, Muszewska A, Górecki P Tags: Algorithms Mol Biol Source Type: research

Finding local genome rearrangements.
Conclusions: A new variant of the weighted DCJ distance problem is addressed that ignores scenario length in its objective function. A solution to this problem provides a lower bound on the number of unlikely moves necessary when transforming one gene order into another. This lower bound aids in the study of rearrangement scenarios with respect to chromatin structure, and could eventually be used in the design of a fixed parameter algorithm with a more general objective function. PMID: 29755580 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - May 16, 2018 Category: Molecular Biology Authors: Simonaitis P, Swenson KM Tags: Algorithms Mol Biol Source Type: research

Outlier detection in BLAST hits.
Conclusion: Our experiments make a good case for using a two-step approach for accurate taxonomic assignment. We show that our method can be used as a filtering step before using phylogenetic methods and provides a way to interpret BLAST results using more information than provided by E-values and bit-scores alone. PMID: 29588650 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - March 30, 2018 Category: Molecular Biology Authors: Shah N, Altschul SF, Pop M Tags: Algorithms Mol Biol Source Type: research

FSH: fast spaced seed hashing exploiting adjacent hashes.
Conclusions: Spaced seed hashing is a routine task for several bioinformatics application. FSH allows to perform this task efficiently and raise the question of whether other hashing can be exploited to further improve the speed up. This has the potential of major impact in the field, making spaced seed applications not only accurate, but also faster and more efficient. Availability: The software FSH is freely available for academic use at: https://bitbucket.org/samu661/fsh/overview. PMID: 29588651 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - March 30, 2018 Category: Molecular Biology Authors: Girotto S, Comin M, Pizzi C Tags: Algorithms Mol Biol Source Type: research

OCTAL: Optimal Completion of gene trees in polynomial time.
Conclusions: OCTAL is a useful technique for adding missing taxa to incomplete gene trees and provides good accuracy under a wide range of model conditions. However, results show that OCTAL's accuracy can be reduced when incomplete lineage sorting is high, as the reference tree can be far from the true gene tree. Hence, this study suggests that OCTAL would benefit from using other types of reference trees instead of species trees when there are large topological distances between true gene trees and species trees. PMID: 29568323 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - March 24, 2018 Category: Molecular Biology Authors: Christensen S, Molloy EK, Vachaspati P, Warnow T Tags: Algorithms Mol Biol Source Type: research

Fast phylogenetic inference from typing data.
This article includes both a theoretical analysis and extensive experimental results concerning the proposed algorithm. We further show how this algorithm can be successfully integrated into a well known phylogenetic inference method, and how it can be used to speedup querying local phylogenetic patterns over large typing databases. PMID: 29467814 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - February 24, 2018 Category: Molecular Biology Authors: Carriço JA, Crochemore M, Francisco AP, Pissis SP, Ribeiro-Gonçalves B, Vaz C Tags: Algorithms Mol Biol Source Type: research

Derivative-free neural network for optimizing the scoring functions associated with dynamic programming of pairwise-profile alignment.
Conclusions: We developed and implemented a novel derivative-free neural network and aligner (Nepal) for optimizing sequence alignments. Nepal improved alignment quality by adapting to remote sequence alignments and increasing the expressiveness of similarity scores. Additionally, this novel scoring function can be realized using a simple matrix operation and easily incorporated into other aligners. Moreover our scoring function could potentially improve the performance of homology detection and/or multiple-sequence alignment of remote homologous sequences. The goal of the study was to provide a novel scoring function for ...
Source: Algorithms for Molecular Biology : AMB - February 24, 2018 Category: Molecular Biology Authors: Yamada KD Tags: Algorithms Mol Biol Source Type: research

A safe and complete algorithm for metagenomic assembly.
AI Abstract Background: Reconstructing the genome of a species from short fragments is one of the oldest bioinformatics problems. Metagenomic assembly is a variant of the problem asking to reconstruct the circular genomes of all bacterial species present in a sequencing sample. This problem can be naturally formulated as finding a collection of circular walks of a directed graph G that together cover all nodes, or edges, of G. Approach: We address this problem with the "safe and complete" framework of Tomescu and Medvedev (Research in computational Molecular biology-20th annual conference, RECOMB 9...
Source: Algorithms for Molecular Biology : AMB - February 17, 2018 Category: Molecular Biology Authors: Obscura Acosta N, Mäkinen V, Tomescu AI Tags: Algorithms Mol Biol Source Type: research

Time-consistent reconciliation maps and forbidden time travel.
Hellmuth M Abstract Background: In the absence of horizontal gene transfer it is possible to reconstruct the history of gene families from empirically determined orthology relations, which are equivalent to event-labeled gene trees. Knowledge of the event labels considerably simplifies the problem of reconciling a gene tree T with a species trees S, relative to the reconciliation problem without prior knowledge of the event types. It is well-known that optimal reconciliations in the unlabeled case may violate time-consistency and thus are not biologically feasible. Here we investigate the mathematical structure of...
Source: Algorithms for Molecular Biology : AMB - February 15, 2018 Category: Molecular Biology Authors: Nøjgaard N, Geiß M, Merkle D, Stadler PF, Wieseke N, Hellmuth M Tags: Algorithms Mol Biol Source Type: research

Gene tree parsimony for incomplete gene trees: addressing true biological loss.
We present new theory for GTP considering whether the incompleteness is due to gene birth and death (i.e., true biological loss) or taxon sampling, and present dynamic programming algorithms that can be used for an exact but exponential time solution for small numbers of taxa, or as a heuristic for larger numbers of taxa. We also prove that the "standard" calculations for duplications and losses exactly solve GTP when incompleteness results from taxon sampling, although they can be incorrect when incompleteness results from true biological loss. The software for the DP algorithm is freely available as open source...
Source: Algorithms for Molecular Biology : AMB - February 2, 2018 Category: Molecular Biology Authors: Bayzid MS, Warnow T Tags: Algorithms Mol Biol Source Type: research

Phylogeny reconstruction based on the length distribution of k-mismatch common substrings.
CA Abstract Background: Various approaches to alignment-free sequence comparison are based on the length of exact or inexact word matches between pairs of input sequences. Haubold et al. (J Comput Biol 16:1487-1500, 2009) showed how the average number of substitutions per position between two DNA sequences can be estimated based on the average length of exact common substrings. Results: In this paper, we study the length distribution of k-mismatch common substrings between two sequences. We show that the number of substitutions per position can be accurately estimated from the position of a local maximum in ...
Source: Algorithms for Molecular Biology : AMB - December 16, 2017 Category: Molecular Biology Authors: Morgenstern B, Schöbel S, Leimeister CA Tags: Algorithms Mol Biol Source Type: research

Generalized enhanced suffix array construction in external memory.
Conclusions: The proposed algorithm was validated through performance tests using real datasets from different domains, in various combinations, and showed a competitive performance. Our algorithm can also construct the generalized Burrows-Wheeler transform of a string collection with no additional cost except by the output time. PMID: 29234460 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - December 14, 2017 Category: Molecular Biology Authors: Louza FA, Telles GP, Hoffmann S, Ciferri CDA Tags: Algorithms Mol Biol Source Type: research

HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing.
CONCLUSIONS: THAlign-II provides a user-friendly web server based on our distributed computing infrastructure. HAlign-II with open-source codes and datasets was established at http://lab.malab.cn/soft/halign. PMID: 29026435 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - October 15, 2017 Category: Molecular Biology Authors: Wan S, Zou Q Tags: Algorithms Mol Biol Source Type: research

Algorithms for matching partially labelled sequence graphs.
CONCLUSIONS: The methods develop here still need refinement and augmentation from constraints other than the sequence data alone, such as known interactions from annotation and databases, or non-trivial relationships in genome location. With the ever growing numbers of eukaryotic genomes, it is hoped that the methods described here will open a route to the use of these data equal to the current success attained with bacterial sequences. PMID: 29021818 [PubMed] (Source: Algorithms for Molecular Biology : AMB)
Source: Algorithms for Molecular Biology : AMB - October 14, 2017 Category: Molecular Biology Authors: Taylor WR Tags: Algorithms Mol Biol Source Type: research

Biologically feasible gene trees, reconciliation maps and informative triples.
Abstract BACKGROUND: The history of gene families-which are equivalent to event-labeled gene trees-can be reconstructed from empirically estimated evolutionary event-relations containing pairs of orthologous, paralogous or xenologous genes. The question then arises as whether inferred event-labeled gene trees are biologically feasible, that is, if there is a possible true history that would explain a given gene tree. In practice, this problem is boiled down to finding a reconciliation map-also known as DTL-scenario-between the event-labeled gene trees and a (possibly unknown) species tree. RESULTS: In this co...
Source: Algorithms for Molecular Biology : AMB - September 3, 2017 Category: Molecular Biology Authors: Hellmuth M Tags: Algorithms Mol Biol Source Type: research