Extremely fast construction and querying of compacted and colored de Bruijn graphs with GGCAT [METHODS]
We present GGCAT, a tool for constructing both compacted and colored de Bruijn graphs, based on a new approach that merges the k-mer counting step with the unitig construction step, as well as on numerous practical optimizations. For compacted de Bruijn graph construction, GGCAT achieves speed-ups of 3x to 21x compared with the state-of-the-art tool Cuttlefish 2. When constructing the colored variant, GGCAT achieves speed-ups of 5x to 39x compared with the state-of-the-art tool BiFrost. Additionally, GGCAT is up to 480x faster than BiFrost for batch sequence queries on colored graphs. (Source: Genome Research)
Source: Genome Research - August 24, 2023 Category: Genetics & Stem Cells Authors: Cracco, A., Tomescu, A. I. Tags: METHODS Source Type: research

Efficient mapping of accurate long reads in minimizer space with mapquik [METHODS]
DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introd...
Source: Genome Research - August 24, 2023 Category: Genetics & Stem Cells Authors: Ekim, B., Sahlin, K., Medvedev, P., Berger, B., Chikhi, R. Tags: METHODS Source Type: research

Proving sequence aligners can guarantee accuracy in almost O(m log n) time through an average-case analysis of the seed-chain-extend heuristic [METHODS]
Seed-chain-extend with k-mer seeds is a powerful heuristic technique for sequence alignment used by modern sequence aligners. Although effective in practice for both runtime and accuracy, theoretical guarantees on the resulting alignment do not exist for seed-chain-extend. In this work, we give the first rigorous bounds for the efficacy of seed-chain-extend with k-mers in expectation. Assume we are given a random nucleotide sequence of length ~n that is indexed (or seeded) and a mutated substring of length ~m ≤ n with mutation rate < 0.206. We prove that we can find a k = Θ(log n) for the k-mer size such that the expe...
Source: Genome Research - August 24, 2023 Category: Genetics & Stem Cells Authors: Shaw, J., Yu, Y. W. Tags: METHODS Source Type: research
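
To illustrate the seed and chain stages named in the abstract above, here is a minimal Python sketch. It is not the authors' analysis or any aligner's implementation: the function names are hypothetical, seeds are exact k-mer matches, and chaining simply maximizes the number of colinear, non-overlapping anchors with a quadratic dynamic program rather than the linear-gap cost chaining studied in the paper. The extension step is omitted.

```python
def find_anchors(ref, query, k):
    """Seed: return sorted (ref_pos, query_pos) pairs of exact k-mer matches."""
    index = {}
    for i in range(len(ref) - k + 1):
        index.setdefault(ref[i:i + k], []).append(i)
    return sorted(
        (i, j)
        for j in range(len(query) - k + 1)
        for i in index.get(query[j:j + k], [])
    )


def chain(anchors, k):
    """Chain: longest run of colinear, non-overlapping anchors (O(a^2) DP).

    Assumption: chain score is just the anchor count; real aligners
    also penalize gaps between consecutive anchors.
    """
    if not anchors:
        return []
    best = [1] * len(anchors)   # best chain length ending at each anchor
    prev = [-1] * len(anchors)  # back-pointer to the previous anchor
    for b in range(len(anchors)):
        for a in range(b):
            fits = (anchors[a][0] + k <= anchors[b][0]
                    and anchors[a][1] + k <= anchors[b][1])
            if fits and best[a] + 1 > best[b]:
                best[b] = best[a] + 1
                prev[b] = a
    end = max(range(len(anchors)), key=best.__getitem__)
    result = []
    while end != -1:
        result.append(anchors[end])
        end = prev[end]
    return result[::-1]
```

For example, `chain(find_anchors(ref, query, k), k)` returns the anchors between which an aligner would then run gap extension.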

Entropy predicts sensitivity of pseudorandom seeds [METHODS]
In this study, we propose a model to estimate the entropy of a seed and find that seeds with high entropy, according to our model, in most cases have high match sensitivity. Our discovered seed randomness–sensitivity relationship explains why some seeds perform better than others, and the relationship provides a framework for designing even more sensitive seeds. We also present three new strobemer seed constructs: mixedstrobes, altstrobes, and multistrobes. We use both simulated and biological data to show that our new seed constructs improve sequence-matching sensitivity relative to other strobemers. We show that the three n...
Source: Genome Research - August 24, 2023 Category: Genetics & Stem Cells Authors: Maier, B. D., Sahlin, K. Tags: METHODS Source Type: research

Efficient minimizer orders for large values of k using minimum decycling sets [METHODS]
Minimizers are ubiquitously used in data structures and algorithms for efficient searching, mapping, and indexing of high-throughput DNA sequencing data. Minimizer schemes select a minimum k-mer in every L-long subsequence of the target sequence, where minimality is with respect to a predefined k-mer order. Commonly used minimizer orders select more k-mers than necessary and therefore provide limited improvement in runtime and memory usage of downstream analysis tasks. The recently introduced universal k-mer hitting sets produce minimizer orders with fewer selected k-mers. Generating compact universal k-mer hitting sets is...
Source: Genome Research - August 24, 2023 Category: Genetics & Stem Cells Authors: Pellow, D., Pu, L., Ekim, B., Kotlar, L., Berger, B., Shamir, R., Orenstein, Y. Tags: METHODS Source Type: research
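
The minimizer scheme described in the abstract above (select a minimum k-mer in every L-long window, under a predefined k-mer order) can be sketched in a few lines of Python. This is only an illustration: the function name `minimizers` and the default lexicographic order are assumptions, and the decycling-set orders the paper proposes are not implemented here.

```python
def minimizers(seq, k, L, order=None):
    """Return (position, k-mer) pairs chosen as minimizers of seq.

    Every window of L consecutive characters contributes its smallest
    k-mer under `order`; ties go to the leftmost occurrence.
    """
    if order is None:
        # Assumption: plain lexicographic order, used only for illustration.
        order = lambda kmer: kmer
    selected = set()
    for start in range(len(seq) - L + 1):
        # Rank every k-mer that lies fully inside the current window.
        candidates = [
            (order(seq[pos:pos + k]), pos)
            for pos in range(start, start + L - k + 1)
        ]
        _, best_pos = min(candidates)
        selected.add(best_pos)
    return [(pos, seq[pos:pos + k]) for pos in sorted(selected)]
```

A different k-mer order can be passed in through `order`, for instance a function that maps each k-mer to a hash value. The fraction of positions selected is the quantity that better orders aim to reduce.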

Leveraging protein language models for accurate multiple sequence alignments [METHODS]
Multiple sequence alignment (MSA) is a critical step in the study of protein sequence and function. Typically, MSA algorithms progressively align pairs of sequences and combine these alignments with the aid of a guide tree. These alignment algorithms use scoring systems based on substitution matrices to measure amino acid similarities. Although successful, standard methods struggle on sets of proteins with low sequence identity: the so-called twilight zone of protein alignment. For these difficult cases, another source of information is needed. Protein language models are a powerful new approach that leverages massive sequ...
Source: Genome Research - August 24, 2023 Category: Genetics & Stem Cells Authors: McWhite, C. D., Armour-Garb, I., Singh, M. Tags: METHODS Source Type: research

Unsupervised contrastive peak caller for ATAC-seq [METHODS]
The assay for transposase-accessible chromatin with sequencing (ATAC-seq) is a common assay to identify chromatin accessible regions by using a Tn5 transposase that can access, cut, and ligate adapters to DNA fragments for subsequent amplification and sequencing. These sequenced regions are quantified and tested for enrichment in a process referred to as "peak calling." Most unsupervised peak calling methods are based on simple statistical models and suffer from elevated false positive rates. Newly developed supervised deep learning methods can be successful, but they rely on high quality labeled data for training, which c...
Source: Genome Research - August 24, 2023 Category: Genetics & Stem Cells Authors: Vu, H. T. H., Zhang, Y., Tuteja, G., Dorman, K. S. Tags: METHODS Source Type: research

Partial alignment of multislice spatially resolved transcriptomics data [METHODS]
Spatially resolved transcriptomics (SRT) technologies measure messenger RNA (mRNA) expression at thousands of locations in a tissue slice. However, nearly all SRT technologies measure expression in two-dimensional (2D) slices extracted from a 3D tissue, thus losing information that is shared across multiple slices from the same tissue. Integrating SRT data across multiple slices can help recover this information and improve downstream expression analyses, but multislice alignment and integration remains a challenging task. Existing methods for integrating SRT data either do not use spatial information or assume that the mo...
Source: Genome Research - August 24, 2023 Category: Genetics & Stem Cells Authors: Liu, X., Zeira, R., Raphael, B. J. Tags: METHODS Source Type: research

Enabling tradeoffs in privacy and utility in genomic data Beacons and summary statistics [METHODS]
The collection and sharing of genomic data are becoming increasingly commonplace in research, clinical, and direct-to-consumer settings. The computational protocols typically adopted to protect individual privacy include sharing summary statistics, such as allele frequencies, or limiting query responses to the presence/absence of alleles of interest using web services called Beacons. However, even such limited releases are susceptible to likelihood ratio–based membership-inference attacks. Several approaches have been proposed to preserve privacy, which either suppress a subset of genomic variants or modify query res...
Source: Genome Research - August 24, 2023 Category: Genetics & Stem Cells Authors: Venkatesaramani, R., Wan, Z., Malin, B. A., Vorobeychik, Y. Tags: METHODS Source Type: research

Assessing transcriptomic reidentification risks using discriminative sequence models [METHODS]
Gene expression data provide molecular insights into the functional impact of genetic variation, for example, through expression quantitative trait loci (eQTLs). With an improving understanding of the association between genotypes and gene expression comes a greater concern that gene expression profiles could be matched to genotype profiles of the same individuals in another data set, known as a linking attack. Prior works demonstrating such a risk could analyze only the fraction of eQTLs that are independent, owing to restrictive model assumptions, leaving the full extent of this risk incompletely understood. To address this challenge...
Source: Genome Research - August 24, 2023 Category: Genetics & Stem Cells Authors: Sadhuka, S., Fridman, D., Berger, B., Cho, H. Tags: METHODS Source Type: research

Single-cell methylation sequencing data reveal succinct metastatic migration histories and tumor progression models [METHODS]
Recent studies exploring the impact of methylation in tumor evolution suggest that although the methylation status of many CpG sites is preserved across distinct lineages, the status of others is altered as the cancer progresses. Because changes in methylation status of a CpG site may be retained in mitosis, they could be used to infer the progression history of a tumor via single-cell lineage tree reconstruction. In this work, we introduce the first principled distance-based computational method, Sgootr, for inferring a tumor's single-cell methylation lineage tree and for jointly identifying lineage-informative CpG sites that ...
Source: Genome Research - August 24, 2023 Category: Genetics & Stem Cells Authors: Liu, Y., Li, X. C., Rashidi Mehrabadi, F., Schäffer, A. A., Pratt, D., Crawford, D. R., Malikic, S., Molloy, E. K., Gopalan, V., Mount, S. M., Ruppin, E., Aldape, K. D., Sahinalp, S. C. Tags: METHODS Source Type: research

Modeling and predicting cancer clonal evolution with reinforcement learning [METHODS]
Cancer results from an evolutionary process that typically yields multiple clones with varying sets of mutations within the same tumor. Accurately modeling this process is key to understanding and predicting cancer evolution. Here, we introduce clone to mutation (CloMu), a flexible and low-parameter tree generative model of cancer evolution. CloMu uses a two-layer neural network trained via reinforcement learning to determine the probability of new mutations based on the existing mutations on a clone. CloMu supports several prediction tasks, including the determination of evolutionary trajectories, tree selection, causalit...
Source: Genome Research - August 24, 2023 Category: Genetics & Stem Cells Authors: Ivanovic, S., El-Kebir, M. Tags: METHODS Source Type: research

Efficient taxa identification using a pangenome index [METHODS]
We present new algorithms and methods for solving this problem. Specifically, given a collection of d documents, over an alphabet of size , we extend the r-index with additional words to support document listing queries for a pattern that occurs in documents in in time and space, where w is the machine word size. Applied in a bacterial mock community experiment, our method is up to three times faster than a comparable method that uses the standard r-index locate queries. We show that our method classifies both simulated and real nanopore reads at the strain level with higher accuracy compared with other approaches. Finally...
Source: Genome Research - August 24, 2023 Category: Genetics & Stem Cells Authors: Ahmed, O., Rossi, M., Boucher, C., Langmead, B. Tags: METHODS Source Type: research

Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash [METHODS]
Sketching methods offer computational biologists scalable techniques to analyze data sets that continue to grow in size. MinHash is one such technique to estimate set similarity that has enjoyed recent broad application. However, traditional MinHash has previously been shown to perform poorly when applied to sets of very dissimilar sizes. FracMinHash was recently introduced as a modification of MinHash to compensate for this lack of performance when set sizes differ. This approach has been successfully applied to metagenomic taxonomic profiling in the widely used tool sourmash gather. Although experimental evidence has bee...
Source: Genome Research - August 24, 2023 Category: Genetics & Stem Cells Authors: Rahman Hera, M., Pierce-Ward, N. T., Koslicki, D. Tags: METHODS Source Type: research
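
As a rough illustration of the FracMinHash idea mentioned in the abstract above, the Python sketch below keeps every k-mer hash that falls under a fixed fraction of the hash range, so the sketch size scales with the set size. The function names, the choice of BLAKE2b as the hash, and the `scale` parameter are assumptions made for this example. It does not reproduce sourmash's implementation or the paper's confidence-interval derivation.

```python
import hashlib

MAX_HASH = 2 ** 64  # size of the 64-bit hash range used below


def kmer_set(seq, k):
    """All distinct k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is an arbitrary choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq, k, scale):
    """Keep every k-mer hash below MAX_HASH / scale (about 1/scale of them)."""
    threshold = MAX_HASH // scale
    return {h for h in map(hash64, kmer_set(seq, k)) if h < threshold}


def containment(sketch_a, sketch_b):
    """Estimated fraction of A's k-mers that also occur in B."""
    if not sketch_a:
        return 0.0
    return len(sketch_a & sketch_b) / len(sketch_a)
```

Because the threshold is fixed rather than the sketch size, two sketches built with the same `k` and `scale` can be compared directly even when the underlying sets differ greatly in size.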

A fast and scalable method for inferring phylogenetic networks from trees by aligning lineage taxon strings [METHODS]
The reconstruction of phylogenetic networks is an important but challenging problem in phylogenetics and genome evolution, as the space of phylogenetic networks is vast and cannot be sampled well. One approach to the problem is to solve the minimum phylogenetic network problem, in which phylogenetic trees are first inferred, and then the smallest phylogenetic network that displays all the trees is computed. The approach takes advantage of the fact that the theory of phylogenetic trees is mature, and there are excellent tools available for inferring phylogenetic trees from a large number of biomolecular sequences. A tree...
Source: Genome Research - August 24, 2023 Category: Genetics & Stem Cells Authors: Zhang, L., Abhari, N., Colijn, C., Wu, Y. Tags: METHODS Source Type: research