Improving quartet graph construction for scalable and accurate species tree estimation from gene trees [METHODS]
Summary methods are widely used to estimate species trees from genome-scale data. However, they can fail to produce accurate species trees when the input gene trees are highly discordant because of estimation error and biological processes, such as incomplete lineage sorting. Here, we introduce TREE-QMC, a new summary method that offers accuracy and scalability under these challenging scenarios. TREE-QMC builds upon weighted Quartet Max Cut, which takes weighted quartets as input and then constructs a species tree in a divide-and-conquer fashion, at each step forming a graph and seeking its max cut. The wQMC method has bee...
Source: Genome Research - August 24, 2023 Category: Genetics & Stem Cells Authors: Han, Y., Molloy, E. K. Tags: METHODS Source Type: research

Leveraging family data to design Mendelian randomization that is provably robust to population stratification [METHODS]
Mendelian randomization (MR) has emerged as a powerful approach to leverage genetic instruments to infer causality between pairs of traits in observational studies. However, the results of such studies are susceptible to biases owing to weak instruments, as well as the confounding effects of population stratification and horizontal pleiotropy. Here, we show that family data can be leveraged to design MR tests that are provably robust to confounding from population stratification, assortative mating, and dynastic effects. We show in simulations that our approach, MR-Twin, is robust to confounding from population stratificat...
Source: Genome Research - August 24, 2023 Category: Genetics & Stem Cells Authors: LaPierre, N., Fu, B., Turnbull, S., Eskin, E., Sankararaman, S. Tags: METHODS Source Type: research

Ultrafast genome-wide inference of pairwise coalescence times [METHODS]
We describe how this approach works, show its performance on simulated and real data, and illustrate its use in studying recent positive selection in the 1000 Genomes Project data set.
Source: Genome Research - August 24, 2023 Category: Genetics & Stem Cells Authors: Schweiger, R., Durbin, R. Tags: METHODS Source Type: research

Fast inference of genetic recombination rates in biobank scale data [METHODS]
Although rates of recombination events across the genome (genetic maps) are fundamental to genetic research, the majority of current studies only use one standard map. There is evidence suggesting population differences in genetic maps, and thus estimating population-specific maps is of interest. Although the recent availability of biobank-scale data offers such opportunities, current methods are not efficient at leveraging very large sample sizes. The most accurate methods are still linkage disequilibrium (LD)–based methods that are only tractable for a few hundred samples. In this work, we propose a fast and memo...
Source: Genome Research - August 24, 2023 Category: Genetics & Stem Cells Authors: Naseri, A., Yue, W., Zhang, S., Zhi, D. Tags: METHODS Source Type: research

Minimal positional substring cover is a haplotype threading alternative to Li and Stephens model [METHODS]
The Li and Stephens (LS) hidden Markov model (HMM) models the process of reconstructing a haplotype as a mosaic copy of haplotypes in a reference panel. For small panels, the probabilistic parameterization of LS enables modeling the uncertainties of such mosaics. However, LS becomes inefficient when sample size is large, because of its linear time complexity. Recently the PBWT, an efficient data structure capturing the local haplotype matching among haplotypes, was proposed to offer a fast method for giving some optimal solution (Viterbi) to the LS HMM. Previously, we introduced the minimal positional substring cover (MPSC...
Source: Genome Research - August 24, 2023 Category: Genetics & Stem Cells Authors: Sanaullah, A., Zhi, D., Zhang, S. Tags: METHODS Source Type: research

Leveraging protein language models for accurate multiple sequence alignments [METHOD]
Multiple sequence alignment (MSA) is a critical step in the study of protein sequence and function. Typically, MSA algorithms progressively align pairs of sequences and combine these alignments with the aid of a guide tree. These alignment algorithms use scoring systems based on substitution matrices to measure amino acid similarities. Although successful, standard methods struggle on sets of proteins with low sequence identity: the so-called twilight zone of protein alignment. For these difficult cases, another source of information is needed. Protein language models are a powerful new approach that leverages massive sequ...
Source: Genome Research - August 18, 2023 Category: Genetics & Stem Cells Authors: McWhite, C. D., Armour-Garb, I., Singh, M. Tags: METHOD Source Type: research

Elasmobranch genome sequencing reveals evolutionary trends of vertebrate karyotypic organization [RESEARCH]
Genomic studies of vertebrate chromosome evolution have long been hindered by the scarcity of chromosome-scale DNA sequences of some key taxa. One of those limiting taxa has been the elasmobranchs (sharks and rays), which harbor species often with numerous chromosomes and enlarged genomes. Here, we report the chromosome-scale genome assembly for the zebra shark Stegostoma tigrinum, an endangered species that has a relatively small genome among sharks (3.71 Gb), as well as for the whale shark Rhincodon typus. Our analysis employing a male-female comparison identified an X Chromosome, the first genomically characterized shar...
Source: Genome Research - August 17, 2023 Category: Genetics & Stem Cells Authors: Yamaguchi, K., Uno, Y., Kadota, M., Nishimura, O., Nozu, R., Murakumo, K., Matsumoto, R., Sato, K., Kuraku, S. Tags: RESEARCH Source Type: research

ZSWIM8 destabilizes many murine microRNAs and is required for proper embryonic growth and development [RESEARCH]
MicroRNAs (miRNAs) pair to sites in mRNAs to direct the degradation of these RNA transcripts. Conversely, certain RNA transcripts can direct the degradation of particular miRNAs. This target-directed miRNA degradation (TDMD) requires the ZSWIM8 E3 ubiquitin ligase. Here, we report the function of ZSWIM8 in the mouse embryo. Zswim8–/– embryos were smaller than their littermates and died near the time of birth. This highly penetrant perinatal lethality was apparently caused by a lung sacculation defect attributed to failed maturation of alveolar epithelial cells. Some mutant individuals also had heart ventricular...
Source: Genome Research - August 17, 2023 Category: Genetics & Stem Cells Authors: Shi, C. Y., Elcavage, L. E., Chivukula, R. R., Stefano, J., Kleaveland, B., Bartel, D. P. Tags: RESEARCH Source Type: research

The predicted RNA-binding protein regulome of axonal mRNAs [RESEARCH]
Neurons are morphologically complex cells that rely on the compartmentalization of protein expression to develop and maintain their cytoarchitecture. Targeting of RNA transcripts to axons is one of the mechanisms that allows rapid local translation of proteins in response to extracellular signals. 3' untranslated regions (UTRs) of mRNA are noncoding sequences that play a critical role in determining transcript localization and translation by interacting with specific RNA-binding proteins (RBPs). However, how 3' UTRs contribute to mRNA metabolism and the nature of RBP complexes responsible for these functions remain elusiv...
Source: Genome Research - August 15, 2023 Category: Genetics & Stem Cells Authors: Luisier, R., Andreassi, C., Fournier, L. M., Riccio, A. Tags: RESEARCH Source Type: research

Dissecting and improving gene regulatory network inference using single-cell transcriptome data [METHOD]
Single-cell transcriptome data has been widely used to reconstruct gene regulatory networks (GRNs) controlling critical biological processes such as development and differentiation. While a growing list of algorithms has been developed to infer GRNs using such data, achieving an inference accuracy consistently higher than random guessing has remained challenging. To address this, it is essential to delineate how the accuracy of regulatory inference is limited. Here, we systematically characterized factors limiting the accuracy of inferred GRNs and demonstrated that using pre-mRNA information can help improve regulatory inf...
Source: Genome Research - August 14, 2023 Category: Genetics & Stem Cells Authors: Xue, L., Wu, Y., Lin, Y. Tags: METHOD Source Type: research

Aligning distant sequences to graphs using long seed sketches [METHOD]
Sequence-to-graph alignment is crucial for applications such as variant genotyping, read error correction, and genome assembly. We propose a novel seeding approach that relies on long inexact matches rather than short exact matches, and show that it yields a better time-accuracy trade-off in settings with up to a 25% mutation rate. We use sketches of a subset of graph nodes, which are more robust to indels, and store them in a k-nearest neighbor index to avoid the curse of dimensionality. Our approach contrasts with existing methods and highlights the important role that sketching into vector space can play in bioinformati...
Source: Genome Research - August 14, 2023 Category: Genetics & Stem Cells Authors: Joudaki, A., Meterez, A., Mustafa, H., Groot Koerkamp, R., Kahles, A., Rätsch, G. Tags: METHOD Source Type: research

Efficient mapping of accurate long reads in minimizer space with mapquik [METHOD]
DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introd...
Source: Genome Research - August 14, 2023 Category: Genetics & Stem Cells Authors: Ekim, B., Sahlin, K., Medvedev, P., Berger, B., Chikhi, R. Tags: METHOD Source Type: research
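
The title above refers to "minimizer space". As a rough illustration of the general idea only, and not of mapquik's own algorithm (which the truncated abstract does not describe), the sketch below selects window minimizers from a sequence. The function name, the parameters k and w, and the lexicographic ordering are all assumptions made for this example; practical tools typically order k-mers by a hash value instead.

```python
def minimizers(seq, k=3, w=4):
    """Return sorted (position, k-mer) pairs chosen as window minimizers.

    Illustrative assumption: each window spans w consecutive k-mers and the
    lexicographically smallest k-mer is selected (leftmost on ties).
    """
    # All k-mers of the sequence, indexed by start position.
    kmer_list = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    # Slide a window over the k-mer list; the range is empty if seq is too short.
    for start in range(len(kmer_list) - w + 1):
        window = kmer_list[start:start + w]
        offset = min(range(w), key=lambda j: (window[j], j))
        selected.add((start + offset, window[offset]))
    return sorted(selected)


if __name__ == "__main__":
    print(minimizers("ACGTACGT", k=3, w=4))
```

Because adjacent windows often share the same minimizer, the selected set is much smaller than the full k-mer list, which is what makes working on minimizers rather than individual bases attractive for long, low-error reads.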

Telomerase-independent survival leads to a mosaic of complex subtelomere rearrangements in Chlamydomonas reinhardtii [RESEARCH]
Telomeres and subtelomeres, the genomic regions located at chromosome extremities, are essential for genome stability in eukaryotes. In the absence of the canonical maintenance mechanism provided by telomerase, telomere shortening induces genome instability. The landscape of the ensuing genome rearrangements is not accessible by short-read sequencing. Here, we leverage Oxford Nanopore Technologies long-read sequencing to survey the extensive repertoire of genome rearrangements in telomerase mutants of the model green microalga Chlamydomonas reinhardtii. In telomerase mutant strains grown for hundreds of generations, most c...
Source: Genome Research - August 14, 2023 Category: Genetics & Stem Cells Authors: Chaux, F., Agier, N., Garrido, C., Fischer, G., Eberhard, S., Xu, Z. Tags: RESEARCH Source Type: research

Single-cell methylation sequencing data reveal succinct metastatic migration histories and tumor progression models [METHOD]
Recent studies exploring the impact of methylation in tumor evolution suggest that although the methylation status of many of the CpG sites is preserved across distinct lineages, others are altered as the cancer progresses. Because changes in methylation status of a CpG site may be retained in mitosis, they could be used to infer the progression history of a tumor via single-cell lineage tree reconstruction. In this work, we introduce the first principled distance-based computational method, Sgootr, for inferring a tumor's single-cell methylation lineage tree and for jointly identifying lineage-informative CpG sites that ...
Source: Genome Research - August 11, 2023 Category: Genetics & Stem Cells Authors: Liu, Y., Li, X. C., Rashidi Mehrabadi, F., Schäffer, A. A., Pratt, D., Crawford, D. R., Malikic, S., Molloy, E. K., Gopalan, V., Mount, S. M., Ruppin, E., Aldape, K. D., Sahinalp, S. C. Tags: METHOD Source Type: research

Genomic sketching with multiplicities and locality-sensitive hashing using Dashing 2 [METHOD]
A genomic sketch is a small, probabilistic representation of the set of k-mers in a sequencing data set. Sketches are building blocks for large-scale analyses that consider similarities between many pairs of sequences or sequence collections. Although existing tools can easily compare tens of thousands of genomes, data sets can reach millions of sequences and beyond. Popular tools also fail to consider k-mer multiplicities, making them less applicable in quantitative settings. Here, we describe a method called Dashing 2 that builds on the SetSketch data structure. SetSketch is related to HyperLogLog (HLL) but discards use ...
Source: Genome Research - August 11, 2023 Category: Genetics & Stem Cells Authors: Baker, D. N., Langmead, B. Tags: METHOD Source Type: research
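
The abstract above describes a genomic sketch as a small, probabilistic representation of a set of k-mers. A minimal way to see what that means is a bottom-k MinHash sketch, shown below. This is a simpler, different technique from the SetSketch structure that Dashing 2 builds on, and it ignores k-mer multiplicities; the function names, the choice of BLAKE2b as the hash, and the default k and sketch size are assumptions for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers (length-k substrings) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (BLAKE2b is an arbitrary choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_k_sketch(seq, k=5, size=16):
    """Keep only the `size` smallest hash values of the k-mer set."""
    return sorted(hash64(x) for x in kmers(seq, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size=16):
    """Estimate the Jaccard similarity of two k-mer sets from their sketches."""
    a, b = set(sketch_a), set(sketch_b)
    # The smallest hashes of the union act as a random sample of the union.
    merged = sorted(a | b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in a and h in b)
    return shared / len(merged)


if __name__ == "__main__":
    s1 = bottom_k_sketch("ACGTACGTTGCATGCATGCAACGT")
    s2 = bottom_k_sketch("ACGTACGTTGCATGCATGGAACGT")
    print(jaccard_estimate(s1, s2))
```

Each sequence is reduced to at most `size` integers, so comparing two sketches costs the same regardless of how long the original sequences were. That fixed cost is what makes pairwise comparison across very large collections feasible.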