How to: bulk retrieval of archaeal genome sequences from the NCBI FTP site

While we’re on the topic of mistaking Archaea for Bacteria, here’s an issue with the NCBI FTP site that has long annoyed me and one workaround. Warning: I threw this together minutes ago and it’s not fully tested.

Let’s cut to the chase. In the NCBI FTP site, archaeal genome data is stored along with bacterial genomes in a single directory named Bacteria. Aside from the fact that this is taxonomically incorrect, it makes bulk retrieval of archaeal data rather difficult.

For example, I know that Methanococcoides burtonii is an archaeon and if I want to download its protein-coding genes (files ending with .ffn), I can do:

    wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcoides_burtonii_DSM_6242_uid58023/*.ffn

What if I want all available ffn files for Archaea? Somehow, I need to know which of the organism names in the Bacteria directory correspond to Archaea. So here is one approach.

1. Download the summary file prokaryotes.txt

Leaving aside issues with the term prokaryote, the NCBI FTP site contains a useful tab-delimited file which summarises current bacterial and archaeal genomes.

    wget ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt

2. Extract the Archaea

We could use shell tools (awk, grep, cut and so on) but I’m going to go with R. First, read in the file and examine the contents of the Group column:

    p <- read.table("prokaryotes.txt", header = T, sep = "\t", quote = "", comment...
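
For comparison, the shell-tools route mentioned in step 2 might look something like the sketch below. It is not the R approach the post goes on to use, and it rests on assumptions that are not confirmed here: that the first line of prokaryotes.txt is a header containing a field named exactly Group, that the first column holds the organism name, and that archaeal rows can be picked out by a Group value containing "archaeota". The function name archaea_names is made up for illustration.

```shell
# Sketch only: list organism names whose Group value matches a pattern.
# Assumptions (not verified against the real file): the header row has a
# field named exactly "Group", column 1 is the organism name, and archaeal
# groups contain the string "archaeota".
archaea_names() {
    # $1: tab-delimited summary file
    # $2: pattern to match in the Group column (defaults to "archaeota")
    awk -F '\t' -v pat="${2:-archaeota}" '
        NR == 1 {
            # Find the Group column by name rather than by fixed position
            for (i = 1; i <= NF; i++) if ($i == "Group") g = i
            if (!g) { print "no Group column found" > "/dev/stderr"; exit 1 }
            next
        }
        $g ~ pat { print $1 }
    ' "$1"
}
```

After sourcing the function, `archaea_names prokaryotes.txt | sort -u` would print the matching names, one per line. Checking the distinct Group values first, as the R code starts to do, is the safer way to decide which pattern to pass.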
Source: What You're Doing Is Rather Desperate - Category: Bioinformaticians Authors: Tags: bioinformatics genomics database ftp ncbi sequence Source Type: blogs