Virus hosts from NCBI Taxonomy web pages

A Biostars question asks whether the virus host information shown on web pages like this one can be retrieved using Entrez Utilities. Unfortunately, I'm pretty sure that the answer is no.

Sometimes there’s no option but to scrape the web page, in the knowledge that this approach may break at any time. Here’s some very rough and ready Ruby code without tests or user input checks. It takes the taxonomy UID and prints the host, if there is one. No guarantees now or in the future!

    #!/usr/bin/ruby
    require 'nokogiri'
    require 'open-uri'

    def get_host(uid)
      url = "http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&lvl=3&lin=f&keep=1&srchmode=1&unlock&id=" + uid.to_s
      # URI.open rather than open: Kernel#open no longer fetches URLs as of Ruby 3.0
      doc = Nokogiri::HTML.parse(URI.open(url).read)
      # split the contents of every table cell into its <br>-separated fragments
      data = doc.xpath("//td").collect { |x| x.inner_html.split("<br>") }.flatten
      data.each do |e|
        # print whatever follows the "Host:" label
        puts $1 if e =~ /Host:\s+<\/em>(.*?)$/
      end
    end

    get_host(ARGV[0])

Save as taxhost.rb and supply the UID as first argument. Note: I chose 12345 off the top of my head, imagining that it was unlikely to be a virus and would make a good negative test. Turns out to be a phage!

    $ ruby taxhost.rb 12249
    plants
    $ ruby taxhost.rb 12721
    vertebrates
    $ ruby taxhost.rb 11709
    vertebrates| human
    $ ruby taxhost.rb 12345
    bacteria
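Since the page layout may change at any time, it can help to keep the pattern matching separate from the download, so it can be checked against a saved snippet of HTML. What follows is only a sketch: the method name `extract_hosts` and the sample markup are made up for illustration, and the markup is simply what the regular expression in the script above appears to expect, not a verified copy of the NCBI page. It uses plain string methods rather than Nokogiri, and also splits values such as `vertebrates| human` into separate hosts.

```ruby
# Hypothetical helper, not part of the original script: pull host names
# out of a string of HTML without touching the network.
def extract_hosts(html)
  # Capture the text after the "Host:" label, up to the next tag.
  html.scan(%r{Host:\s*</em>\s*([^<]+)}).flatten.flat_map do |hosts|
    # "vertebrates| human" becomes ["vertebrates", "human"]
    hosts.split("|").map(&:strip).reject(&:empty?)
  end.uniq
end

# Assumed markup, modelled on the regex in the script above.
sample_html = '<td><em>Rank: </em>species<br>' \
              '<em>Host: </em>vertebrates| human<br>' \
              '<em>Genetic code: </em>1</td>'

p extract_hosts(sample_html)  # prints ["vertebrates", "human"]
```

A page with no host line gives an empty array, which is easier to test for than no output at all.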
Source: What You're Doing Is Rather Desperate. Category: Bioinformatics. Tags: bioinformatics, programming, ruby, parsing, taxonomy, virus. Source type: blogs.