Finally, NCBI Genomes recognises Archaea*
I’ve been complaining about this for years. They fixed it. The NCBI have reorganised their genomes FTP site and, finally, Archaea are not lumped in with Bacteria.
GenBank: ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/archaea/
RefSeq: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/
Archaea are still included in the ASSEMBLY_BACTERIA directory; hopefully that’s next on the list. [*] To be fair, they’ve always recognised Archaea, just not in a form that makes downloads convenient.
Source: What You're Doing Is Rather Desperate - August 26, 2014 Category: Bioinformaticians Authors: nsaunders Tags: bioinformatics genomics archaea bacteria ftp genomes ncbi Source Type: blogs

Looking for “>” in all the wrong places
Ever wondered whether the “>” symbol can, or does, appear in FASTA sequence headers at positions other than the start of the line?
Keith Bradnam (@kbradnam), August 13, 2014: “.@sebhtml I say yes (because the original format didn't really specify anything) but bet that this would break lots of code.”
I have a recent copy of the NCBI non-redundant nucleotide (nt) BLAST database on my server, so let’s take a look. The files are in a directory which is also named nt:
# %i = sequence ID; %t = sequence title
blastdbcmd -db /data/db/nt/nt -entry all -outfmt '%i %t' | grep ">" > ~/Doc...
Source: What You're Doing Is Rather Desperate - August 13, 2014 Category: Bioinformaticians Authors: nsaunders Tags: bioinformatics fasta formats twitter Source Type: blogs
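The same question can be asked of a plain FASTA file rather than a BLAST database. What follows is only a minimal sketch in Python, not the approach used in the post: the function name `headers_with_inner_gt` and the example records are made up for illustration. It reports header lines that contain “>” anywhere after the leading one.

```python
import io


def headers_with_inner_gt(handle):
    """Yield (line_number, header) for FASTA header lines that contain
    '>' somewhere after the leading '>' character."""
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        # Header lines start with '>'; look for another one in the rest.
        if line.startswith(">") and ">" in line[1:]:
            yield line_number, line


# Made-up records for illustration only; these are not real nt entries.
example = io.StringIO(
    ">seq1 ordinary header\n"
    "ACGT\n"
    ">seq2 promoter->gene fusion\n"
    "GGCC\n"
)

hits = list(headers_with_inner_gt(example))
print(hits)
```

Any open text file handle can be passed in place of the `io.StringIO` object. Note that the blastdbcmd output format in the post prints ID and title without a leading “>”, which is why a plain grep is enough there.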

Venn figures go wrong
[Image: 6-way Venn banana] I thought nothing could top the classic “6-way Venn banana”, featured in The banana (Musa acuminata) genome and the evolution of monocotyledonous plants. That is, until I saw Figure 3 from Compact genome of the Antarctic midge is likely an adaptation to an extreme environment. [Image: 5-way Venn roadkill] What’s odd is that Figure 2 in the latter paper is a nice, clear R/ggplot2 creation, using facet_grid(), so someone knew what they were doing. That aside, the Antarctic midge paper is an interesting read; go check it out. This led to some amusing Twitter discussion which pointed me to *A New Rose :...
Source: What You're Doing Is Rather Desperate - August 12, 2014 Category: Bioinformaticians Authors: nsaunders Tags: bioinformatics genomics humour statistics graphics venn visualisation Source Type: blogs

When life gives you coloured cells, make categories
Let’s start by making one thing clear. Using coloured cells in Excel to encode different categories of data is wrong. Next time colleagues explain excitedly how “green equals normal and red = tumour”, you must explain that (1) they have sinned and (2) what they meant to do was add a column containing the words “normal” and “tumour”. I almost hesitate to write this post but…we have to deal with the world as it is, not as we would like it to be. So in the interests of just getting the job done: here’s one way to deal with coloured cells in Excel, should someone send them ...
Source: What You're Doing Is Rather Desperate - August 5, 2014 Category: Bioinformaticians Authors: nsaunders Tags: bioinformatics programming statistics excel spreadsheet Source Type: blogs

Hell is other people’s data
“Take a look at the TP53 mutation database”, my colleague suggested. “OK then, I will”, I replied. I present what follows as “a typical day in the life of a bioinformatician”. I click through to the Download page. At first sight, it’s reasonably promising. Excel files (not perfect, but beats a PDF) and ooh, TXT files. OK, I grab the “US version” (which uses decimal points, not commas, for decimal numbers) of the uncurated database:
wget http://p53.fr/TP53_database_download/TP53_tumor_database/TP53_US/UMDTP53_all_2012_R1_US.txt.zip
unzip UMDTP53_all_2012_R...
Source: What You're Doing Is Rather Desperate - July 30, 2014 Category: Bioinformaticians Authors: nsaunders Tags: bioinformatics computing research diary database parsing tp53 Source Type: blogs

A new low in “databases”: the PDF
I’ve had a half-formed, but not very interesting, blog post in my head for some months now. It’s about a conversation I had with a PhD student, around 10 years ago, after she went to a bioinformatics talk titled “Excel is not a database”, and how she laughed as I’d been telling her that “for years already”. That’s basically the post so, as I say, not that interesting, except as an illustration that we’ve been talking about this stuff for a long time (and little has changed). [Image: HEp-2 or not HEp2?] Anyway, we have something better. I was exploring PubMed Commons, which is becomi...
Source: What You're Doing Is Rather Desperate - July 28, 2014 Category: Bioinformaticians Authors: nsaunders Tags: humour publications cancer cell lines database Source Type: blogs

Tool tip: dropbox-restore
I’m currently rather sleep-deprived and prone to doing stupid things. Like this, for example: rsync -av ~/Dropbox /path/to/backup/directory/ where the directory /path/to/backup/directory already contains a much older Dropbox directory. So when I set up a new machine, install Dropbox and copy the Dropbox directory back to its default location – hey! What happened to all my space? What are all these old files? Oh wait…I forgot to delete: rsync -av --delete ~/Dropbox /path/to/backup/directory/ Now, files can be restored of course, but not when there are thousands of them and I don’t even know what...
Source: What You're Doing Is Rather Desperate - July 23, 2014 Category: Bioinformaticians Authors: nsaunders Tags: computing research diary software dropbox github Source Type: blogs

My own “404 not found”: making amends using Github
Conclusion: I’m now in a position to run this analysis at regular intervals – probably monthly – and push the results to Github. Watch this space for any interesting developments. Making this kind of pipeline available and reproducible by others is less easy. If you have access to a machine with all the prerequisites, clone the repository, replace the symlinks to database directories with real directories, change to the directory code/ruby and run the rake tasks – well, you might be able to do the same thing. But you’d probably rather sequence the human microbiome.
Source: What You're Doing Is Rather Desperate - July 21, 2014 Category: Bioinformaticians Authors: nsaunders Tags: bioinformatics research diary ruby archaea database est github Source Type: blogs

Converting a spreadsheet of SMILES: my first OSM contribution
I’ve long admired the work of the Open Source Malaria Project. Unfortunately, time and “day job” constraints prevent me from being as involved as I’d like. So I was happy to make a small contribution recently in response to this request for help:
Alice Williamson (@all_isee), June 24, 2014: “Can anyone help @O_S_M to convert this spreadsheet ( malaria.ourexperiment.org/biological_dat…) into chemical structures with data? #openscience #realtimechem”
Note: this all works fine under Linux; there seem to be some issues with Open Babel library files under OSX. First step: make that data usab...
Source: What You're Doing Is Rather Desperate - July 1, 2014 Category: Bioinformaticians Authors: nsaunders Tags: open science programming statistics cheminformatics conversion malaria osm smiles Source Type: blogs

utils4bioinformatics: all those “little snippets” in one place
Over the years, I’ve written a lot of small “utility scripts”. You know the kind of thing. Little code snippets that facilitate research, rather than generate research results. For example: just what are the fields that you can use to qualify Entrez database searches? Typically, they end up languishing in long-forgotten Dropbox directories. Sometimes, the output gets shared as a public link. No longer! As of today, “little code snippets that do (hopefully) useful things” have a new home at Github. Also as of today: there’s not much there right now, just the aforementioned Entrez database...
Source: What You're Doing Is Rather Desperate - June 23, 2014 Category: Bioinformaticians Authors: nsaunders Tags: bioinformatics research diary software code github repository Source Type: blogs

This is why code written by scientists gets ugly
There’s a lot of discussion around why code written by self-taught “scientist programmers” rarely follows what a trained computer scientist would consider “best practice”. Here’s a recent post on the topic. One answer: we begin with exploratory data analysis and never get around to cleaning it up. An example. For some reason, a researcher (let’s call him “Bob”) becomes interested in a particular dataset in the GEO database. So Bob opens the R console and uses the GEOquery package to grab the data: Update: those of you commenting “should have used Python insteadR...
Source: What You're Doing Is Rather Desperate - May 14, 2014 Category: Bioinformaticians Authors: nsaunders Tags: programming research diary statistics Source Type: blogs

When is db=all not db=all? When you use Entrez ELink.
Just a brief technical note. I figured that for a given compound in PubChem, it would be interesting to know whether that compound had been used in a high-throughput experiment, which you might find in GEO. Very easy using the E-utilities, as implemented in the R package rentrez:
library(rentrez)
links <- entrez_link(dbfrom = "pccompound", db = "gds", id = "62857")
length(links$pccompound_gds)
# [1] 741
Browsing the rentrez documentation, I note that db can take the value “all”. Sounds useful!
links <- entrez_link(dbfrom = "pccompound", db = "all", id =...
Source: What You're Doing Is Rather Desperate - April 29, 2014 Category: Bioinformaticians Authors: nsaunders Tags: bioinformatics programming research diary api elink entrez ncbi rentrez Source Type: blogs
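For readers who want to see what such a query looks like at the HTTP level, here is a minimal sketch in Python that only builds the ELink request URLs for the two cases in the excerpt, db = "gds" and db = "all". It does not contact NCBI, and it is an illustration rather than what rentrez does internally; the helper name `elink_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ELink endpoint.
EUTILS_ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def elink_url(dbfrom, db, uid):
    """Build (but do not fetch) an ELink request URL.

    dbfrom: source database, db: target database, uid: record ID
    in the source database.
    """
    params = {"dbfrom": dbfrom, "db": db, "id": uid}
    return EUTILS_ELINK + "?" + urlencode(params)


# The two queries from the excerpt: one target database, then "all".
url_gds = elink_url("pccompound", "gds", "62857")
url_all = elink_url("pccompound", "all", "62857")
print(url_gds)
print(url_all)
```

Fetching each URL and comparing which link sets come back for the two requests is one way to check whether db = "all" really covers every database.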

On the road: CSS and eResearch Conference 2014
Next week I’ll be in Melbourne for one of my favourite meetings, the annual Computational and Simulation Sciences and eResearch Conference. The main reason for my visit is the Bioinformatics FOAM workshop. Day 1 (March 27) is not advertised since it is an internal CSIRO day, but I’ll be presenting a talk titled “SQL, noSQL or no database at all? Are databases still a core skill?”. Day 2 (March 28) is open to all and I’ll be talking about “Learning from complete strangers: social networking for bioinformaticians”. I imagine these and other talks will appear on Slideshare soon, at bo...
Source: What You're Doing Is Rather Desperate - March 20, 2014 Category: Bioinformaticians Authors: nsaunders Tags: bioinformatics computing meetings travel conference foam melbourne Source Type: blogs

“Advance” access and DOIs: what’s the problem?
[Image: A DOI, this morning] When I arrive at work, the first task for the day is “check feeds”. If I’m lucky, in the “journal TOCs” category, there will be an abstract that looks interesting, like this one on the left (click for larger version). Sometimes, the title is a direct link to the article at the journal website. Often though, the link is a Digital Object Identifier or DOI. Frequently, when the article is labelled as “advance access” or “early”, clicking on the DOI link leads to a page like the one below on the right. [Image: DOI #fail] In the grand scheme of things I suppose th...
Source: What You're Doing Is Rather Desperate - March 9, 2014 Category: Bioinformaticians Authors: nsaunders Tags: doi journals publishing Source Type: blogs

A minor update to my “apply functions” post
One of my more popular posts is A brief introduction to “apply” in R. Come August, it will be four years old. Technology moves on, old blog posts do not. So: thanks to BioStar user zx8754 for pointing me to this Stack Overflow post, in which someone complains that the code in the post does not work as described. The by example is now fixed. Side note: I often find “contact the author” is the most direct approach to solving this kind of problem ;) always happy to be contacted.
Source: What You're Doing Is Rather Desperate - February 27, 2014 Category: Bioinformaticians Authors: nsaunders Tags: R statistics this blog biostar stack overflow stackexchange Source Type: blogs