An Analysis of Contributions to PubMed Commons
I recently saw a tweet floating by which included a link to some recent statistics from PubMed Commons, the NCBI service for commenting on scientific articles in PubMed. Perhaps it was this post at their blog. So I thought now would be a good time to write some code to analyse PubMed Commons data. The tl;dr version: here’s the Github repository and the RPubs report. For further details and some charts, read on. Currently, there is no access to PubMed Commons data via the NCBI Entrez API aside from a PubMed search filter to return articles that have comments. However, a Google search for “pubmed commons apiR...
Source: What You're Doing Is Rather Desperate - December 1, 2016 Category: Bioinformatics Authors: nsaunders Tags: publications R statistics comments ncbi pubmed commons Source Type: blogs

Putting data on maps using R: easier than ever
[Figure: New Zealand earthquake density, 2010 – November 2016] Using R to add data to maps has been pretty straightforward for a few years now. That said, it seems easier than ever to do things like use map APIs (e.g. Google, Open Street Map), overlay quite complex data visualisations (e.g. “heatmap-style” densities) and even generate animations. A couple of key R packages in this space: ggmap and gganimate. To illustrate, I’ve used data from the recent New Zealand earthquake to generate some static maps and an animation. Here’s the Github repository and a report published at RPubs. Thanks to Florian Tesc...
Source: What You're Doing Is Rather Desperate - November 24, 2016 Category: Bioinformatics Authors: nsaunders Tags: R statistics earthquakes ggplot2 maps new zealand visualisation Source Type: blogs

The y-axis: to zero or not to zero
This article tells us that “it’s OK not to start your y-axis at zero”, but then states that “column and bar charts should always have zeroed axes”. They use a chart from the Twitter IPO as an example. If you were waiting for the obligatory bad-mouthing of Excel, look no further than a follow-up tweet by the chart author, D Yanagizawa-Drott (@yanagiz, November 11, 2016): “@DocJohnG True. Also contact Microsoft Excel, let them know the default y-axis is simply unacceptable; lazy people like me need nudging.” Onwards. What if we use a line chart instead? ggplot(subset(elections.2, vote =...
Source: What You're Doing Is Rather Desperate - November 20, 2016 Category: Bioinformatics Authors: nsaunders Tags: R statistics charts graph politics visualisation Source Type: blogs

Let’s (briefly) revisit the Nobel API
It’s always nice when 12-month-old code runs without a hitch. Not sure why this did not become a Github repo first time around, but now it is: my RMarkdown code to generate a report using data from the Nobel Prize API. Now you too can generate a “gee, it’s all old white men” chart as seen in The Economist – Greying of the Nobel laureates, BBC News – Why are Nobel Prize winners getting older? and no doubt many other outlets every year, including me at RPubs, updated from 2015. As for myself, perhaps I should be offering my services to news outlets instead of publishing on blogs and obscur...
Source: What You're Doing Is Rather Desperate - October 9, 2016 Category: Bioinformatics Authors: nsaunders Tags: programming statistics api nobel Source Type: blogs

Data corruption using Excel: 12+ years and counting
This study examined 35 175 supplementary Excel data files from 3 597 published articles. Simple yet clever, isn’t it. I bet you wish you’d thought of doing that. I do. The conclusion: about 20% of articles have associated data files in which gene names have been corrupted by Excel. What if there is no tomorrow? There wasn’t one today. We tell you not to use Excel. You counter with a host of reasons why you have to use Excel. None of them are good reasons. I don’t know what else to say. Except to reiterate that probably 80% or more of the data analyst’s time is spent on data cleaning and a good...
Source: What You're Doing Is Rather Desperate - August 25, 2016 Category: Bioinformatics Authors: nsaunders Tags: bioinformatics publications software excel genes Source Type: blogs
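
To make the kind of corruption described above concrete, here is a minimal Ruby sketch that flags values in a gene-name column that look like Excel conversions. It is not the method used in the study; the constant and method names are made up for illustration, and the two patterns only cover the well-known cases (gene symbols such as SEPT2 or MARCH1 turned into dates like "2-Sep" or "1-Mar", and identifiers such as 2310009E13 turned into "2.31E+13").

```ruby
# Sketch only: flag gene-name values that look like Excel auto-conversions.
# All names here are hypothetical, chosen for illustration.

MONTHS = %w[Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec].freeze

# Date-like values such as "2-Sep" or "1-Mar" (e.g. from SEPT2, MARCH1).
DATE_LIKE = /\A\d{1,2}-(#{MONTHS.join('|')})\z/i

# Scientific-notation values such as "2.31E+13" (e.g. from 2310009E13).
SCI_LIKE = /\A\d+(\.\d+)?E\+\d+\z/

# True if the value matches one of the two conversion patterns.
def excel_mangled?(value)
  text = value.to_s.strip
  DATE_LIKE.match?(text) || SCI_LIKE.match?(text)
end

# Returns only the suspicious entries from a list of gene names.
def mangled_names(names)
  names.select { |name| excel_mangled?(name) }
end

column = ["TP53", "2-Sep", "BRCA1", "1-Mar", "2.31E+13"]
puts mangled_names(column).inspect
```

A check like this only tells you that damage has probably happened; it cannot recover the original symbol, since several genes can map to the same date.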

Hiatus, indefinite
May. No blog posts yet in 2016. “What’s going on Neil?” asked no-one at all. For anyone who may be wondering… Last November, I resigned from my position with my previous employer after almost 7 years. Just before Christmas, I was offered a position as a data scientist with Life Letters, a Sydney-based healthcare technology start-up. I started working there in early January and so far, it has been a terrific experience. Had I known how enjoyable it could be, I would have made a move like this 10 years ago. Career advice: there are many more jobs that can engage scientists and utilise their skills than academic resea...
Source: What You're Doing Is Rather Desperate - May 4, 2016 Category: Bioinformatics Authors: nsaunders Tags: career personal this blog Source Type: blogs

This blog in 2015
It must be time for the annual report, kindly generated by the people from WordPress at the end of each year. I’m pleased to see that I still averaged almost 2 posts a month, given that it was a difficult year in many ways (more on that later). Visitors from 202 countries! And if I never blogged again, it seems that people will want to learn about R’s apply functions for a long time to come. 2016 is going to be a bit “different”. Look out for the blog post which explains how and why, coming soon…
Source: What You're Doing Is Rather Desperate - December 29, 2015 Category: Bioinformatics Authors: nsaunders Tags: statistics this blog annual report wordpress Source Type: blogs

Variants + Spark = VariantSpark
Just a short note to alert you to a publication with my name on it. Great work by lead author and former colleague Aidan; I just did “the Gephi stuff”. If you’re interested in bioinformatics applications of Apache Spark, take a look at “VariantSpark: population scale clustering of genotype information”. Happy to report it is open access.
Source: What You're Doing Is Rather Desperate - December 29, 2015 Category: Bioinformatics Authors: nsaunders Tags: bioinformatics publications 1000 genomes apache machine learning spark variant Source Type: blogs

Novelty: an update
[Figure: PubMed articles containing “novel” in title or abstract, 1845 – 2014] A recent tweet from Marcus Munafo (@MarcusMunafo, October 7, 2015), “@neilfws I enjoyed this: https://t.co/ynyHRbgpLN Have you published (or are you thinking about publishing) this analysis anywhere?”, made me think (1) has it really been 5 years, (2) gee, my ggplot skills were dreadful back then and (3) did I really not know how to correct for the increase in total publications? So here is the update, at Github and a document at RPubs. “Novel” findings, as judged by the usage of that word in titles and abstracts really have ...
Source: What You're Doing Is Rather Desperate - October 21, 2015 Category: Bioinformatics Authors: nsaunders Tags: publications R ruby statistics literature ncbi pubmed rstats Source Type: blogs
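
One simple way to correct for the increase in total publications is to express matches as a rate per year, not a raw count. The Ruby sketch below shows the arithmetic only; the method name is hypothetical and the counts are made up for illustration, not real PubMed numbers, and it is not necessarily the correction used in the linked report.

```ruby
# Sketch only: normalise yearly match counts by yearly totals.
# Returns matches per 100,000 total articles for each year.
def rate_per_100k(matching, total)
  matching.each_with_object({}) do |(year, n), rates|
    denom = total[year]
    next if denom.nil? || denom.zero?   # skip years with no usable total
    rates[year] = (n * 100_000.0 / denom).round(1)
  end
end

# Made-up counts for illustration only.
novel = { 1990 => 200, 2010 => 1_500 }
all   = { 1990 => 400_000, 2010 => 1_000_000 }

rate_per_100k(novel, all).each do |year, rate|
  puts "#{year}: #{rate} per 100,000 articles"
end
```

With these invented numbers the raw count rises 7.5-fold while the rate rises only 3-fold, which is the distinction the normalisation is meant to expose.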

R and the Nobel Prize API
The Nobel Prizes. Love them? Hate them? Are they still relevant, meaningful? Go on admit it, you always imagined you would win one day. Whatever you think of them, the 2015 results are in. What’s more, the good people of the Nobel Foundation offer us free access to data via an API. I’ve published a document over at RPubs, showing some of the ways to access and analyse their data using R. Just to get you started:
    library(jsonlite)
    u <- "http://api.nobelprize.org/v1/laureate.json"
    nobel <- fromJSON(u)
In this post, just the highlights. Click the images for larger versions. 1. Gender The...
Source: What You're Doing Is Rather Desperate - October 20, 2015 Category: Bioinformatics Authors: nsaunders Tags: R statistics api ggplot2 laureates nobel prizes rest rstats Source Type: blogs
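
For readers who prefer Ruby to R, the sketch below shows the same first step (parse the JSON, then count laureates by gender) without touching the network. The sample document is hypothetical: it assumes the response has a top-level "laureates" array whose entries carry a "gender" field, and the records in it are placeholders, not real laureates.

```ruby
require "json"

# Hypothetical sample, shaped like what laureate.json is assumed to return.
# The records are placeholders, not real data.
SAMPLE = <<~JSON
  {"laureates": [
    {"id": "x1", "gender": "male",   "prizes": [{"year": "1901", "category": "physics"}]},
    {"id": "x2", "gender": "female", "prizes": [{"year": "1903", "category": "physics"}]},
    {"id": "x3", "gender": "male",   "prizes": [{"year": "1905", "category": "chemistry"}]},
    {"id": "x4", "prizes": [{"year": "1910", "category": "peace"}]}
  ]}
JSON

# Counts laureates per gender; entries without the field are counted as "unknown".
def gender_counts(json_text)
  laureates = JSON.parse(json_text).fetch("laureates", [])
  laureates.each_with_object(Hash.new(0)) do |laureate, counts|
    counts[laureate["gender"] || "unknown"] += 1
  end
end

puts gender_counts(SAMPLE).inspect
```

To run it against live data, the sample string would be replaced by the body of an HTTP response from the URL quoted in the post, assuming the endpoint is still available.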

Searching for duplicate resource names in PMC article titles
I enjoyed this article by Keith Bradnam, and the associated tweets, on the problem of duplicated names for bioinformatics software. I figured that to some degree at least, we should be able to search for such instances, since the titles of published articles that describe software often follow a particular pattern. There may even be a grammatical term for it, but I’ll call it the announcement colon: eDuS: Segmental Duplication Simulator Reveel: large-scale population genotyping using low-coverage sequencing data RNF: a general framework to evaluate NGS read mappers Hammock: A Hidden Markov model-based peptide cluste...
Source: What You're Doing Is Rather Desperate - September 16, 2015 Category: Bioinformatics Authors: nsaunders Tags: open access programming ruby statistics duplicates pmc Source Type: blogs
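
The “announcement colon” idea can be sketched as a regular expression plus a group-by. This is a minimal Ruby illustration, not the code from the post: the pattern, the method names and the assumption that a resource name is a short token with no spaces before the first colon are all mine, and the fourth title is invented to force a duplicate.

```ruby
# Sketch only: extract the "Name" part of "Name: description" titles and
# report names that appear in more than one title.

# Assumption: a resource name is 2-30 characters, no spaces, before the colon.
ANNOUNCEMENT_COLON = /\A([A-Za-z0-9][\w.+-]{1,29}):\s+\S/

# Returns the candidate resource name, or nil if the title does not match.
def resource_name(title)
  match = ANNOUNCEMENT_COLON.match(title.strip)
  match && match[1]
end

# Groups titles by case-insensitive name and keeps names seen in 2+ titles.
def duplicate_names(titles)
  titles
    .group_by { |title| resource_name(title)&.downcase }
    .reject { |name, group| name.nil? || group.size < 2 }
end

titles = [
  "eDuS: Segmental Duplication Simulator",
  "Reveel: large-scale population genotyping using low-coverage sequencing data",
  "RNF: a general framework to evaluate NGS read mappers",
  "rnf: a made-up second tool with the same name", # hypothetical title
  "A study of colons: no tool here"                # should not match
]

duplicate_names(titles).each do |name, group|
  puts "#{name}: #{group.size} titles"
end
```

Comparing names case-insensitively is a design choice: it catches near-duplicates such as RNF and rnf, at the cost of some false positives.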

Virus hosts from NCBI taxonomy: now at Github
After my previous post on extracting virus hosts from NCBI Taxonomy web pages, Pierre Lindenbaum (@yokofakun, June 2, 2015) wrote: “@neilfws idea: create a new flat-file-based database on github ?” An excellent idea and here’s my first attempt. Here’s a count of hosts. By the way NCBI, it’s environment.
    cut -f4 virus_host.tsv | sort | uniq -c
       1301
        283 algae
        114 archaea
       4509 bacteria
          8 diatom
         51 enviroment
        267 fungi
          1 fungi| plants| invertebrates
          4 human
        761 invertebrates
        181 invertebrates| plants
          7 invertebrates| verteb...
Source: What You're Doing Is Rather Desperate - June 8, 2015 Category: Bioinformatics Authors: nsaunders Tags: bioinformatics programming ruby github taxonomy virus Source Type: blogs
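
The host count above comes from a shell pipeline; roughly the same tally can be sketched in Ruby, which is handy if the counting needs to live alongside the scraping code. The method name and the sample rows are hypothetical (only the fourth column matters), and the real virus_host.tsv layout is assumed, not checked.

```ruby
# Sketch only: count values in one tab-separated column, roughly like
# `cut -f4 file | sort | uniq -c`.

# Returns a hash of value => count, ordered by value.
# column_index is zero-based, so 3 means the 4th column.
def count_column(lines, column_index = 3)
  counts = lines.each_with_object(Hash.new(0)) do |line, tally|
    # The -1 limit keeps trailing empty fields, so a blank host is counted as "".
    value = line.chomp.split("\t", -1)[column_index].to_s
    tally[value] += 1
  end
  counts.sort.to_h
end

# Hypothetical rows for illustration.
rows = [
  "1\tvirus A\tfamily X\tbacteria",
  "2\tvirus B\tfamily Y\talgae",
  "3\tvirus C\tfamily X\tbacteria",
  "4\tvirus D\tfamily Z\t"
]

count_column(rows).each do |host, n|
  puts format("%7d %s", n, host)
end
```

For a real file, `File.foreach("virus_host.tsv")` could be passed in place of the array, assuming the file is in the current directory.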

Virus hosts from NCBI Taxonomy web pages
A Biostars question asks whether the information about virus host on web pages like this one can be retrieved using Entrez Utilities. Pretty sure that the answer is no, unfortunately. Sometimes there’s no option but to scrape the web page, in the knowledge that this approach may break at any time. Here’s some very rough and ready Ruby code without tests or user input checks. It takes the taxonomy UID and returns the host, if there is one. No guarantees now or in the future!
    #!/usr/bin/ruby
    require 'nokogiri'
    require 'open-uri'
    def get_host(uid)
      url = "http://www.ncbi.nlm.nih.gov/Taxonomy/Brows...
Source: What You're Doing Is Rather Desperate - June 2, 2015 Category: Bioinformatics Authors: nsaunders Tags: bioinformatics programming ruby parsing taxonomy virus Source Type: blogs

Analysis of gene expression timecourse data using maSigPro
[Figure: ANXA11 expression in human smooth muscle aortic cells post-ILb1 exposure] About a year ago, I did a little work on a very interesting project which was trying to identify blood-based biomarkers for the early detection of stroke. The data included gene expression measurements using microarrays at various time points after the onset of ischemia (reduced blood supply). I had not worked with timecourse data before, so I went looking for methods and found a Bioconductor package, maSigPro, which did exactly what I was looking for. In combination with ggplot2, it generated some very attractive and informative plots of gene expressi...
Source: What You're Doing Is Rather Desperate - May 29, 2015 Category: Bioinformatics Authors: nsaunders Tags: bioinformatics statistics bioconductor geo masigpro microarray timecourse tutorial Source Type: blogs

Searching for the Steamer retroelement in the ocean metagenome
[Figure: Location of BLAST (tblastn) hits, Mya arenaria GagPol (AIE48224.1) vs GOS contigs] Last week, I was listening to episode 337 of the podcast This Week in Virology. It concerned a retrovirus-like sequence element named Steamer, which is associated with a transmissible leukaemia in soft shell clams. At one point the host and guests discussed the idea of searching for Steamer-like sequences in the data from ocean metagenomics projects, such as the Global Ocean Sampling expedition. Sounds like fun. So I made an initial attempt, using R/ggplot2 to visualise the results. To make a long story short: the initial BLAST results are not ...
Source: What You're Doing Is Rather Desperate - May 25, 2015 Category: Bioinformatics Authors: nsaunders Tags: bioinformatics statistics cancer clam GOS metagenomics ocean retroelement steamer twiv virus Source Type: blogs