Searching for duplicate resource names in PMC article titles
I enjoyed this article by Keith Bradnam, and the associated tweets, on the problem of duplicated names for bioinformatics software.
I figured that to some degree at least, we should be able to search for such instances, since the titles of published articles that describe software often follow a particular pattern. There may even be a grammatical term for it, but I’ll call it the announcement colon:
eDuS: Segmental Duplication Simulator
Reveel: large-scale population genotyping using low-coverage sequencing data
RNF: a general framework to evaluate NGS read mappers
Hammock: A Hidden Markov model-based peptide clustering algorithm to identify protein-interaction consensus motifs in large datasets
You get the idea. “XXX COLON a [METHOD] to [DO SOMETHING] using [SOME DATA].”
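As a rough illustration of that pattern, here is a minimal Ruby sketch. It is an assumption about how such a match could be written, not necessarily the code used in this project: the regular expression, the 2-30 character limit on the name, and the helper names candidate_name and duplicate_names are all hypothetical.

```ruby
# Hypothetical pattern for an "announcement colon" title:
# a short name with no spaces, then a colon, then a description.
# The 2-30 character limit is an arbitrary assumption.
ANNOUNCEMENT_COLON = /\A(?<name>[^\s:]{2,30}):\s+(?<description>\S.*)\z/

# Return the candidate resource name, or nil if the title does not match.
def candidate_name(title)
  match = ANNOUNCEMENT_COLON.match(title.strip)
  match && match[:name]
end

# Group titles by case-insensitive candidate name and keep only
# the names that occur in more than one title.
def duplicate_names(titles)
  titles
    .group_by { |t| candidate_name(t)&.downcase }
    .reject { |name, group| name.nil? || group.size < 2 }
end

titles = [
  "eDuS: Segmental Duplication Simulator",
  "RNF: a general framework to evaluate NGS read mappers",
  "A title with no colon at all"
]

# Print the candidate name (or nil) next to each title.
titles.each { |t| puts "#{candidate_name(t).inspect}\t#{t}" }
```

Calling duplicate_names on a full list of titles would return a hash keyed by lower-cased name, with each value holding the titles that share it. A pattern this loose will also catch plenty of titles that have nothing to do with software, so the results still need filtering by eye.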
Let’s go in search of announcement colons, using titles from the PubMed Central dataset. You can find this mini-project on GitHub.
1. Download PMC data
I use wget. The compressed archives are still quite large (roughly 3-5 GB), so the download may take some time.
wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.A-B.tar.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.C-H.tar.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.I-N.tar.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.O-Z.tar.gz
find ./ -name "*.tar.gz" -exec tar zxvf {} \;
2. Parse the titles
Now, of course, there will be many article titles that contain a colon and have nothing to do with software names. W...
Source: What You're Doing Is Rather Desperate - Category: Bioinformatics Authors: nsaunders Tags: open access programming ruby statistics duplicates pmc Source Type: blogs