Searching for duplicate resource names in PMC article titles

I enjoyed this article by Keith Bradnam, and the associated tweets, on the problem of duplicated names for bioinformatics software. I figured that, to some degree at least, we should be able to search for such instances, since the titles of published articles that describe software often follow a particular pattern. There may even be a grammatical term for it, but I’ll call it the announcement colon:

    eDuS: Segmental Duplication Simulator
    Reveel: large-scale population genotyping using low-coverage sequencing data
    RNF: a general framework to evaluate NGS read mappers
    Hammock: A Hidden Markov model-based peptide clustering algorithm to identify protein-interaction consensus motifs in large datasets

You get the idea. “XXX COLON a [METHOD] to [DO SOMETHING] using [SOME DATA].”

Let’s go in search of announcement colons, using titles from the PubMed Central dataset. You can find this mini-project at GitHub.

1. Download PMC data

I use wget. The compressed archives are still quite large (~3-5 GB), so this may take some time.

    wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.A-B.tar.gz
    wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.C-H.tar.gz
    wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.I-N.tar.gz
    wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.O-Z.tar.gz
    find ./ -name "*.tar.gz" -exec tar zxvf {} \;

2. Parse the titles

Now, of course there will be many article titles that contain a colon and have nothing to do with software names. W...
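As a rough illustration of the announcement-colon idea described above, here is a minimal Ruby sketch. It is not the author's script: the regular expression, the 25-character limit on names, and the helper names resource_name and duplicate_names are all assumptions made for this example. The first four titles are the ones quoted in the post; the last two are made up to show a non-match and a duplicate.

```ruby
# Sample titles. The first four are quoted in the post; the last two
# are invented for illustration only.
TITLES = [
  "eDuS: Segmental Duplication Simulator",
  "Reveel: large-scale population genotyping using low-coverage sequencing data",
  "RNF: a general framework to evaluate NGS read mappers",
  "Hammock: A Hidden Markov model-based peptide clustering algorithm to identify protein-interaction consensus motifs in large datasets",
  "Cancer genomics: a review of recent progress",   # made up: colon, but no single-token name
  "HAMMOCK: a hypothetical second tool"              # made up: clashes with Hammock
].freeze

# Assumed pattern: the title starts with one short token (no spaces),
# followed by a colon, whitespace, and some description text.
ANNOUNCEMENT_COLON = /\A(?<name>[A-Za-z][\w\-\.+]{1,24}):\s+(?<description>\S.*)\z/

# Returns the candidate resource name, or nil if the title does not
# look like an announcement-colon title.
def resource_name(title)
  match = ANNOUNCEMENT_COLON.match(title.strip)
  match && match[:name]
end

# Groups titles by lower-cased candidate name and keeps only the names
# that occur in two or more titles.
def duplicate_names(titles)
  titles
    .group_by { |title| resource_name(title)&.downcase }
    .reject { |name, group| name.nil? || group.size < 2 }
end

duplicate_names(TITLES).each do |name, group|
  puts "#{name}:"
  group.each { |title| puts "  #{title}" }
end
```

A pattern this simple will still let through single-word headings such as "Introduction: ...", so in practice the candidate names would need further filtering.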

https://nsaunders.wordpress.com/2015/09/16/searching-for-duplicate-resource-name...

Source: What You're Doing Is Rather Desperate - September 16, 2015
Category: Bioinformatics
Authors: nsaunders
Tags: open access programming ruby statistics duplicates pmc
Source Type: blogs
