Filtering Fasta Sequences using the #NCBI C++ API. My notebook.
In my previous post (http://plindenbaum.blogspot.com/2015/01/compiling-c-hello-world-program-using.html) I built a simple "Hello World" application using the NCBI C++ toolbox (http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/). Here, I'm going to extend the code in order to create a program that filters FASTA sequences by size. This new application, FastaFilterSize, needs three new arguments: (Source: YOKOFAKUN)
Source: YOKOFAKUN - January 29, 2015 Category: Bioinformatics Authors: Pierre Lindenbaum Source Type: blogs
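
The idea of filtering FASTA sequences by size can be sketched in a few lines. This is a minimal illustration in Python, not the NCBI C++ toolbox code from the post: the function names `read_fasta` and `filter_by_size` and the parameters `min_len` and `max_len` are assumptions made for the example, and they are not the three arguments the excerpt leaves unnamed.

```python
from typing import Iterable, Iterator, Optional, Tuple

Record = Tuple[str, str]  # (header without the leading ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None and line:
            # Sequence data may be wrapped over several lines.
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_by_size(
    records: Iterable[Record],
    min_len: int = 0,                # hypothetical parameter
    max_len: Optional[int] = None,   # hypothetical parameter; None = no upper bound
) -> Iterator[Record]:
    """Keep records whose sequence length lies within [min_len, max_len]."""
    for header, seq in records:
        if len(seq) < min_len:
            continue
        if max_len is not None and len(seq) > max_len:
            continue
        yield header, seq
```

Both functions are generators, so a large FASTA file can be streamed record by record, for example with `filter_by_size(read_fasta(open(path)), min_len=100)`, where `path` is whatever input file you have.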

How many referee reports do you write per year ?
If peer-review is a fundamental aspect of how science is done then we in academia are required to act as reviewers as part of our jobs. I assume there is no real argument as to whether peer-review is needed. Instead one can argue about when to do it, in the lifetime of a research project, and who should do it. Do we do it before the results are made public (pre-publication) or after (post-publication)? Do we have dedicated reviewers or should active scientists do this? The current dominant form of peer-review is done by active scientists and before disclosure of the results. This is all done anonymously and hidden from view. ...
Source: Evolution of Cellular Networks - January 16, 2015 Category: Cytology Source Type: blogs

State of the lab, year 2 – reaching steady state
CC BY, Jason Paul Smith. At the end of last year I wrote up a short description of what it was like to start a group at the EMBL-EBI. I thought it would be interesting to try to make it a yearly event so here is the second installment. It is always scary how fast a year passes by and it is interesting to note how my perspective of managing a research group is changing. During this year we said our first goodbyes as Vicky Kostiou (linkedin) finished her internship. We also welcomed several new members including Rahuman Sheriff (postdoc, linkedin), Haruna Imamura (postdoc, pubmed), Marta Strumillo (intern, linkedin) and Juan A ...
Source: Evolution of Cellular Networks - December 20, 2014 Category: Cytology Tags: academia state of the lab Source Type: blogs

Alumni from a small PhD program you never heard about got 4 ERC grants this year
Last Friday I heard the amazing news that our group will be awarded an ERC starting grant to support our ongoing studies of the function and evolution of protein phosphorylation. I will write more about this soon. I also got the very exciting news that two other fellow alumni from the GABBA PhD program were also awarded a starting grant this year. Ana Carvalho and Nuno Alves both have their groups at the IBMC in Porto. Earlier in the year, another GABBA alumnus Rui Costa was awarded an ERC consolidator grant to support his neuroscience research at the Champalimaud institute in Lisbon. Rui had previously also receive...
Source: Evolution of Cellular Networks - December 5, 2014 Category: Cytology Source Type: blogs

Visualizing @GenomeBrowser liftOver/chain files using animated #SVG
I wrote a tool to visualize some UCSC "chain/liftOver" files as an animated SVG file. This tool is available on github at: https://github.com/lindenb/jvarkit/wiki/LiftOverToSVG "A liftOver file is a chain file, where for each region in the genome the alignments of the best/longest syntenic regions are used to translate features from one version of a genome to another." SVG Elements and CSS styles (Source: YOKOFAKUN)
Source: YOKOFAKUN - October 29, 2014 Category: Bioinformatics Authors: Pierre Lindenbaum Source Type: blogs

Science publishers' pyramid structure and lock-in strategies
It is not recent news that AAAS will start a digital open access multidisciplinary journal. It is called Science Advances and it will be the 4th journal of the AAAS family of journals. As I have described in the past this is part of a trend in science publishing to cover a wider range of perceived impact. Publishers are aiming to have journals that are highly selective and that drive brand awareness but also have journals that can publish almost any article that passes a fairly low bar of being scientifically sound. This trend was spurred by the success of PLOS One that showed that it is financially viable to have large open acc...
Source: Evolution of Cellular Networks - October 16, 2014 Category: Cytology Tags: publishing Source Type: blogs

Using the Ensembl Regulatory Build to annotate some VCF files
via UCSC Genome Browser project announcements: "Data from the Ensembl Regulatory Build are now available in the UCSC Genome Browser as a public track hub for both hg19 and hg38. This track hub contains promoters and their flanking regions, enhancers, and many other regulatory features predicted across a number of cell lines using annotated segmentation states". For example looking at chr21: (Source: YOKOFAKUN)
Source: YOKOFAKUN - September 29, 2014 Category: Bioinformatics Authors: Pierre Lindenbaum Source Type: blogs

Collaborative postdoc fellowship opportunities
I interrupt this long blogging hiatus to point out two potential postdoc fellowship opportunities to work with our group at the EMBL-EBI. One is the EIPOD program, an EMBL-wide interdisciplinary program. For this fellowship the project is a collaboration with Nassos Typas (genetics) and Jeroen Krijgsveld's (proteomics) groups at the EMBL in Heidelberg. Successful candidates would be studying how Salmonella uses post-translational modification effector proteins to regulate and subvert the host cell. It is important to note that EIPOD applicants must be interested in doing both the computational and experimental as...
Source: Evolution of Cellular Networks - September 5, 2014 Category: Cytology Tags: positions Source Type: blogs

Looking for “>” in all the wrong places
Ever wondered whether the “>” symbol can, or does, appear in FASTA sequence headers at positions other than the start of the line? .@sebhtml I say yes (because the original format didn't really specify anything) but bet that this would break lots of code.— Keith Bradnam (@kbradnam) August 13, 2014 I have a recent copy of the NCBI non-redundant nucleotide (nt) BLAST database on my server, so let’s take a look. The files are in a directory which is also named nt: # %i = sequence ID; %t = sequence title blastdbcmd -db /data/db/nt/nt -entry all -outfmt '%i %t' | grep ">" > ~/Doc...
Source: What You're Doing Is Rather Desperate - August 13, 2014 Category: Bioinformaticians Authors: nsaunders Tags: bioinformatics fasta formats twitter Source Type: blogs
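
The check described in this excerpt can also be sketched in Python. This is a rough illustration only: it assumes the input is plain FASTA text, whereas the post pipes `blastdbcmd` output through `grep`, and the function name `headers_with_inner_gt` is hypothetical.

```python
from typing import Iterable, Iterator


def headers_with_inner_gt(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA header lines that contain ">" somewhere after position 0."""
    for line in lines:
        line = line.rstrip("\r\n")
        # Only header lines are of interest; skip sequence lines entirely.
        if line.startswith(">") and ">" in line[1:]:
            yield line
```

Any parser that splits records on every ">" character, rather than on ">" at the start of a line, would break on the headers this function reports.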

Venn figures go wrong
6-way Venn banana. I thought nothing could top the classic “6-way Venn banana”, featured in The banana (Musa acuminata) genome and the evolution of monocotyledonous plants. That is until I saw Figure 3 from Compact genome of the Antarctic midge is likely an adaptation to an extreme environment. 5-way Venn roadkill. What’s odd is that Figure 2 in the latter paper is a nice, clear R/ggplot2 creation, using facet_grid(), so someone knew what they were doing. That aside, the Antarctic midge paper is an interesting read; go check it out. This led to some amusing Twitter discussion which pointed me to *A New Rose :...
Source: What You're Doing Is Rather Desperate - August 12, 2014 Category: Bioinformaticians Authors: nsaunders Tags: bioinformatics genomics humour statistics graphics venn visualisation Source Type: blogs

When life gives you coloured cells, make categories
Let’s start by making one thing clear. Using coloured cells in Excel to encode different categories of data is wrong. Next time colleagues explain excitedly how “green equals normal and red = tumour”, you must explain that (1) they have sinned and (2) what they meant to do was add a column containing the words “normal” and “tumour”. I almost hesitate to write this post but…we have to deal with the world as it is, not as we would like it to be. So in the interests of just getting the job done: here’s one way to deal with coloured cells in Excel, should someone send them ...
Source: What You're Doing Is Rather Desperate - August 5, 2014 Category: Bioinformaticians Authors: nsaunders Tags: bioinformatics programming statistics excel spreadsheet Source Type: blogs

Hell is other people’s data
“Take a look at the TP53 mutation database”, my colleague suggested. “OK then, I will”, I replied. I present what follows as “a typical day in the life of a bioinformatician”. I click through to the Download page. At first sight, it’s reasonably promising. Excel files – not perfect, but beats a PDF – and ooh, TXT files. OK, I grab the “US version” (which uses decimal points, not commas for decimal numbers) of the uncurated database: wget http://p53.fr/TP53_database_download/TP53_tumor_database/TP53_US/UMDTP53_all_2012_R1_US.txt.zip unzip UMDTP53_all_2012_R...
Source: What You're Doing Is Rather Desperate - July 30, 2014 Category: Bioinformaticians Authors: nsaunders Tags: bioinformatics computing research diary database parsing tp53 Source Type: blogs

writing #rstats bindings for bwa-mem, my notebook.
I wanted to learn how to bind a C library to R, so I've created the following bindings for BWA. My code is available on github at: https://github.com/lindenb/rbwa Most of the C code was inspired by Heng Li's code https://github.com/lh3/bwa/blob/master/example.c. A short description of the C code: in https://github.com/lindenb/rbwa/blob/master/rbwa.c: struct RBwaHandler. This structure holds a (Source: YOKOFAKUN)
Source: YOKOFAKUN - July 30, 2014 Category: Bioinformaticians Authors: Pierre Lindenbaum Source Type: blogs

Including the hash for the current git-commit in a C program
Say you wrote the following simple C program: It includes a file "githash.h" that would contain the hash for the current commit in Git: Because you're working with a Makefile, the file "githash.h" is generated by invoking 'git rev-parse HEAD': the file "githash.h" looks like this: But wait, it is not that simple: once the file 'githash.h' has been created it won't be updated by (Source: YOKOFAKUN)
Source: YOKOFAKUN - July 29, 2014 Category: Bioinformaticians Authors: Pierre Lindenbaum Source Type: blogs

A new low in “databases”: the PDF
I’ve had a half-formed, but not very interesting blog post in my head for some months now. It’s about a conversation I had with a PhD student, around 10 years ago, after she went to a bioinformatics talk titled “Excel is not a database” and how she laughed as I’d been telling her that “for years already”. That’s basically the post so as I say, not that interesting, except as an illustration that we’ve been talking about this stuff for a long time (and little has changed). HEp-2 or not HEp2? Anyway, we have something better. I was exploring PubMed Commons, which is becomi...
Source: What You're Doing Is Rather Desperate - July 28, 2014 Category: Bioinformaticians Authors: nsaunders Tags: humour publications cancer cell lines database Source Type: blogs