XML parsing made easy: is that podcast getting longer?

Sometime in 2009, I began listening to a science podcast titled This Week in Virology, or TWiV for short. I thought it was pretty good and listened regularly until sometime in 2016, when it seemed that most episodes were approaching two hours in duration. I listen to several podcasts while commuting to and from work, which takes up about 10 hours of my week, so I found it hard to justify two hours for one podcast, no matter how good. Were the episodes really getting longer over time? Let’s find out using R.

One thing I’ve learned as a data scientist: management want to see the key points first. So here it is:

Technical people want to see how we got there. It turns out that the podcast has an RSS feed (in XML format), containing detailed information about every episode to date. Once there was the XML package, which was OK, but now we have rvest and xml2, which make life much easier. Full code and a markdown report can be found at GitHub, so don’t copy and paste from here. Here are the highlights.

First, as data for each episode is found between item tags, it can be extracted from the feed like so:

    library(rvest)
    feed_items <- read_xml("http://twiv.microbeworld.libsynpro.com/twiv") %>% xml_nodes("item")

Podcast date is stored in pubDate and duration in itunes:duration, so we can extract the text between those tags:

    feed_items %>% xml_nodes("pubDate") %>% xml_text()
    feed_items %>% xml_nodes(&quo...
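The excerpt above breaks off mid-snippet, so here is a minimal, self-contained sketch of the same idea rather than the author's actual code (that lives in the GitHub repository mentioned above). It uses xml2 directly with XPath instead of the rvest pipeline, and it parses a small made-up feed held in a string so that it runs without network access. The feed contents, the helper names `parse_pub_date` and `duration_to_minutes`, and the assumption that durations appear as `HH:MM:SS`, `MM:SS` or plain seconds are all illustrative and not taken from the original post.

```r
library(xml2)

# Hypothetical two-item feed standing in for the real RSS document
feed_xml <- '<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
  <channel>
    <title>Example podcast</title>
    <item>
      <title>Episode 1</title>
      <pubDate>Sun, 04 Jan 2009 12:00:00 +0000</pubDate>
      <itunes:duration>45:30</itunes:duration>
    </item>
    <item>
      <title>Episode 2</title>
      <pubDate>Sun, 03 Jan 2016 12:00:00 +0000</pubDate>
      <itunes:duration>01:52:15</itunes:duration>
    </item>
  </channel>
</rss>'

doc <- read_xml(feed_xml)
ns <- xml_ns(doc)  # picks up the itunes prefix declared on the root element

# One node per episode
feed_items <- xml_find_all(doc, "//item", ns)

# Text of the date and duration tags, one value per item
pub_date <- xml_text(xml_find_first(feed_items, "./pubDate", ns))
duration <- xml_text(xml_find_first(feed_items, "./itunes:duration", ns))

# Pull day, month and year out of an RFC 822 style date.
# month.abb is always English, so this does not depend on the locale.
parse_pub_date <- function(x) {
  m <- regmatches(x, regexec("([0-9]{1,2}) ([A-Za-z]{3}) ([0-9]{4})", x))
  iso <- vapply(m, function(p) {
    sprintf("%s-%02d-%02d", p[4], match(p[3], month.abb), as.integer(p[2]))
  }, character(1))
  as.Date(iso)
}

# Convert "HH:MM:SS", "MM:SS" or plain seconds to minutes
duration_to_minutes <- function(x) {
  vapply(strsplit(x, ":", fixed = TRUE), function(p) {
    p <- as.numeric(p)
    # Weight the fields from the right: seconds, minutes, hours
    sum(rev(p) * 60^(seq_along(p) - 1)) / 60
  }, numeric(1))
}

episodes <- data.frame(
  date    = parse_pub_date(pub_date),
  minutes = duration_to_minutes(duration)
)
print(episodes)
```

With a data frame like `episodes` built from the real feed, plotting `minutes` against `date` would be one way to check whether episodes are getting longer. Passing `ns` explicitly keeps the `itunes:` prefix in the XPath expression unambiguous.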
Source: What You're Doing Is Rather Desperate - Category: Bioinformatics Authors: Tags: R statistics parsing podcast rss twiv xml Source Type: blogs