ContentMine: High-Throughput Extractions of Facts from Scientific Articles

The NIH Frontiers in Data Science Lecture Series: "ContentMine: High-Throughput Extractions of Facts from Scientific Articles". Dr. Peter Murray-Rust, University of Cambridge and founder of the ContentMine Project.

Millions of scientific articles are published each year, but much of their content is not accessible because it is not machine-readable or is hidden in supplemental information or bitmapped figures. Content mining (text and data mining, TDM) turns this semi-structured material into semantic form (XML) and annotates it with known metadata. Europe PMC, which works closely with PubMed Central, provides an API for rapid full-text search and retrieval. ContentMine software then extracts "facts" with a number of "facet" tools: word search, regexes, bespoke text tools, chemical NLP (OSCAR), and tools for certain diagram types (such as phylogenetic trees). The "facts" can be mapped onto triples and incorporated into Wikidata, or used to annotate the text to help human readers. Common facets are often supported by dictionaries, and anyone with a list of words can extend them. Data can also be extracted from common diagram types using heuristics. The vision is to develop a communal open toolbox that can be extended and validated for a wide range of purposes. However, many rightsholders are trying to control TDM through technical and legal means. A recent legal exception in the U.K. allows text mining of facts for scientific research. The University of C...
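The facet idea in the abstract (a dictionary of words plus a regex, with each hit mapped onto a triple) can be illustrated with a minimal Python sketch. This is not ContentMine's own code or API; the dictionary contents, the regex, the predicate names, and the document identifier are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical dictionary facet: any plain list of words could be used here.
SPECIES_DICTIONARY = ["Escherichia coli", "Homo sapiens", "Mus musculus"]

# Hypothetical regex facet: simple temperature mentions such as "37 °C".
TEMPERATURE_PATTERN = re.compile(r"\b(\d+(?:\.\d+)?)\s*°C")


def extract_facts(doc_id, text):
    """Return (subject, predicate, object, offset) tuples found in text."""
    facts = []
    # Dictionary facet: exact, case-sensitive match of each listed term.
    for term in SPECIES_DICTIONARY:
        for match in re.finditer(re.escape(term), text):
            facts.append((doc_id, "mentions_species", term, match.start()))
    # Regex facet: capture the numeric value of each temperature mention.
    for match in TEMPERATURE_PATTERN.finditer(text):
        facts.append(
            (doc_id, "mentions_temperature_celsius", match.group(1), match.start())
        )
    return facts


def to_triples(facts):
    """Drop the character offset, keeping subject-predicate-object triples."""
    return [(subject, predicate, obj) for subject, predicate, obj, _ in facts]


if __name__ == "__main__":
    sample = "Escherichia coli was grown at 37 °C."
    for triple in to_triples(extract_facts("example-article-1", sample)):
        print(triple)
```

The character offsets kept by extract_facts are what would allow the same hits to be used for annotating the text for human readers, while to_triples gives the form suited to a triple store. A real pipeline would need normalisation, disambiguation, and validation that this sketch omits.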

http://videocast.nih.gov/summary.asp?live=20292

Source: Videocast - All Events - November 14, 2016
Category: Journals (General)
Tags: Upcoming Events
Source Type: video