DATA SIMPLIFICATION: Doublet Lists

Over the next few weeks, I will be writing on topics related to my latest book, Data Simplification: Taming Information With Open Source Tools (release date March 17, 2016). I hope I can convince you that this is a book worth reading. Blog readers can use the discount code COMP315 at checkout for a 30% discount.

Yesterday's blog covered lists of single words. Today we'll do doublets. Doublet lists (lists of two-word terms that occur in common usage or in a body of text) are a highly underutilized resource. The special value of doublets is that single-word terms tend to have multiple meanings, while doublets tend to have a specific meaning.

Here are a few examples. The word "rose" can mean the past tense of "rise", or the flower. The doublet "rose garden" refers specifically to a place where the rose flower grows. The word "lead" can be a form of the verb "to lead", or it can refer to the metal. The term "lead paint" has a different meaning than "lead violinist".

Furthermore, every multiword term of length greater than two can be constructed with overlapping doublets, with each doublet having a specific meaning. For example, "Lincoln Continental convertible" = "Lincoln Continental" + "Continental convertible". The three words, "Lincoln", "Continental", and "convertible", all have different meanings under different circumstances. But the two doublets, "Lincoln Continental" and "Continental convertible" would be unusual to encounter on their own, and produ...
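As a rough illustration of the two ideas above (collecting the doublets that occur in a body of text, and building a longer term from overlapping doublets), here is a minimal Python sketch. It is not taken from the book. The function names `words`, `doublets`, and `doublet_counts` are hypothetical, and the tokenizer is an assumption: it lowercases the text, keeps only alphabetic runs, and ignores sentence boundaries, so some doublets will span punctuation.

```python
import re
from collections import Counter


def words(text):
    """Lowercase the text and return its alphabetic word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def doublets(tokens):
    """Return the overlapping two-word terms of a token sequence, in order."""
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]


def doublet_counts(text):
    """Count how often each doublet occurs in a body of text."""
    return Counter(doublets(words(text)))


if __name__ == "__main__":
    # A three-word term decomposes into two overlapping doublets.
    print(doublets("Lincoln Continental convertible".split()))

    # Doublet frequencies for a short sample sentence (made up for this sketch).
    sample = "The rose garden is near the rose garden gate"
    for term, count in doublet_counts(sample).most_common(3):
        print(count, term)
```

A real doublet list would probably also want to split the text into sentences first and filter out pairs made of very common words, but those choices depend on the corpus and are left out of this sketch.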
Source: Specified Life
Category: Information Technology
Tags: complexity, computer science, data analysis, data repurposing, data simplification, doublet lists, n-grams, open source tools, word lists
Source Type: blogs