DATA SIMPLIFICATION: Poor Identifiers, Horrific Consequences

Over the next few weeks, I will be writing on topics related to my latest book, Data Simplification: Taming Information With Open Source Tools (release date March 23, 2016). I hope I can convince you that this is a book worth reading. Blog readers can use the discount code COMP315 for a 30% discount at checkout.

All information systems, all databases, and all good collections of data are best envisioned as identifier systems to which data (belonging to the identifier) can be added over time. If the system is corrupted (e.g., multiple identifiers for the same object, or data belonging to one object incorrectly attached to other objects), then the system has no value. You can't trust any of the individual records, and you can't trust any of the analyses performed on collections of records. Furthermore, if the data from a corrupted system is merged with the data from other systems, then all analyses performed on the aggregated data become unreliable and useless. This holds true even when every other contributor to the system shares reliable data.

Without proper identifiers, the following may occur: data values can be assigned to the wrong data objects; data objects can be replicated under different identifiers, with each replicant having an incomplete data record (i.e., an incomplete set of data values); the total number of data objects cannot be determined; data sets cannot be verified; and the results of data set analyses will not be valid. In the past, individuals w...
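To make the idea concrete, here is a minimal sketch (in Python, with hypothetical names) of an identifier system: each object receives one permanent, unique identifier, and data values accumulate under it over time. The second half of the sketch shows the corruption described above, where the same real-world object is registered twice, leaving two replicants with incomplete records and an inflated object count.

```python
import uuid

# A minimal identifier-keyed data store. Function and variable names
# here are illustrative assumptions, not from the book.
store = {}

def register_object():
    """Create a new object and assign it a unique, permanent identifier."""
    obj_id = str(uuid.uuid4())
    store[obj_id] = {}
    return obj_id

def attach(obj_id, key, value):
    """Attach a data value to the object that owns obj_id."""
    store[obj_id][key] = value

# Correct use: one identifier, one complete record.
patient = register_object()
attach(patient, "blood_type", "O+")
attach(patient, "allergy", "penicillin")

# Corruption: the same real-world object registered twice.
# Each replicant now holds an incomplete data record, and the
# apparent number of objects in the store is wrong.
dup_a = register_object()
dup_b = register_object()
attach(dup_a, "blood_type", "O+")
attach(dup_b, "allergy", "penicillin")
```

Notice that nothing in the store itself can reveal that `dup_a` and `dup_b` describe one object; that knowledge was lost at registration time, which is why identifier discipline must be enforced before data is collected, not repaired afterward.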
Source: Specified Life - Category: Information Technology Tags: complexity computer science data analysis data repurposing data simplification data wrangling identifiers information science simplifying data taming data Source Type: blogs