Assessing Data Veracity
May 7, 2019
There are different takes on what veracity refers to, but the overall consensus is that data veracity reflects the truthfulness of a data set and your level of confidence or trust in it. I'll take this a step further and say that data veracity is your level of confidence/trust in the data based on its provenance as well as the data processing method.
Think about this: when you get a box of chocolates that you haven't tried before, how do you estimate how good it is? The first step is to look at where it was made, and by what shop or brand. You can mainly assess its quality by its provenance. As a second step, you probably also want to ensure that after you open the box, you won't taint the chocolates somehow before you taste them.
Data veracity helps us better understand the risks associated with analysis and business decisions based on a particular big data set.
Looking at a data example, imagine you want to enrich your sales prospect information with employment data: where those prospects work and their job titles. Not only can this provide you with additional contact data, but it can also help you create different market segments and do a better job of serving them.
LinkedIn collects lots of employment data, but unfortunately you can't purchase it from them. So what can you do? You might go to a third-party provider who claims to scrape LinkedIn data from search engine results (a legally grey area in my opinion; I'm not a legal expert, so let's just treat this as a theoretical example). You might consider purchasing this LinkedIn employment data, but how do you gauge its veracity?
Well, you need to ask the service provider these questions:
- Who created and contributed to the data source?
- When was the data collected?
- Was the original data source enriched in any way?
- What methodology do they follow in collecting the data?
- What algorithm do they use to match records and what are the matching confidence levels?
- Were only certain industries or locations included in the data source?
- Has the information been edited or modified in any way?
- Did the creators summarize the information?
Then, once you have answers to all these questions, you will also need to understand how, where, and when you will integrate this data with your own. What definitions, what extract, transform, and load (ETL) procedures, and what business rules will you follow?
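To make the matching and business-rule questions more concrete, here is a minimal Python sketch of one way an integration step could work. Everything in it is an assumption for illustration: the `VendorRecord` shape, the `enrich` function, and the `min_confidence` and `max_age_days` thresholds are hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for whatever matching algorithm a real vendor uses.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class VendorRecord:
    """Hypothetical shape of one purchased employment record."""
    full_name: str
    employer: str
    job_title: str
    collected_on: date  # when the vendor says the data was collected


def match_confidence(a: str, b: str) -> float:
    """Similarity of two names in [0, 1], ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def enrich(prospect_name: str,
           vendor_records: list,
           today: date,
           min_confidence: float = 0.9,   # assumed business rule
           max_age_days: int = 365) -> Optional[dict]:  # assumed business rule
    """Return the best vendor match, or None if it fails the business rules."""
    best, best_score = None, 0.0
    for rec in vendor_records:
        score = match_confidence(prospect_name, rec.full_name)
        if score > best_score:
            best, best_score = rec, score

    # Rule 1: reject weak matches instead of guessing.
    if best is None or best_score < min_confidence:
        return None
    # Rule 2: reject records that are too old to trust.
    if (today - best.collected_on).days > max_age_days:
        return None

    # Keep the confidence and collection date next to the enriched fields,
    # so later analysis can see how trustworthy each value is.
    return {
        "employer": best.employer,
        "job_title": best.job_title,
        "match_confidence": round(best_score, 2),
        "collected_on": best.collected_on.isoformat(),
    }


if __name__ == "__main__":
    records = [VendorRecord("Jane Doe", "Acme", "Engineer", date(2019, 1, 15))]
    print(enrich(" jane doe", records, today=date(2019, 5, 7)))
```

The design choice worth noting is that the match confidence and collection date travel with the enriched record rather than being discarded after the join, which keeps the provenance visible to whoever uses the data next.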
Answers to these questions are necessary to determine the veracity of this big data source. To expand on the employment data example, what if your customer base only included lawyers? Well, then you wouldn't choose LinkedIn as your data source but would rather go to the American and/or Canadian Bar Association. Why? Because for this type of data, the bar associations' records have higher veracity than a self-reported source.
Veracity suffers from human bias and error, a lack of data governance and data validation, software bugs that can lead to duplication and variability, volatility, and a lack of security. We all agree, at least in theory, that these matter and should be addressed, but the reality is that not all data vendors monitor these variables closely enough to fully address them and follow the trifecta of data quality management. That probably helps explain why the IBM Big Data & Analytics Hub estimates that poor data costs the US economy $3.1 trillion every year.
Veracity is rarely achieved in big data due to its high volume, velocity, variety, variability, and overall complexity. Still, we can take solace in knowing that understanding a data set's veracity helps us better understand the risks associated with analysis and business decisions based on it. So, find out as much as possible about your data sources, big and small, to better gauge their veracity.