The Winding Road to Data Quality
Aug 22, 2018
Data quality, defined as “a perception or an assessment of data's fitness to serve its purpose in a given context,” might seem straightforward at first but is actually rather difficult to evaluate and maintain. This is in part because the threshold dividing high-quality data from low-quality data depends on a number of variables, including the data’s intended use (or uses) and context (which may change over time) as well as a subjective understanding of the data’s accuracy, completeness, and reliability.
The clean, uncomplicated datasets we encounter early on in our careers may help us grasp data quality fundamentals but are unlikely to prepare us for the more nuanced scenarios we can expect to encounter later on. Let’s look at how basic quality control practices mature in these more complex data environments.
Stage 1: Referential Integrity
When you first start working with relational databases, you’ll come across something called a referential integrity check. It’s a quality control feature built right into the database that enforces the relationships between tables: “any foreign key field must agree with the primary key that is referenced by the foreign key.” If you’re unfamiliar with these terms and missed our post on table joining, here’s a classroom-style example to illustrate the concept.
Let’s imagine we’re a medical office, and we have patient records stored in a number of database tables, among which are the PATIENT table and the ADDRESS table. Because our PATIENT table references the ADDRESS table, linking each patient to his or her address, our database will require that we supply an address key in order to add a new patient to the system.
Given the address table below, the database would bar us from adding the highlighted patient record because its address id does not correspond to an address id in the ADDRESS table.
The referential integrity constraint is predicated on the idea that quality data exists where it is supposed to exist. Because address #22222 isn’t where it’s supposed to be (in the ADDRESS table), the integrity of the entire Jess Keating record is questioned.
But what if our fictitious medical office is actually an emergency room at a hospital? Perhaps Jess Keating came in on a gurney in critical condition, and although we know her name from the student ID card on her person, we don’t know her address and cannot ask her for it. We need to be able to add her to the system without an address.
This might seem like an extreme case, but unknowns pop up all the time in the real world. Jess could just as easily have shown up as an outpatient and simply neglected to give her address before being called in to see the doctor. Or perhaps she was giving her address over the phone but got cut off.
Why not just leave her address as blank or NULL, then? The absence of a value isn't the same thing as a conflicting value, is it?
Well, this depends on context, and for the purposes of our example, a non-existent address is a value. Moreover, it's an incorrect value, as Keating does live somewhere.
It is not in our best interest to populate tables with false information, and yet the practice is pretty common in the real world. This is a simplistic example compared to the kinds of unknowns that routinely surface in high-volatility industries like finance and insurance, where the repercussions of entering a false empty or NULL value can be costly indeed. In these cases, insufficient data is preferable to the alternative—even though it lacks completeness—because it is more accurate and therefore more appropriate to the business context. We have progressed from a referential integrity issue to a data integrity issue.
Stage 2: Data Integrity
Where in Stage 1 we were most concerned with the integrity of our table references, we are now going to loosen those requirements in order to prioritize data integrity—that is, the internal consistency and lack of corruption in the data itself.
Let’s refer back to our hospital example for a moment. If Jess Keating is currently undergoing a life-saving procedure in our emergency room, it is important that we have record of her, even if that record is incomplete. Failure to record her visit as accurately as possible could have catastrophic consequences, so in order to bypass our database’s referential integrity check, we need to implement an override.
In our case, the override would take the form of a stub record, a temporary stand-in for the real address we don’t yet have.
This dummy record satisfies the database’s referential integrity requirements while simultaneously accounting for the fact that we have a patient by the name of Jess Keating with no known address.
But now we have another data quality problem on our hands. Jess Keating’s address, by being blank, is effectively incorrect. Ordinarily, no data entry would be preferable to false data entry, but in this case, circumstances and the referential integrity constraint force our hands. The stub record must eventually be corrected, not only to restore the record’s integrity but also to keep from triggering a series of patient processing issues. The hospital needs to instate a system that will ensure that the record gets filled. We now progress from data integrity to data quality.
Stage 3: Data Quality
So far, we’ve gone from determining that our tables are accurate to verifying that our data is accurate, but to assess all the factors contributing to data quality—accuracy, completeness, validity, and relevance among them—we must actually understand what the information is for. This, in the data world, is referred to as “semantics.” Whoever designs the processes that will ensure Jess Keating’s address is collected and added to the database will have to understand who needs that information and when and why. We are all familiar with the utility of an address, but not all data points are as transparent to the layperson. Data management teams need to understand what the data means and how its being used in order to safeguard its quality.
In our example, the data quality check might be that the receptionist at the ER’s discharge desk be required to fill in any vacant fields before the patient is released. The hospital could rely on receptionists to scan all the necessary paperwork and identify where information is missing, but this leaves a lot of room for human error. An application programmed to perform this scan on the receptionist’s behalf is better equipped to catch all the missing information, and it is this program that a data quality analyst might design based on the hospital’s processing needs.
The need for an automated quality check compounds when we introduce more complex data quality concerns, such as duplicate records. Let’s say this isn’t Jess Keating’s first visit to the hospital; she’s already in the system under Jessica Keating. We don’t realize this at first because all we have is her student id, which identifies her as Jess. When Jess is discharged from the ER and we collect her address, it’s possible we’ll get something different than what we have on file for Jessica; she may have moved recently.
In this scenario, the patient’s social security number would lead us to consolidate the duplicate records, but not all real-world scenarios will have such a convenient unique identifier! In less certain cases, data stewards may need to run a program checking for patterns in multiple fields (e.g., records with the same address, same last name, and same first letter of first name), flagging the potential duplicates for investigation.
Subject matter experts and data specialists need to work together to ensure data quality at the highest level. Referential integrity and data integrity are merely stepping stones on the path to a top-quality data system that can accommodate real-world information requirements.
Originally published on Springboard.