A few years ago I picked up a book called Zen and the Art of Motorcycle Maintenance. A friend had recommended it to me with very few details other than, “You would like it.” What I found was not a manual about keeping your cool while tackling tricky motorcycle problems, but a treatise on quality. The author, Robert Pirsig, wove a beautiful tale of a man driven mad attempting to understand “Quality” and his cross country motorcycle road trip towards reconciliation. My friend was right, and I started thinking about how quality impacted everything. How we evaluate things and make decisions is all built on the foundation of Quality.
Recently, I was engaged to assess quality of data used by a reporting application. Pirsig defined quality as the ability to discern between “good” and “bad” at that particular moment in time. There is the Subjective view based on “personal opinions, assumptions, and beliefs,” and there is the Objective view based on “observations of measurable facts.”
With my limited business domain knowledge, I knew it was too early to look at the data using a Subjective view, but the Objective view would give me the insights required to discern the “good” data from “bad” data. I chose to concentrate my research to three Objective attributes of quality: Completeness, Consistency, and Accuracy.
When assessing something for quality, we must first see if what we wish to evaluate is present and whole. When looking at a particular data set it’s important to check, sometimes even at a column level, if the data you are looking for is complete.
Take, for example, a customer table that has 100 records, but the “Last name” column is only populated for 75 of those 100 records. We have a significant “Completeness” data quality issue where we are missing 25% of our customer’s last names in our customer table. This is a simple test, but a critical metric that may reveal larger problems in the sourcing, processing, or insertion of data.
It’s important to remember that “Completeness” quality will depend on the field you are looking at. In the above example we could expect “Last Name” to be populated at least 95% of the time, and raise alarms for data sets that had “last name completeness” lower than that. However, we may expect the “Middle Name” field to only be populated 25% of the time. We may find that “Middle Name” is only populated 15% of the time, but the field is deemed optional/lower priority and thus unimportant.
Once we’ve identified if data is present, we need to validate that each unique value of data represents and communicates unique information. Imagine we have a product table with a column that describes the color of a product. For simplicity sake, the product can be one of four colors: Red, Green, Blue, or Black. When we start looking at the data, we notice values like this:
You data analysts out there are smiling with familiarity right about now. It’s a simple enough data set, but it has a myriad of consistency issues. See the last column below:
Out of the six values above, only one of them is perfect. The remaining data that has consistency issues may indicate problems with the sourcing or processing of data.
When evaluating for Consistency it’s important to evaluate each value for its purpose. Does “0”, NULL, or “blank space” convey the same thing, or is there a deliberate reason as to why a field may have a “0” sometimes, and be NULL others? Is there a more appropriate and distinct value we can use to represent that information?
Assessing the quality by “Accuracy” is often the trickiest and most involved. This is because when we are trying to deem something as “Accurate” we find ourselves evaluating quality “Subjectively”, with our opinions and experiences and not “Objectively”, with data. In truth, both perspectives are critical in assessing overall quality of data, but the “Subjective” perspective requires in-depth knowledge of either the business (and sometimes technical) processes behind the data.
A great place to begin “Accuracy” analysis would be from the “Objective” viewpoint. First, evaluate each data point from a high level. Ask, “What would we expect the data to show?” This will often require the input of a business user or SME. If you don’t have access to those, break out the old statistics textbook and identify the mean and standard deviation of specified values. For example: if the SME says that 95% of the time, they expect the “Total Paid” field on a “Sales” table to be between $25 and $100, and we find that 40% of the time the data is greater than $120 then we may have a significant accuracy issue. Another example of objectively assessing “Accuracy” would be looking at dates or timestamps. It may be reasonable to assume that a date is supposed to represent when a particular action happened wouldn’t be in the far past (say before the company or even computers existed) or in the far future. To be fair, the data may be correct, but the SME’s understanding of the data may have to be readjusted given the findings.
Another method to assess accuracy is to take a look at the “metadata” of the data you are assessing and establish expected patterns. For example, a “Product Name” is expected to always be two letters followed by a three digit number (example: DQ321). However, when we look at the metadata we might find that the majority of product names are three letters followed by two digits (DQP21) and we might consider that an issue with the “Accuracy” of that data point. Again, to evaluate for good or bad, we would require either input from a business process SME or derive the “norm” from statistical analysis.
Finally, another way to objectively assess quality of the data is to look at the relationship of data in relation to events. For example, if we are looking at a transactional system, we may expect there to be a life cycle in place:
- A customer makes their selection
- A customer pays for their selection
- A customer leaves with their selection
Our data might tell one of the following two stories:
- A customer pays for their selection.
- A customer makes their selection.
- A customer leaves with their selection.
- A customer makes their selection.
- A customer pays for their selection.
This doesn’t make sense (unless they were a terrible thief with a guilty conscience). By classifying and identifying the frequency of these incorrect life cycle events, it is possible to assess the “Accuracy” of the data present. The above scenario is a simple one and most likely indicates an error in the processing or insertion of the data (hint: check your source date fields and time zones). However, large frequency of other life cycle event “inaccuracies” may representing what is actually happening in real life and indicative of a larger issue with process or policies upstream of the data. Hence, the importance of partnering with the business SMEs to identify and classify scenarios that would indicate “good” or “bad” data quality.
You’ll notice that as we began to evaluate from Completeness to Consistency to Accuracy, we started to move away from purely “Objective” evaluation to “Subjective” evaluations based on input from business users and SMEs. As we delved into the “Accuracy” attribute it became critical to partner with the business in order to make sound “Quality” judgments. Ideally, throughout the entire “Objective” evaluation process, the quality analyst would have been getting comfortable enough with the data and processes behind the data that they could begin to make sound “Subjective” judgments on the quality of the data. The more we live in the data, the easier it is to identify “good” data. One of my favorite quotes of Pirsig’s echoes this sentiment: “Care and Quality are internal and external aspects of the same thing. A person who sees Quality and feels it as he works is a person who cares. A person who cares about what he sees and does is a person who’s bound to have some characteristic of quality.”