Why Completeness as a First Step

When faced with a number of data quality issues and urgent stakeholder requests to improve quality, it can be tempting to dive right in and clean the data. This, however, doesn't get to the root of the problem, and the same types of problems end up being resolved over and over again. That is not only ineffective but also demoralizing when staff spend so much time just to get back to where they started.

More often than not, the foundational step is to improve quality in a strategic way: fixing processes and solving the root cause upstream at the source, and doing so with a focus on one dimension of quality, Completeness.

Now before we go too far, let's look back at the definition of Completeness (using the Conformed Dimensions of Data Quality) in order to see what underlying quality concepts exist.
Completeness Definition: "Completeness measures the degree of population of data values that exist in a data set. (example: columns and rows)."

As seen here, Completeness is focused on the existence of the data, and as such it underpins quality as a whole. If the data doesn't exist then it's impossible to critique it along other dimensions such as validity, integrity and others.
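To make the "degree of population" in the definition above concrete, here is a minimal Python sketch. It is an illustration only, not an official Conformed Dimensions metric: the function name population_rate, the sample fleet table, and the choice to treat None and the empty string as "not populated" are all assumptions made for this example.

```python
from typing import Any, Optional


def population_rate(rows: list[dict[str, Any]], column: str) -> Optional[float]:
    """Return the share of rows in which `column` holds a value.

    Assumption for this sketch: None and "" count as not populated.
    """
    if not rows:
        return None  # no rows at all, so the rate is undefined
    populated = sum(1 for row in rows if row.get(column) not in (None, ""))
    return populated / len(rows)


# Hypothetical sample data: a small table of vehicles in a fleet.
fleet = [
    {"vehicle_id": "V1", "make": "Ford", "value": 20000},
    {"vehicle_id": "V2", "make": "", "value": 15000},
    {"vehicle_id": "V3", "make": "Toyota", "value": None},
    {"vehicle_id": "V4", "make": "Honda", "value": 18000},
]

for column in ("vehicle_id", "make", "value"):
    print(column, population_rate(fleet, column))
```

A rate like this only describes the rows that are present. It says nothing about rows that never arrived, which is a separate question.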

So where do we start in terms of measuring Completeness? This is where the Conformed Dimensions come in handy: each dimension has documented underlying concepts that explain a phenomenon. Here are the concepts within Completeness:
 • Record Population
 • Attribute Population
 • Truncation
 • Existence

In future posts, each of these topics will be addressed, but in brief, a data set will never be as useful to consumers as it could be if any of these aspects falls short of the level of quality needed. This is particularly obvious with Record Population, where whole rows are missing.

It isn't always the case, but let's assume a table of data is based on some sort of normalization, or intended granularity. If a whole row is missing (not just a few attributes in that row), then the "thing" represented by that row (like a car in a table listing the vehicles in a fleet) is likely missing from your analysis. Lacking the full picture can undermine your overarching objective. If you want to know how much your fleet is worth, or how big it is (because you have to move it), your analysis will be incorrect because of that missing row.
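The fleet example can be sketched in a few lines of Python. This is a hedged illustration: it assumes you have some authoritative list of vehicles that should exist (here a hypothetical registry_ids list), which the table can be compared against. The function name missing_records and all data values are made up for the example.

```python
from typing import Any


def missing_records(expected_ids: list[str],
                    rows: list[dict[str, Any]],
                    key: str = "vehicle_id") -> list[str]:
    """Return the expected identifiers that have no row in the table."""
    present = {row[key] for row in rows}
    return sorted(set(expected_ids) - present)


# Hypothetical reference list of every vehicle the fleet should contain.
registry_ids = ["V1", "V2", "V3", "V4"]

# Hypothetical fleet table in which the whole row for V3 is absent.
fleet_table = [
    {"vehicle_id": "V1", "value": 20000},
    {"vehicle_id": "V2", "value": 15000},
    {"vehicle_id": "V4", "value": 18000},
]

missing = missing_records(registry_ids, fleet_table)
record_population = 1 - len(missing) / len(registry_ids)

# The total is computed only from rows that exist, so it silently
# leaves out whatever the missing vehicle is worth.
total_value = sum(row["value"] for row in fleet_table)

print("Missing vehicles:", missing)
print("Record population:", record_population)
print("Fleet value from available rows:", total_value)
```

Note that nothing in the table itself signals the problem: the sum runs without error. The gap only shows up when the rows are compared against an independent expectation of what should be there.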

It is best to focus first on Completeness, answering the question of whether you have the full picture in terms of rows. Only with this in hand can one then evaluate Attribute (column) Population, which in turn supports evaluating data quality along other dimensions such as Validity, Currency and ultimately Accuracy.

What do you think? Email your thoughts to the author of this blog, Dan Myers (dan[at]DQmatters.com).
• Do you think Completeness is a good place to start the data quality journey? What examples of this do you have from organizations following this strategy or another dimension focused strategy?