The IT department has just migrated 400,000 accounts from the legacy ERP system onto a new, bright and shiny, system sold by a large software vendor, but the Sales team is mortified. The Sales team has discovered that all of the Sales Notes fields (in addition to others) have been Truncated to 255 characters. The sales agents use the end of this text field to record all of the "juicy" (or current) leads and now this valuable information is no longer accessible...
We’ve been there, seen the effects of data truncation during a data migration, but how could the stakeholders have collaborated using detailed descriptions of expected quality in order to ensure a successful outcome? I find it works out best if the level of quality expected of an attribute is communicated among all parties. But how do you do that with such short timelines and without a list of commonly understood definitions?
Answer: Use the Conformed Dimensions of Data Quality which spells out categories called Dimensions and sub-topics called Underlying Concepts. For example, in the case above, if the data consumers of the Sales_Notes field (both Sales Agents and downstream users) had identified their data quality Service Level Agreement (SLA), just like for other IT services, they could have chosen to include the following Underlying Concepts (see outline below). This would have alerted the data migration team to preserve the length of the sales notes field.
The Completeness dimension within the Conformed Dimensions holds the following underlying concepts.
•Record Population- This measures whether a row is present in a data set (table),
•Attribute Population- This measures whether a value is present (not null) for an attribute (column),
•Truncation- This measures whether a data set includes all of the rows required to sufficiently represent the facts that the data set portrays,
•Existence- Existence identifies whether a real-life fact has been captured as data.
In a prior post, Record Population was discussed. In today's post we’ll discuss Attribute Population and Truncation concepts. The data migration team described in the scenario above will likely defend their position by stating that the sales notes column was moved into the new ERP system (which is factually true), and may argue that the vendor’s software does or does not enable notes longer than 255 characters. But imagine if the stakeholders on all sides would have known that this field expects full “Completeness” as defined by all of the underlying concepts of the dimension. This would have included Truncation, which measures “whether the value contains all characters expected.” If it isn’t clear that the number of characters expected is all available in the legacy system, then additional expectations should be set at the stakeholder level.
This might even mean that depending on the context, different consumers (Marketing department, for instance, might only need the first 100 characters, but the Sales department may need the full 255 characters). I know, at this point you’re saying that is too much work---to define your data quality levels by department using some fancy framework of descriptions for quality…
I feel your frustration, but from my experience, many organizations have a few data savvy users that know exactly what they need, and if asked, they are able to select from a list of available fields and identify those they use. This means they just need a drop down selection for the quality levels expected for those fields. That is what the Conformed Dimensions provides- a way to overlay you data quality requirements on each variable consumed.
What do you think? Email your thoughts to the author of this blog, Dan Myers (dan[at]DQmatters.com).
• Would the Conformed Dimensions work for you (your organization)? Why or why not?
• Does your organization measure Truncation and under which dimension does it categorize this concept?