Data Quality Management

Data is just a four-letter word. However, we all know how valuable it is and how big it has become. If you don’t agree with me, here are some interesting facts about data reported by Forbes in a recent article.
- The data volumes are exploding; more data has been created in the past two years than in the entire previous history of the human race.
- Data is growing faster than ever before, and by the year 2020, about 1.7 megabytes of new information will be created every second for every human being on the planet.
- Facebook users send on average 31.25 million messages and view 2.77 million videos every minute.
On its own, data is raw material that does not carry any specific meaning. Good data allows organizations to establish baselines, benchmarks, and goals to keep moving forward. Improved data quality (DQ) leads to better decision-making across an organization: the more high-quality data you have, the more confidence you can have in your decisions. Good data decreases risk and can lead to consistent improvements in results. Conversely, if data is collected from incongruous sources at varying times, it may not actually function as a good indicator for planning and decision-making. High-quality data is collected and analyzed using a strict set of guidelines that ensure consistency and accuracy. In this blog, I expand on and explain the data quality dimensions.
A data quality dimension is an aspect or feature of information that can be assessed and used to determine data quality. The following are six key data quality dimensions (a small code sketch after this list shows how a few of them can be checked in practice).
- Accuracy — Data accuracy is one of the core components of data quality. It refers to whether the data values stored for an object are the correct values. To be correct, a data value must be the right value and must be represented in a consistent and unambiguous form. Incorrect spellings of product names, person names, or addresses in a data set make the data inaccurate.
- Validity — Validity means that your data truly represents the phenomenon you claim to measure. A common example is an incorrect classification value, such as an invalid gender or customer type code.
- Timeliness — Timeliness refers to the time expectation for accessibility and availability of information. It can be measured as the time between when information is expected and when it is readily available for use. An example is a customer address change effective July 1st that is entered into the system on July 15th.
- Completeness — Data completeness refers to the degree to which all data in a data set is available. A common measure of completeness is the percentage of missing data entries. For example, a customer address missing its zip code makes the record incomplete.
- Uniqueness — A discrete measure of duplication of identified data items within a data set, or in comparison with their counterparts in another data set that follows the same information specifications or business rules. For example, a single customer recorded twice in the database under different identifiers creates duplication and breaks uniqueness.
- Consistency — Data is represented consistently across the data set. Think of a closed customer account that still has a new order associated with it; this is an example of inconsistency in the captured data.
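To make a few of these dimensions concrete, here is a minimal sketch of how completeness, uniqueness, and validity might be checked with pandas. The customer table, column names, and rules are purely illustrative assumptions, not part of any particular tool.

```python
import pandas as pd

# Hypothetical customer records; column names and values are illustrative only.
customers = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "name": ["Alice", "Bob", "Bob", "Carol"],
    "zip_code": ["98052", None, "98052", "9805"],   # missing and malformed values
    "account_status": ["open", "closed", "open", "open"],
})

# Completeness: percentage of missing entries per column.
missing_pct = customers.isna().mean().mul(100).round(1)
print("Missing %:\n", missing_pct)

# Uniqueness: the same customer may appear under two different identifiers.
duplicates = customers[customers.duplicated(subset=["name"], keep=False)]
print("Possible duplicates:\n", duplicates)

# Validity: zip codes must match a 5-digit pattern.
valid_zip = customers["zip_code"].str.match(r"^\d{5}$", na=False)
print("Invalid zip codes:\n", customers.loc[~valid_zip, ["customer_id", "zip_code"]])

# Accuracy and consistency checks usually need a reference source or a business
# rule, e.g. flagging orders placed against accounts whose status is 'closed'.
```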
The following four-step process helps to enforce and maintain data quality (a minimal end-to-end sketch follows the list).
- Define DQ Requirements — This includes performing data profiling. Data quality profiling is about gathering statistical information about the data in order to discover value frequencies and formats. Profiling can be performed using specialized tools or the query languages supported by the data sources. Although some data quality problems can be discovered during profiling, its main aim is to provide input for the data quality assessment.
- Conduct DQ Assessment — As the name suggests, this includes defining data quality rules (for accuracy, validity, timeliness, etc.) and quality thresholds. Once the rules are defined, perform the assessment by running the DQ rules against the existing data set and identifying data quality issues. Make sure you record the issues you find in your issue log.
- Resolve DQ Issues — Perform root cause analysis to identify the real issues impacting data quality. Once they are identified, take corrective action to eliminate the root cause, which may include reviewing and fixing data policies and procedures and/or fixing the data input/ingestion process.
- Monitor and Control — Define and populate data quality scorecards and monitor data quality.
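To make the four steps concrete, here is a minimal sketch in Python with pandas. The order data, rule names, and thresholds are hypothetical assumptions; in practice they would come from your own DQ requirements and profiling results.

```python
import pandas as pd

# Hypothetical order data with a missing key and a suspicious amount.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [101, 102, None, 104],
    "amount": [25.0, -10.0, 40.0, 15.0],
})

# 1. Define DQ requirements: profile the data to understand it first.
profile = orders.describe(include="all")

# 2. Conduct DQ assessment: rules as (name, mask of failing rows, allowed failure %).
rules = [
    ("customer_id is populated", orders["customer_id"].isna(), 1.0),
    ("amount is non-negative", orders["amount"] < 0, 0.0),
]

issue_log = []
for name, failures, threshold in rules:
    failure_pct = failures.mean() * 100
    if failure_pct > threshold:
        # 3. Resolve DQ issues: log breaches for root cause analysis.
        issue_log.append({"rule": name, "failure_pct": round(failure_pct, 1)})

# 4. Monitor and control: a simple scorecard of pass rates per rule.
scorecard = pd.DataFrame(
    [{"rule": name, "pass_pct": round(100 - failures.mean() * 100, 1)}
     for name, failures, _ in rules]
)
print(pd.DataFrame(issue_log))
print(scorecard)
```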
There are several tools available on the market that can help maintain data quality. Typically, these tools perform data profiling, assessment, and resolution through data transformation, and provide visibility through scorecards and dashboards. The following are high-level requirements to consider when choosing a tool. However, a specific tool may not be a good fit even though it fulfills all of these requirements, because of data source compatibility, integration capabilities, tool complexity, cost, compliance, and so on.
Key data quality tool requirements are (a small scorecard sketch follows the list):
- Ability to conduct data profiling, including statistical analysis of data sets.
- Ability to define and execute data quality rules for critical data elements that are subject to data quality checks.
- Ability to store data quality profiling and assessment results.
- Ability to conduct issue resolution processes and discover issue patterns.
- Ability to create and visualize data quality scorecards.
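As a rough illustration of the last requirement, here is a minimal sketch of a scorecard visualization with pandas and matplotlib. The dimension scores and the 95% threshold line are made-up numbers standing in for the output of a real assessment run.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical per-dimension scores produced by a DQ assessment run.
scores = pd.DataFrame({
    "dimension": ["Accuracy", "Validity", "Timeliness",
                  "Completeness", "Uniqueness", "Consistency"],
    "score_pct": [97.5, 92.0, 88.0, 95.0, 99.0, 90.5],
})

ax = scores.plot.barh(x="dimension", y="score_pct", legend=False)
ax.axvline(95, linestyle="--", color="red")  # assumed quality threshold
ax.set_xlabel("Quality score (%)")
ax.set_title("Data quality scorecard")
plt.tight_layout()
plt.show()
```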
If you are working on a data quality management project or planning to implement data quality, consider these basics. I hope this article clarifies your questions about what data quality is and how to maintain it. Please share your thoughts in the comment section below.
Disclaimer: I work for @Microsoft Azure Cloud & my opinions are my own.