The quality of data cannot be stated in absolute terms as it is relative to the intended application.
The same set of data may be considered to be of good or poor quality depending on the situation or application. Hence the saying that “all accurate data is not created equal.”
Data is considered to be “high-quality” if it is suitable for its intended use. Before we can judge whether it is truly accurate, we need to consider its correctness, timeliness, relevance, and completeness, and whether it is easily understood and trusted.
How correct is it for the purpose?
For example, consider a database of accountants that contains the names, telephone numbers, email addresses, and mailing addresses of CPAs. You are aware that there are errors in the records: some names, emails, or phone numbers are wrong. Let’s say that about 20% of the records have errors in them. Is this good quality or poor quality data? If this database is to be used to inform the accountants about a change in regulatory compliance, then an accuracy of 80% is too low, because one in five accountants might never receive the notice. On the other hand, if the same database is to be used for a marketing campaign, the accuracy may be satisfactory.
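The comparison in this example can be sketched in a few lines of Python. This is only an illustration: the threshold values, the record count of 1,000, and the function names are assumptions made for the sketch, not figures from any standard.

```python
# Hypothetical minimum accuracy each use case can tolerate (illustrative values only).
REQUIRED_ACCURACY = {
    "regulatory_notice": 0.99,
    "marketing_campaign": 0.75,
}


def accuracy(total_records: int, records_with_errors: int) -> float:
    """Share of records that contain no known error."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return 1 - records_with_errors / total_records


def fit_for_purpose(measured_accuracy: float, purpose: str) -> bool:
    """True when the measured accuracy meets the threshold assumed for this purpose."""
    return measured_accuracy >= REQUIRED_ACCURACY[purpose]


# 20% of an assumed 1,000 records have errors, as in the accountants example.
measured = accuracy(total_records=1000, records_with_errors=200)
for purpose in REQUIRED_ACCURACY:
    print(purpose, fit_for_purpose(measured, purpose))
```

The same measured accuracy passes one check and fails the other, which is the point: the number means little until it is compared with what the application needs.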
Timeliness also affects the accuracy of data
The quality of data can be affected by the time when it’s captured and how frequently it’s updated. Let’s consider a situation where you are collecting data from a sales team in order to report the status of the sales pipeline. Individual salespersons need to manually enter various details. Some of them are updating these details in a very timely manner, but there is a considerable delay on the part of some others. At the time when your report is generated, if the data is incomplete then its accuracy is too low for reporting. Reports based on such data could be misleading. However, subsequently, once all the relevant data has been added, the data may be considered to be of good quality. In this way, timeliness affects the accuracy of the data.
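One way to catch this before a report goes out is to check how old each salesperson's latest update is at report time. The sketch below uses made-up names and timestamps and an assumed seven-day freshness limit; a real pipeline would read these from the sales system.

```python
from datetime import datetime, timedelta

# Hypothetical last-update timestamps per salesperson (made-up data).
last_updated = {
    "alice": datetime(2024, 3, 29, 17, 0),
    "bob": datetime(2024, 3, 12, 9, 30),
    "chen": datetime(2024, 3, 28, 11, 15),
}


def stale_entries(updates, report_time, max_age):
    """Return the names whose latest update is older than max_age at report_time."""
    return sorted(
        name for name, ts in updates.items() if report_time - ts > max_age
    )


report_time = datetime(2024, 3, 31, 9, 0)
# Assumed rule: pipeline entries older than 7 days are too stale to report on.
stale = stale_entries(last_updated, report_time, max_age=timedelta(days=7))
print(stale)
```

If the list is not empty, the report can be delayed or marked as provisional instead of being presented as complete.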
Is the data relevant to the objectives?
A dataset that contains correct data may not be suitable for a specific requirement if it doesn’t capture the metrics that are relevant to the objectives. For instance, let’s say we are conducting a study to understand why some patients who are discharged after a hospitalization need to be hospitalized again within a few weeks. For this study, we collect data about many different health parameters. However, we don’t capture the name of the hospital to which each patient was admitted. This means that if there is any correlation between the hospital and the need for re-hospitalization, our study would not find it. In this case, the data that we have used may be correct, but it is not fully relevant, and so it’s of poor quality for our objective.
Completeness
A dataset that is only partially filled can skew analysis and produce misleading results. As an example, let us consider a product distribution manager who gathers data about sales and stocks from various retailers. An analysis of this data enables decisions related to stocking and distribution logistics. However, some retailers capture the required data diligently while others do so only partially. In this case, although the database is correctly designed to capture all relevant metrics, data entry is incomplete, so the available data cannot be used for decision making and is therefore of poor quality.
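Completeness can be measured per retailer before any analysis is run. The following is a minimal sketch with made-up submissions; the field names `units_sold` and `units_in_stock` and the retailer names are hypothetical.

```python
REQUIRED_FIELDS = ("units_sold", "units_in_stock")  # hypothetical field names

# Made-up submissions; None marks a value the retailer did not enter.
submissions = {
    "retailer_a": [
        {"units_sold": 40, "units_in_stock": 12},
        {"units_sold": 35, "units_in_stock": 9},
    ],
    "retailer_b": [
        {"units_sold": 22, "units_in_stock": None},
        {"units_sold": None, "units_in_stock": None},
    ],
}


def completeness(rows):
    """Fraction of required fields that are filled in across all rows."""
    total = len(rows) * len(REQUIRED_FIELDS)
    if total == 0:
        return 0.0
    filled = sum(
        row.get(field) is not None for row in rows for field in REQUIRED_FIELDS
    )
    return filled / total


report = {name: completeness(rows) for name, rows in submissions.items()}
print(report)
```

A per-retailer score like this shows where the gaps are, so the manager can chase the missing entries or exclude those retailers from the analysis.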
Can users understand the data?
Data that is not organized and documented unambiguously, so that users can easily understand it, is not helpful for decision making. Let’s say you want to analyze sales trends over different seasons and months, but you aren’t aware that sales are booked only after payment is received, and that the time between order booking and revenue realization could be 6 to 8 weeks. When you receive the data and start analyzing trends, the results will not be meaningful, because every sale appears weeks later than the order was actually placed. This happens because you had no way of understanding the nature of the data. For the data to be clearly understood, order booking should be reported as well as revenue realization. In its current form, the data is not sufficiently accurate for your purpose.
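The effect of this lag on a monthly trend can be shown with a small sketch. The orders below are made up, and the seven-week lag is an assumed value inside the 6 to 8 week range mentioned above.

```python
from collections import Counter
from datetime import date, timedelta

# Made-up orders: (order booking date, amount). Values are hypothetical.
orders = [
    (date(2024, 11, 20), 500),
    (date(2024, 12, 5), 300),
    (date(2024, 12, 18), 700),
]
PAYMENT_LAG = timedelta(weeks=7)  # assumed lag between booking and payment


def monthly_totals(dated_amounts):
    """Sum amounts per (year, month)."""
    totals = Counter()
    for day, amount in dated_amounts:
        totals[(day.year, day.month)] += amount
    return dict(totals)


by_booking = monthly_totals(orders)
by_realization = monthly_totals((day + PAYMENT_LAG, amt) for day, amt in orders)
print(by_booking)
print(by_realization)
```

Grouped by booking date, the sales fall in November and December; grouped by realization date, the same sales appear in January and February. An analyst who sees only the second view would attribute the holiday-season demand to the wrong months.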
Can the data be trusted?
Data analysts can confidently analyze and present findings only when they can trust the data they are using. In order to decide how much they trust the accuracy of the data, they may need to know its source, the data collection methods used, and when it was collected. The reputation of the organization that gathers or publishes the data also matters, since it may be perceived as trustworthy or not. Data that can be trusted is of good quality, while data that is viewed with some suspicion is not good enough for the purpose.
As we saw, the accuracy of data depends on the objectives for which it is being used, so “all accurate data is not created equal.” Data that is correct, timely, relevant, complete, easy to understand, and trusted is high-quality data that helps to achieve defined objectives.