Is big data always better data? In other words, is it always helpful to have more data? Not necessarily — and here’s why.
From the time we first got excited about big data, we were told that the magic lies in finding meaningful insights in large data sets: data so voluminous that it was impossible to process until modern computational and software tools became available. But on closer examination, we need to ask whether we really get the best analytics simply because the data is large.
The sheer size of big data can make it difficult to handle. Sometimes the goals and expectations are too large, the effort to manage and process is too much, and the insights are hidden below too large a heap. Does smaller data make more sense in some situations?
The bigger the data, the bigger the effort to manage it. What does this mean? Large volumes of data may not be usable due to problems such as inconsistent formats and incomplete records. Big data experts who are expected to uncover strategic insights for the business are instead forced to spend their time on data cleaning and management just to make the data usable.
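To make the cleaning burden concrete, here is a minimal sketch of the kind of repair work described above. The records, field names, and date formats are all invented for illustration: the point is that every row must be normalized or discarded before any analysis can begin.

```python
from datetime import datetime

# Hypothetical raw records with inconsistent date formats and a missing
# value -- exactly the problems that consume analysts' time up front.
raw_records = [
    {"id": 1, "date": "2023-01-15", "revenue": "1200"},
    {"id": 2, "date": "15/01/2023", "revenue": "980"},
    {"id": 3, "date": "2023-01-16", "revenue": None},  # incomplete row
]

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

def parse_date(text):
    """Try each known date format until one parses; None if all fail."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

def clean(records):
    """Normalize dates, coerce revenue to float, drop unrepairable rows."""
    cleaned = []
    for rec in records:
        date = parse_date(rec["date"])
        if date is None or rec["revenue"] is None:
            continue  # this row cannot be repaired, so it is dropped
        cleaned.append({"id": rec["id"],
                        "date": date.isoformat(),
                        "revenue": float(rec["revenue"])})
    return cleaned

print(clean(raw_records))  # two usable rows survive out of three
```

Even this toy pipeline needs per-format parsing logic and an explicit policy for incomplete rows; real data sets multiply both problems.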
Larger volumes of data also mean that you need more security measures in place and will spend higher amounts on data storage.
Sometimes, less is more
There are some situations when it is far more helpful to have smaller data sets that are collected with the end objective in mind. Analysts then know the right models that should be applied in order to find patterns, trends, causes, and correlations.
Let’s consider companies that have implemented IoT systems and are receiving a large number of machine readings from equipment at frequent intervals. If there is no clear plan for how these readings should be used, and they are simply stored in the vague hope of revealing transformational insights someday, it is quite likely that nothing tangible will ever come of them.
On the other hand, if IoT data collection is planned with specific questions and objectives in mind, then a smaller number of parameters can be collected. Analysts have a clear idea of what they are looking for, so they can apply the appropriate modeling techniques. In this case, big data is not better data.
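A sketch of that idea at the collection point: instead of storing every field a sensor emits, a gateway keeps only the parameters tied to one concrete objective. All field names here are invented for illustration, and the objective (predicting bearing failure) is an assumed example.

```python
# Fields assumed relevant to one specific objective: bearing-failure
# prediction. Everything else the sensor emits is discarded at the edge.
RELEVANT_FIELDS = {"machine_id", "timestamp", "vibration_mm_s", "bearing_temp_c"}

def to_smart_data(reading):
    """Project a full sensor payload down to objective-relevant fields."""
    return {k: v for k, v in reading.items() if k in RELEVANT_FIELDS}

full_reading = {
    "machine_id": "M-042",
    "timestamp": "2023-06-01T10:00:00Z",
    "vibration_mm_s": 4.7,
    "bearing_temp_c": 81.5,
    "cabinet_humidity": 40.2,    # unrelated to bearing failure
    "firmware_version": "2.1.9", # unrelated to bearing failure
}

smart = to_smart_data(full_reading)
print(smart)  # only the four objective-relevant fields remain
```

The design choice is that relevance is decided before storage, not after: the objective defines `RELEVANT_FIELDS`, and everything else never accumulates.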
If collection and storage are not limited to such “smart data”, and huge volumes of big data are gathered indiscriminately, then a large chunk of it becomes “dark data”: data that is stored but unlikely to ever be used.
There are other situations that require large data sets. For example, in public health, if the objective is to track the spread of a particular disease and stop it from spreading further, or in engineering, if the performance of a jet engine has to be studied, then big data analytics will prove to be extremely powerful. The key, then, is to use as large a dataset as appropriate for the specific objective, without drowning in data.
Machine learning algorithms need meaningful data
When we apply machine learning technology to uncover patterns, trends, and relationships or predict future events, we supply the machine learning algorithms with data. Larger data sets are not always more helpful in this situation. The algorithms need meaningful datasets which include those fields that are relevant for the specific objective. Feeding the algorithm with huge volumes of irrelevant data will not help build the right ML capabilities.
We see from the above examples that it is smarter, not bigger, data that is of critical importance. Huge volumes of data do not necessarily translate into business benefits if you don’t know what you’re looking for, aren’t looking at the right data, or aren’t using the right tools.
One solution is to create a data map: a clear record of what data you are collecting, from which sources, for what purposes, how it is stored, and who can access it. Businesses would be wise to focus on smart data that has a definite purpose and avoid investing time and resources in data that does not.
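A data map can start as nothing more than a structured record per dataset. The sketch below is a minimal illustration, with invented dataset names and values; a useful side effect is that datasets collected without a stated purpose stand out immediately as dark-data candidates.

```python
# Minimal data map: one entry per dataset, recording source, purpose,
# storage, and access. All names and values are illustrative only.
data_map = {
    "customer_orders": {
        "source": "e-commerce platform",
        "purpose": "monthly revenue forecasting",
        "storage": "warehouse, 24-month retention",
        "access": ["analytics", "finance"],
    },
    "raw_clickstream": {
        "source": "web tracker",
        "purpose": None,  # no defined purpose: a dark-data candidate
        "storage": "object store, unbounded retention",
        "access": ["analytics"],
    },
}

def dark_data_candidates(dm):
    """Flag datasets that are collected without a stated purpose."""
    return [name for name, meta in dm.items() if not meta["purpose"]]

print(dark_data_candidates(data_map))  # datasets with no purpose on record
```

Even this flat structure answers the questions in the paragraph above: what is collected, from where, why, how it is kept, and who can see it.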