Did you know there is more than one type of data? We examine structured vs. unstructured data in our article.
Data enables understanding, analysis, and informed decision making, whether in business, in government, or for individuals. Traditionally, we have organized and standardized data and mapped it to predefined fields, and mature systems and technologies exist to store and analyze such structured data. In reality, though, much data is unstructured: it cannot be contained in tables, but it would yield rich insights if we found a way to work with it.
It’s important for us to know the nature of both structured and unstructured data, in order to get the most meaningful information from both. Let’s start by looking at structured data.
Structured data:
Data that fits into predefined fields is called structured data. It’s often organized by rows and columns, as you will find in a spreadsheet. Larger systems use relational databases to organize structured data that has multiple dimensions. Structured data is convenient to store, analyze and report.
As an example, consider a business's sales report that shows sales revenue by region, by product, by month, and so on. Customer databases, financial reports, economic data, health records, and even educational records are all examples of structured data. Until recently, most of the data we analyzed was structured data.
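To make the idea concrete, here is a minimal Python sketch of a sales report like the one described above. The records, field names, and figures are invented for illustration. Because every row has the same predefined fields, summing by any one of them takes only a few lines.

```python
from collections import defaultdict

# Hypothetical sales records: every row has the same predefined fields.
sales = [
    {"region": "North", "product": "Widget", "month": "2024-01", "revenue": 1200.0},
    {"region": "North", "product": "Gadget", "month": "2024-01", "revenue": 800.0},
    {"region": "South", "product": "Widget", "month": "2024-01", "revenue": 950.0},
    {"region": "South", "product": "Widget", "month": "2024-02", "revenue": 1100.0},
]


def revenue_by(records, field):
    """Sum revenue grouped by one of the predefined fields."""
    totals = defaultdict(float)
    for row in records:
        totals[row[field]] += row["revenue"]
    return dict(totals)


print(revenue_by(sales, "region"))
print(revenue_by(sales, "product"))
```

The same function works for region, product, or month without modification, which is the convenience that a fixed structure buys.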
If structured data has worked well for us until now, why are we working more and more with messy, unstructured data? There are two reasons. The first is that most data in the world is unstructured; some studies suggest that 80% to 90% of all data is. The other reason is that advanced analytics technologies now make it possible for us to process and gain insights from unstructured data. So let’s look at what unstructured data is.
Unstructured data:
Data that is complex or heterogeneous and cannot be fit into standard fields is unstructured data.
As an example, suppose you were analyzing social media posts from your city about eating out. The data may take the form of photos on Instagram, videos on Facebook, and the text of posts on all platforms.
Examples of unstructured data are photos, video and audio files, social media content, satellite imagery, presentations, PDFs, open-ended survey responses, websites, data from IoT devices, mobile data, weather data, and conversation transcripts.
Unstructured data may be text, such as the content of emails, or binary media such as image, video, or audio files.
Unstructured data is usually stored in a data lake, a storage repository that holds a large amount of raw data in its native format. To manage unstructured data, NoSQL databases are often used in place of relational databases, as they can handle data variety and large volumes of data.
Once we examine structured and unstructured data, we realize that there is a third kind: semi-structured data.
As an example, consider the business of retailing books. Traditionally, a book retailer’s analysis of business operations would have been based solely on structured data, such as sales figures: numbers sold by title, author, publisher, month, region, and so on. Today, an online bookstore collects far more types and volumes of data, both in aggregate and about individual customers. For a particular title, the online store knows how many people have searched for it, how many have reviewed it, the ratings they gave, and the reviews they posted. If the store wants to analyze reader feedback, the rating given is structured, and the text written in the review is unstructured.
Data like this, where structured fields sit alongside unstructured content, is called semi-structured, so let’s look at what that means.
Semi-structured data:
Data that does not have a rigid structure, but has some defining and consistent characteristics, is called semi-structured data.
As an example, consider emails. While the text content of an email is unstructured, each email does have standard fields, such as date, sender name, recipient name, subject line, and so on.
Semi-structured data has some characteristics of both structured and unstructured data.
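The email example can be sketched with Python's standard `email` package. The message below is made up for illustration. The headers parse into named fields with predictable meanings, while the body remains free text that needs other techniques to interpret.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# A made-up message, used only for illustration.
raw = """\
From: Alice Example <alice@example.com>
To: Bob Example <bob@example.com>
Subject: Dinner on Friday
Date: Mon, 01 Jan 2024 10:00:00 +0000

Hi Bob, the new place downtown was great. Shall we go again?
"""

msg = message_from_string(raw)

# Structured part: standard header fields with predictable names.
fields = {name: msg[name] for name in ("From", "To", "Subject", "Date")}
sent = parsedate_to_datetime(fields["Date"])

# Unstructured part: free text with no predefined schema.
body = msg.get_payload().strip()

print(fields["Subject"], sent.year)
print(body)
```

The header fields could be loaded straight into a table, one column per field. The body could not, at least not without first deciding what to extract from it.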
Unstructured data within enterprises is estimated to be growing much faster than structured data. So businesses that do not design ways to collate, store and analyze unstructured data may be missing out on valuable business intelligence and losing a competitive edge.
According to Harvard Business Review, cross-industry studies show that on average, less than half of an organization’s structured data is used in making decisions and less than 1% of its unstructured data is analyzed or used.
Unstructured data has seen such low utilization because, until recently, we had no practical way to handle and analyze it, and mostly disregarded it. A customer satisfaction survey would be designed as a multiple-choice questionnaire, as there was no way to analyze free-text responses to open-ended questions. Today, the technology exists to process such qualitative, unstructured data and extract meaning from it.
Advanced analytics using natural language processing can help to glean the meaning and tone of social media posts or survey responses, a technique known as sentiment analysis when it focuses on tone. Artificial intelligence (AI) pattern recognition algorithms can sort images. A variety of rich insights becomes possible when we work with unstructured data.
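As a deliberately simplified illustration of scoring tone in open-ended survey responses, the Python sketch below counts words from two small hand-picked lists. The word lists and the responses are assumptions made up for this example. Real sentiment analysis relies on trained language models rather than word counting, so treat this as a toy that shows only the shape of the task: free text goes in, and a number that can sit in a table comes out.

```python
import re

# Tiny hand-picked word lists, assumed for this example only.
POSITIVE = {"great", "good", "love", "friendly", "fast"}
NEGATIVE = {"bad", "slow", "rude", "cold", "terrible"}


def sentiment_score(text):
    """Return (positive word count) minus (negative word count)."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


# Made-up answers to an open-ended survey question.
responses = [
    "The staff were friendly and the food was great.",
    "Service was slow and my soup arrived cold.",
    "It was fine.",
]

scores = [sentiment_score(response) for response in responses]
print(scores)
```

A word-count approach like this misses negation ("not great"), sarcasm, and context, which is why practical systems use more capable natural language processing.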
If you offer an online service, such as a travel or eCommerce portal, then data is generated each time anyone interacts with your portal in any way: clicking on an ad, landing on a page, chatting in the live chat window, searching on your site, making a booking or buying a product, and beyond. Spreadsheets cannot capture, analyze, and report all these different types of data. It is a mix of structured data, such as the number of visitors to the site, and unstructured data, such as queries typed into the live chat window. Data science experts and systems that can handle such diverse data sets, and that can juxtapose structured data with unstructured data to discover meaning, will be valuable business assets.
We will be seeing more powerful applications of analytics on structured, unstructured and semi-structured data as organizations get better equipped to handle all kinds of data. Businesses that can generate the right interpretations by overlaying structured and unstructured data will have a substantial edge over others. A unified view that integrates both types of data will deliver clarity and strategic advantage.