The various kinds of data that are generated through a variety of sources at practically every moment are categorized as structured, unstructured or semi-structured. In our previous post, we saw that unstructured data is usually complex and heterogeneous and cannot be mapped to a predefined structure such as a data table or a relational database.
Let’s take a look at 8 examples of unstructured data to better understand the source, character, and importance of each.
- Medical records: Healthcare generates large volumes of machine-generated as well as human-generated unstructured data. Machine-generated data includes data collected by medical imaging devices such as endoscopes, laparoscopes, thoracoscopes, surgery robots, and emergency video cameras, as well as biosignal data from patient monitors in operating theaters and intensive care units. Wearable health monitoring devices generate a plethora of data, too. Human-generated data includes conversations between patients and healthcare professionals that are recorded as text or as audio files.
While medical data is growing exponentially, the bulk of it remains unused, mainly because healthcare-related information systems are not equipped to process unstructured data. If the capabilities to interpret, analyze and utilize this unstructured medical data were available, the benefits would be tremendous, both for patient treatment and for public health management and medical research.
Currently, there is considerable interest in applying artificial intelligence to improve diagnostics, patient care, public health and pharmaceutical research. The success of AI systems will depend on the availability and quality of data, making the collection, anonymization, and cleansing of data very important. Standardized metadata for each type of unstructured medical data would also be useful and enable data integration.
- Social media: Social media has become an intrinsic part of the lifestyle for billions of people around the world, and for many, it is the preferred channel when it comes to viewing, creating, or sharing information. Social media is also used by businesses, governments, and organizations across domains such as shopping, entertainment, education, crisis management, and politics. Social media platforms generate data at every moment, round-the-clock, all over the world. This has led to a huge proliferation of data that could be in the form of text, images, videos, audio, or geo-locations.
Both structured and unstructured types of data are created from social media use. The text in a social media post is unstructured data, while information about friendships, followers, groups or networks is structured. The full spectrum of social media data has enormous potential for providing rich insights into perceptions, behavior, trends, influencers, events, news and more.
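To make the distinction concrete, here is a minimal Python sketch of a single post. The `Post` class, its field names, and the sample values are hypothetical and do not come from any platform's API; the point is only that some fields have a fixed shape while the text does not.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Post:
    # Structured fields: fixed names and types that map directly to table columns.
    author: str
    follower_count: int
    # Unstructured field: free text with no predefined schema.
    text: str


def extract_hashtags(text: str) -> List[str]:
    """Pull one small piece of structure (hashtags) out of free text."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", text)]


# Hypothetical sample record.
post = Post(
    author="example_user",
    follower_count=120,
    text="Loving the new phone! #Tech #gadgets",
)
hashtags = extract_hashtags(post.text)
print(hashtags)  # ['tech', 'gadgets']
```

Hashtags are an easy case because users mark them explicitly. Recovering opinions or topics from the rest of the text requires the analysis techniques discussed in the following paragraphs.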
In order to utilize social media data, we can extract it using tools that communicate with the APIs of the social media platform. When we seek to harness social media data, we typically encounter a set of challenges that are called the ‘4 V’s’.
Volume: The computing power and storage capabilities required.
Velocity: The speed of data creation and the need for real-time analysis.
Variety: The many different types of data.
Veracity: The need to verify the data quality.
A variety of techniques have been developed in order to work with social media data, such as topic discovery methods and event detection algorithms. Software systems need to have an appropriate architecture to handle this type of data. Once this data is effectively analyzed, brands are able to apply the insights gained to offer targeted and personalized products, services, and deals to consumers.
- Business documents: The multitude of documents that are used to conduct business, such as emails, presentations, and reports, contain data in the form of text, images, numbers or video and are unstructured. These documents form important knowledge repositories within the organization, but currently they are mostly underutilized because they cannot be mapped to structured information systems.
Let’s consider some ways in which business documents could be analyzed and utilized. A bank may find qualitative information that is relevant to understanding a borrower’s creditworthiness from emails, or a legal team may be able to quickly find certain ‘red flags’ in a draft contract, or a recruiter can analyze and process applicant resumes more easily.
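As a rough illustration of the contract example, the sketch below scans a draft for a fixed list of phrases and returns the sentences that contain them. The phrase list, the function name `find_red_flags`, and the sample draft are all hypothetical; a real system would more likely rely on NLP models than on exact phrase matching.

```python
import re

# Hypothetical red-flag phrases; a real legal team would supply its own list.
RED_FLAGS = ["unlimited liability", "automatic renewal", "non-compete"]


def find_red_flags(text, phrases=RED_FLAGS):
    """Return (phrase, sentence) pairs for every sentence containing a phrase."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        for phrase in phrases:
            if phrase in lowered:
                hits.append((phrase, sentence))
    return hits


# Hypothetical draft contract text.
draft = (
    "The supplier accepts unlimited liability for defects. "
    "Payment is due in 30 days. "
    "This agreement is subject to automatic renewal each year."
)

for phrase, sentence in find_red_flags(draft):
    print(f"{phrase}: {sentence}")
```

Exact matching misses paraphrases such as "liability without limit", which is one reason the techniques described next combine NLP with machine learning.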
Some of the techniques that are being used or explored for document analysis are digital imaging technologies, pattern recognition to interpret image and video content, and document understanding, which combines natural language processing (NLP) and machine learning (ML) to help understand unstructured natural language text. These techniques help to understand the contents of large volumes of documents that cannot be effectively processed manually.
- Images, video and audio media content: The media and entertainment industry, surveillance systems, professional publishers and even individuals are constantly creating image, video and audio content. These media files are often stored in structured databases, but such databases do not process or understand the actual contents of the media files, which are in the form of unstructured data.
The ability to interpret and understand media, often in real-time, has far-reaching implications for governance, business, and healthcare. Some examples are: an analysis of 911 call records could aid criminal investigations, CCTV camera footage could help to prevent or detect incidents, identifying the persons in a video could be useful in news reporting, and videos of shoppers could help retailers to understand their movements and shopping patterns.
Considering the huge volume of data involved, analyzing the content of media files manually is a daunting task, which is why automation solutions are currently being developed. For example, speech-to-text technology can transcribe audio files, and natural language processing can then analyze the resulting text, for instance to perform sentiment analysis. Metatags are also helpful for classifying media files and performing search operations.
- Communications – live chat, messaging and web meetings: Today, professional as well as personal discussions take place across a variety of communications platforms. Popular apps such as WhatsApp, web conferencing platforms such as Zoom or Skype, and collaboration tools such as Slack are some of the places where data is being created in the form of unstructured audio and text. Consider an organization where employees are speaking with customers and vendors across multiple communication platforms. In order to get a unified view of a particular customer, there is a need not only to integrate unstructured data created on different platforms, but also to standardize and interpret it.
Customer sales or service calls can be stored, categorized, transcribed and analyzed to find meaning. A speech recognition program converts voice to text, and emotion detection capabilities track the tone of the call through changes in the customer’s speaking speed, pitch, and volume. Natural language processing helps to identify key themes, products, and sentiments, equipping the organization to improve the customer experience, retain customers and enhance sales.
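The text-analysis step can be sketched in a few lines of Python once a call has been transcribed. This is a deliberately simple word-list approach, not a real NLP model: the word lists, the theme names, and the function `analyze_transcript` are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical word lists for illustration only.
POSITIVE = {"great", "thanks", "happy", "helpful"}
NEGATIVE = {"late", "broken", "refund", "angry"}
THEMES = {
    "delivery": {"delivery", "shipping", "late"},
    "billing": {"invoice", "refund", "charge"},
}


def analyze_transcript(text):
    """Return a crude sentiment score and the themes mentioned in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Score = number of positive words minus number of negative words.
    score = sum(counts[w] for w in POSITIVE) - sum(counts[w] for w in NEGATIVE)
    # A theme counts as mentioned if any word from its vocabulary appears.
    themes = sorted(
        name for name, vocab in THEMES.items() if any(counts[w] for w in vocab)
    )
    return {"sentiment_score": score, "themes": themes}


# Hypothetical transcript of a customer call.
transcript = "My delivery was late and the item arrived broken. I want a refund."
result = analyze_transcript(transcript)
print(result)  # {'sentiment_score': -3, 'themes': ['billing', 'delivery']}
```

A word-list score ignores negation and context ("not happy" would count as positive), so it should be read as a starting point rather than a substitute for the NLP techniques described above.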
An increasing number of websites and apps are offering visitors a live-chat functionality. Chat conversation transcripts are a treasure trove of market intelligence if analyzed correctly. This is where data visualization tools can play a role in helping discover key themes. Chat data gathered over time helps to understand trends, for example whether a topic is becoming hotter or cooler by the day. This knowledge can go a long way in building deeper relationships with customers.
- Survey responses: A questionnaire to conduct market research or measure employee engagement typically includes multiple-choice and open-ended questions. Responses to such open-ended questions are unstructured text. These questions are important as they help the researcher discover aspects they may not have considered, gain a deeper understanding, and build a connection with the respondent.
However, because the responses to open-ended questions are unstructured data, they can be challenging to process and interpret correctly. Before technology tools were available, researchers used to classify the responses by grouping similar responses into ‘themes’. This was a cumbersome process and did not always facilitate an accurate interpretation of the responses. This has now changed, as we have text analytics technology to interpret unstructured open-ended responses. By applying automation that uses natural language processing and AI, we are now better equipped to analyze open-ended responses with ease.
- Publications and listings: Publications, directories, and portals publish a variety of content that is in the form of unstructured data. These include news stories, job listings, movie reviews, real estate listings, restaurant reviews, resume databases, invitations to bid for contracts, etc. Each of these includes text or image information that is unstructured. As this is derived information rather than raw data that you already hold, you need to extract and store it before you can analyze it.
Data from publications and listings could unlock powerful business opportunities. An investor can make more informed investment decisions, or an organization can make strategic plans, based on news analysis. Once data has been captured, stored, and analyzed with the help of NLP and AI, an organization can use whatever is most relevant and valuable to it.
- Webpages: All webpages contain content in the form of text, images, videos, forms and functionalities that are unstructured data. While a webpage is rendered from HTML, the markup does not capture the meaning of the content on the page. Businesses and individuals may want to know and analyze the content data as webpages reveal important information about opportunities, customer behavior, competitors and more. When extracting website data, it is important to capture the date and time, as websites are dynamic and may change often.
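The advice about capturing the date and time can be illustrated with Python's standard library. The sketch below pulls the visible text out of an HTML string and stores it with a UTC timestamp. The `TextExtractor` class, the `snapshot` function, the sample page, and the URL are hypothetical; fetching the page over the network is left out so that the example stays self-contained.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def snapshot(html, url):
    """Return the page text together with the time it was captured."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "text": " ".join(parser.parts),
    }


# Hypothetical page content and URL.
page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Sale</h1><p>20% off today</p></body></html>"
)
record = snapshot(page, "https://example.com/offers")
print(record["text"])  # Sale 20% off today
```

Storing `captured_at` with every snapshot makes it possible to compare versions of the same page later, which matters because the offer shown today may be gone tomorrow.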
As we saw from the above examples, because unstructured data touches practically all aspects of business, we are constantly dealing with an increasingly large number of unstructured data sources and volumes. Being equipped to analyze unstructured data and apply the learnings to business is key, as it can yield critical strategic and operational insights that help to make informed decisions.