The vast majority of the data generated in the real world is unstructured, and it is vital to furthering our understanding of the world. While the analysis of structured data can help us to know what is happening, it is unstructured data that may reveal why. Because unstructured data doesn’t fit neatly into the row and column structure of a data table, standard numerical and statistical analysis methods cannot be applied to it directly. Indeed, there are many challenges in identifying patterns, trends, and meaning in unstructured data.
How, then, can we analyze unstructured data? While processes and technologies to analyze unstructured data are fairly new and rapidly evolving, recent advances in machine learning and artificial intelligence are showing tremendous promise in this area.
But before you can start with analysis, you need to identify relevant data sources. While there are always multiple sources of data available, it’s important to use those that are most meaningful for your specific objectives. You may need to eliminate unnecessary data or noise, that is, anything that is not relevant to the objectives. You also need to identify suitable tools for data collection, cleansing, storage, processing, analysis, and presentation. You may choose to store data in a data lake, which allows unstructured data to be stored in its native format along with associated metadata.
Once these steps are in place, you can plan your data processing and analysis methodology. Let’s take a look at the best ways to analyze unstructured data.
1. Metadata
Metadata, the data that provides information about data, plays an important role in the management, storage, and analysis of unstructured data. In our previous post, How is Data Stored?, we saw that object storage systems are well suited to storing unstructured data. In these systems, each object contains the data itself, its metadata, and a unique identifier. Most file types have several metadata fields that can be filled in.
When you take a photograph using a digital camera or smartphone, each image has metadata associated with it, such as date, time, filename, and geolocation. Each blog post has metadata that includes title, author, URL, date of publishing, tags, category, etc. A webpage has metadata such as page title, URL, page description, and icon.
In addition to these standard fields, you can define additional custom metadata fields based on your requirements to indicate the nature or contents of the unstructured data. In this way, metadata can help to facilitate subsequent search and analysis.
As there is currently no single industry-wide metadata standard that covers every kind of unstructured data, each enterprise needs to define its own conventions. Using metadata effectively helps to organize data, automate its handling, enforce policies, and gain visibility into what is stored. While it is best to attach metadata when the data is created, that does not always happen, so metadata may have to be added later.
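To make the object model above more concrete, here is a minimal sketch of an object that bundles data, metadata, and a unique identifier, plus a simple search over custom metadata fields. The class name `StoredObject`, the helper `find_by_metadata`, and the field names and values are all hypothetical and chosen only for illustration; a real object store exposes its own API.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class StoredObject:
    """One stored item: the raw data, its metadata, and a unique identifier."""
    data: bytes
    metadata: dict
    object_id: str = field(default_factory=lambda: str(uuid4()))


def find_by_metadata(objects, **criteria):
    """Return the objects whose metadata matches every key/value in criteria."""
    return [
        obj for obj in objects
        if all(obj.metadata.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical objects: "project" is a custom field added on top of the
# standard ones (filename, date). The byte strings are placeholders.
store = [
    StoredObject(b"<jpeg bytes>", {"filename": "site_visit.jpg",
                                   "date": "2021-03-04",
                                   "project": "warehouse-audit"}),
    StoredObject(b"<pdf bytes>", {"filename": "contract.pdf",
                                  "date": "2021-03-04",
                                  "project": "vendor-onboarding"}),
    StoredObject(b"<jpeg bytes>", {"filename": "dock.jpg",
                                   "date": "2021-03-09",
                                   "project": "warehouse-audit"}),
]

# Search on the custom field without ever opening the underlying files.
audit_items = find_by_metadata(store, project="warehouse-audit")
print([obj.metadata["filename"] for obj in audit_items])
```

The point of the sketch is that the search never inspects the raw bytes: well-chosen metadata lets you narrow down unstructured content before any heavier analysis is run.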
2. Natural Language Processing (NLP)
Natural language processing (NLP) is a set of techniques, many of them based on machine learning, that help to analyze the meaning of unstructured text data. NLP aims to approximate the human ability to process natural languages such as English, Spanish, and Chinese. It can infer the meaning of text in context, based on semantics and grammatical relationships, even when documents do not follow a standard template.
Let us look at some of the techniques used in NLP to process unstructured text:
- Bag of words simply counts how often each word or phrase occurs in the text, without considering semantics or context.
- Tokenization breaks text into tokens, typically sentences and words. Punctuation is often discarded in the process.
- Stop words removal is the process of removing very common words, such as the article ‘the’ and the preposition ‘to’, from the text, as they add little value to the analysis. Most analysis starts with a basic stop words list, which is then augmented based on the specific objective.
- Stemming is an NLP process that removes affixes from a word, that is, prefixes added before the root or suffixes added after it. Stemming can help to group the different forms of a word together, for example, ‘guitar’ and ‘guitarist’.
- Lemmatization is the process of resolving words to their dictionary form, known as the ‘lemma’. For example, tenses are removed, so ‘teaching’ and ‘taught’ both become ‘teach’. Lemmatization on its own does not merge synonyms; mapping ‘home’ to ‘house’ would require a separate synonym-normalization step. Lemmatization takes into account the part of speech and context of a word, as the same word can have different meanings depending on where and how it’s used.
- Topic modeling is a text-mining tool that helps to uncover topics in the text and find clusters of words related to different topics. Words that appear frequently in a document give an indication of the topics it contains and of its meaning. For example, let’s assume that the words ‘economy’ and ‘GDP’ appear more frequently in a document about the economy, and ‘shipping’ and ‘logistics’ appear more frequently in a document about logistics. Under this simplification, a document that contains nine times as many logistics-related words as economy-related ones can be assumed to be 90% about logistics and 10% about the economy.
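Several of the techniques above can be sketched in a few lines of standard-library Python. The example below chains tokenization, stop words removal, and a bag of words, then reproduces the 90%/10% topic arithmetic. It is a simplified illustration, not a production pipeline: the stop words list is deliberately tiny, the keyword sets in `TOPIC_KEYWORDS` are hand-picked assumptions standing in for what a real topic model would learn from data, and stemming and lemmatization are left out because they normally rely on a dedicated NLP library.

```python
import re
from collections import Counter

# Tiny illustrative stop words list; real lists are much longer.
STOP_WORDS = {"the", "to", "a", "an", "of", "and", "in", "is", "on", "as"}

# Hypothetical, hand-picked keyword sets. A real topic model would
# discover word clusters from the data instead.
TOPIC_KEYWORDS = {
    "economy": {"economy", "gdp"},
    "logistics": {"shipping", "logistics"},
}


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop words list."""
    return [token for token in tokens if token not in STOP_WORDS]


def bag_of_words(tokens):
    """Count occurrences of each token, ignoring order and context."""
    return Counter(tokens)


def topic_shares(bag):
    """Share of topic-keyword hits that belongs to each topic."""
    hits = {
        topic: sum(bag[word] for word in words)
        for topic, words in TOPIC_KEYWORDS.items()
    }
    total = sum(hits.values())
    if total == 0:
        return {topic: 0.0 for topic in hits}
    return {topic: count / total for topic, count in hits.items()}


# Made-up text with nine logistics keywords and one economy keyword.
DOCUMENT = (
    "Shipping delays hurt logistics. Logistics teams rerouted shipping, "
    "and shipping costs rose. Shipping and logistics firms expanded "
    "logistics hubs and shipping lanes as GDP grew."
)

bag = bag_of_words(remove_stop_words(tokenize(DOCUMENT)))
shares = topic_shares(bag)
print(bag.most_common(3))
print(shares)  # {'economy': 0.1, 'logistics': 0.9}
```

Counting keyword hits is only the intuition behind topic modeling. Real approaches estimate both the word clusters and the per-document topic mix statistically rather than from fixed keyword lists.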
The following are some areas in which NLP is proving to be extremely helpful in analyzing unstructured text data:
- Healthcare: NLP is helping to recognize and predict medical conditions based on electronic health records and analysis of patient conversations. It can extract medical conditions, medications, and treatment outcomes from multiple unstructured data sources such as notes written by healthcare professionals, medical reports and electronic health records.
- Sentiment Analysis: Brand managers need to know what people are saying about their brand on social media. Given the huge volume of unstructured, noisy information in social posts, it is challenging to understand whether the dominant theme is positive or negative. NLP helps to conduct this sentiment analysis and provides insights into consumer behavior and decision drivers.
- Financial analysis: Financial analysts and traders can make informed decisions by analyzing news about companies. NLP is used to track news, such as reports of possible mergers and financial results, and the resulting analysis can be incorporated into trading algorithms.
- Recruitment: NLP can help to extract information about different candidates from their cover letters, emails and resumes that are in widely diverse formats.
- Legal analysis: NLP is a powerful method to find specific information and derive meaning from legal documents, such as contracts or court orders, which are tedious to read and process manually.
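As a rough illustration of the sentiment analysis use case, the sketch below labels a handful of posts by counting positive and negative words and then reports the dominant label. This lexicon-counting approach is a deliberately simple stand-in: production sentiment analysis generally uses trained models that handle negation, sarcasm, and context. The word lists, the sample posts, and the function names are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical mini-lexicons; real sentiment lexicons are far larger.
POSITIVE = {"love", "great", "excellent", "happy", "recommend"}
NEGATIVE = {"hate", "terrible", "broken", "slow", "refund"}


def sentiment_score(post):
    """Positive-word count minus negative-word count for one post."""
    tokens = re.findall(r"[a-z']+", post.lower())
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


def sentiment_label(score):
    """Map a numeric score to a coarse label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up social posts about a fictional product.
posts = [
    "Love the new headphones, great sound!",
    "Terrible battery and the app is slow.",
    "Arrived on Tuesday.",
    "Excellent value, would recommend.",
]

labels = [sentiment_label(sentiment_score(post)) for post in posts]
dominant_label, dominant_count = Counter(labels).most_common(1)[0]
print(labels)
print(dominant_label, dominant_count)
```

Aggregating the labels, as in the last step, is what lets a brand manager see the dominant theme across many posts rather than reading them one by one.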
3. Image Analysis
Having discussed unstructured text data, we can now turn our attention to images. Images contain unstructured information that we need to understand. For example, medical conditions can be diagnosed by analyzing X-rays or MRI images.
Systems have now been developed that can retrieve images based on their content, such as MRI images that match a certain brain volume, or X-rays of the spine that match a given spine image. An input image is provided, and feature extraction and similarity matching techniques are then used to identify similar images.
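The retrieval idea described above, extracting features from an input image and matching them against a library, can be sketched as follows. This is a toy version under stated assumptions: images are small grayscale grids of integers from 0 to 255, the extracted "features" are just a normalized intensity histogram, and similarity is cosine similarity. Real medical image retrieval relies on far richer features, often learned by neural networks. The image names and pixel values are invented for the example.

```python
import math


def histogram_features(image, bins=4, max_value=255):
    """Feature extraction: normalized intensity histogram of a grayscale image.

    `image` is a list of rows, each a list of ints in 0..max_value.
    """
    counts = [0] * bins
    for row in image:
        for pixel in row:
            index = min(pixel * bins // (max_value + 1), bins - 1)
            counts[index] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * bins
    return [count / total for count in counts]


def cosine_similarity(a, b):
    """Similarity matching: cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query_image, library):
    """Return the name of the library image closest to the query image."""
    query_features = histogram_features(query_image)
    return max(
        library,
        key=lambda name: cosine_similarity(
            query_features, histogram_features(library[name])
        ),
    )


# Hypothetical 2x2 "scans"; real images would have many more pixels.
library = {
    "dark_scan": [[10, 20], [30, 40]],
    "bright_scan": [[200, 220], [240, 250]],
    "mixed_scan": [[10, 250], [20, 240]],
}
query = [[15, 25], [35, 60]]

print(most_similar(query, library))  # dark_scan
```

An intensity histogram ignores where pixels are located, so it cannot tell two differently shaped structures apart if their brightness distribution is the same. It is used here only because it keeps the two steps, feature extraction and similarity matching, easy to see.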
AI-based image analysis is a key enabler of autonomous vehicles, as it allows them to identify objects on the road and determine their location.
Image analysis also has the potential to be used by brand managers to identify their logos or products in social media posts. This provides insights into the situations in which products are used, as well as the ROI of specific investments, such as event sponsorship.
Optical character recognition (OCR) technologies convert the text in image files into text data that can be read and processed.
4. Data Visualization
Data visualization is the graphical representation of data in a way that promotes easier understanding. Data presented using visualization techniques communicates visually with viewers, enabling them to engage and discover insights for themselves. Data visualization can reveal intricate structure in data that is hard to appreciate in any other way, and it helps a wider range of people make sense of the data.
Visualization techniques can be used to highlight entities such as people, companies or cities that appear in the data. Visualization can also reveal topics or keywords, help to identify concepts, and present the results of sentiment analysis.
Unstructured data that has been analyzed using techniques such as topic modeling or sentiment analysis can be presented using the visualization techniques that best suit the application and the consumers of the information.
We have looked at some of the methods and technologies that can be used to process and analyze unstructured data. As this is an area where a lot of research is currently underway, the power of machine learning and artificial intelligence will undoubtedly improve over time. You can choose the process that suits you best, depending on your end goals and your chosen technology stack.