To extract meaning from data, we have to organize, store and analyze it effectively. That’s why it’s important to know the nature of the data, specifically whether it’s structured or unstructured. Structured data fits into predefined fields and can be organized into a spreadsheet or a relational database, while unstructured data is heterogeneous and does not map to standard fields.
There is a third type of data, however, that falls somewhere between structured and unstructured: semi-structured data. Semi-structured data does not have a rigid schema, so it does not conform to a data table or relational database structure. However, it carries classifying characteristics, such as internal semantic tags, metadata or other markers, that enable analysis.
Metadata or other markers associated with semi-structured data make it possible to separate semantic elements and create hierarchies of records and fields. For this reason, semi-structured data is also called self-describing.
What is the nature of semi-structured data, how is it created and used, and what are the challenges associated with it? We look at five key things we need to know about semi-structured data in order to process and use it.
1. What is metadata?
Metadata is ‘data about data’. It’s a small portion of a file that contains data about the contents of the file. This may include how the data was created, its purpose, author, file size, length, sender, recipient, etc. Metadata enables semi-structured data to be cataloged, searched, queried and analyzed.
As an example, let’s consider an image file. It consists mainly of unstructured data: a large number of pixels. However, the metadata associated with an image makes it semi-structured data. This metadata can indicate the subject, the time of creation, the name of the creator, a description, and so on, enabling categorization and analysis.
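To make the image example concrete, here is a minimal Python sketch. The ImageFile class, its field names and the sample values are all hypothetical and chosen only for illustration; real image formats store metadata in their own ways. The point is simply that the pixel payload stays opaque while the metadata is what gets searched.

```python
from dataclasses import dataclass, field


@dataclass
class ImageFile:
    """Hypothetical container: opaque pixel bytes plus descriptive metadata."""
    pixels: bytes                                  # unstructured payload
    metadata: dict = field(default_factory=dict)   # 'data about data'


def find_by_creator(images, creator):
    """Return images whose metadata names the given creator.

    Only the metadata is inspected; the pixel bytes are never read.
    """
    return [img for img in images if img.metadata.get("creator") == creator]


# Sample data (invented). Note that the metadata keys differ per image.
images = [
    ImageFile(b"\x00\x01\x02", {"subject": "harbor", "creator": "Ana", "created": "2021-05-01"}),
    ImageFile(b"\x03\x04", {"subject": "forest", "creator": "Ben"}),
    ImageFile(b"\x05"),  # no metadata at all: nothing to search on
]

subjects = [img.metadata["subject"] for img in find_by_creator(images, "Ana")]
print(subjects)
```

The third image shows the other side of the same idea: without metadata, an image cannot be found by any query of this kind.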
2. Sources of semi-structured data
How does semi-structured data get created? A few examples of semi-structured data sources are emails, XML and other markup languages, binary executables, TCP/IP packets, zipped files, data integrated from different sources, and web pages. The growing volume of semi-structured data is partly due to the growing presence of the web, as well as the need for flexible formats for data exchange between disparate databases. In addition, certain scientific databases that require a broader mix of structural and text data, along with annotations and attribute extensibility, also create this kind of data.
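Email is a convenient illustration of one of these sources: the headers are tagged fields, while the body is free text. The sketch below uses Python's standard email module on an invented message; the addresses and the text are placeholders.

```python
from email import message_from_string

# An invented message: structured headers, a blank line, then a free-text body.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Quarterly report\n"
    "\n"
    "Hi Bob, the numbers look good this quarter.\n"
)

msg = message_from_string(raw)

# The headers behave like named fields and can be queried directly.
headers = {name: msg[name] for name in ("From", "To", "Subject")}

# The body is unstructured text with no predefined fields.
body = msg.get_payload()

print(headers)
print(body)
```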
Semi-structured data is created wherever application data does not have a rigid, predefined schema. Any schema that does exist may be descriptive, partial, evolving, and very large.
3. The nature of semi-structured data
Let’s take a look at the typical nature of semi-structured data. It is organized into semantic entities, and similar entities are grouped together. Entities in the same group do not need to have the same attributes. The order of attributes is not necessarily important, and not all attributes are required. The size and type of the same attribute may differ from one entity to another within a group.
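These properties are easy to see in a small JSON example. The records below are invented for illustration: all three belong to the same group, yet they differ in which attributes are present, in attribute order, and in the type of the email attribute.

```python
import json

# Three entities of the same kind with differing attributes (sample data).
raw = """
[
  {"type": "person", "name": "Ana", "email": "ana@example.com"},
  {"type": "person", "email": ["ben@example.com", "b@example.org"], "name": "Ben", "phone": "555-0100"},
  {"type": "person", "name": "Chen"}
]
"""
people = json.loads(raw)

# Every attribute that appears on at least one entity.
all_attrs = sorted({key for person in people for key in person})

# Attributes that appear on every entity.
shared_attrs = sorted(set.intersection(*(set(person) for person in people)))

# The same attribute can hold different types in different entities.
email_types = sorted({type(p["email"]).__name__ for p in people if "email" in p})

print(all_attrs)
print(shared_attrs)
print(email_types)
```

Any code that consumes such data has to allow for missing attributes and varying types, which is why the lookups above check for the key before using it.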
There are different ways to extract information from semi-structured data. Graph-based models, such as the Object Exchange Model (OEM), can be used to index the data. OEM data modeling techniques store the data as a graph, which makes it easier to search and index. Another option is XML, which allows hierarchies to be created and facilitates indexing and search. Data mining tools are also used to extract information from semi-structured data.
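As a rough illustration of the XML option, the sketch below uses Python's standard xml.etree.ElementTree module to search a small hierarchy. The document, its element names and its values are invented; the second book deliberately lacks an author element to show that the hierarchy does not have to be uniform.

```python
import xml.etree.ElementTree as ET

# Invented sample document: two books with different sets of child elements.
doc = """
<library>
  <book id="b1">
    <title>First Title</title>
    <author>Ana Ruiz</author>
    <author>Ben Ode</author>
  </book>
  <book id="b2">
    <title>Untitled Draft</title>
  </book>
</library>
"""
root = ET.fromstring(doc)

# Build a simple index from each book's id attribute to its title.
titles_by_id = {book.get("id"): book.findtext("title") for book in root.findall("book")}

# Path expressions follow the hierarchy and can filter on attributes.
authors_b1 = [a.text for a in root.findall("./book[@id='b1']/author")]

# Missing elements are found by checking, not by assuming a fixed schema.
books_without_author = [b.get("id") for b in root.findall("book") if b.find("author") is None]

print(titles_by_id)
print(authors_b1)
print(books_without_author)
```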
4. How you can use semi-structured data
The use of semi-structured data enables us to integrate data from various sources or exchange data between different systems. Applications and systems need to evolve with time, which is difficult if we work purely with structured data. Let’s consider web forms. You may want to modify forms and capture different data for different users. If you are using a traditional relational database, the database schema typically needs to be changed each time a new field is needed, and every row then carries that column whether or not it applies. Semi-structured data allows you to capture data in varying structures without making changes to the database schema or code. Adding or removing fields does not impact functionality or dependencies.
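One common way to get this flexibility, sketched below, is to store each form submission as a JSON document in a single text column. This is only one possible approach and is not prescribed by anything above; the table name, the save helper and the form fields are hypothetical. The sketch uses Python's standard sqlite3 and json modules with an in-memory database.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# A single generic table: the form fields live inside the JSON payload.
conn.execute("CREATE TABLE submissions (id INTEGER PRIMARY KEY, payload TEXT NOT NULL)")


def save(form_data):
    """Store one form submission, whatever fields it happens to contain."""
    conn.execute("INSERT INTO submissions (payload) VALUES (?)", (json.dumps(form_data),))


# The original form captured a name and an email address.
save({"name": "Ana", "email": "ana@example.com"})

# A later version of the form adds new fields; no ALTER TABLE is needed.
save({"name": "Ben", "company": "Acme", "newsletter": True})

rows = [json.loads(payload) for (payload,) in conn.execute(
    "SELECT payload FROM submissions ORDER BY id")]

# Readers must tolerate fields that older submissions do not have.
companies = [row["company"] for row in rows if "company" in row]

print(rows)
print(companies)
```

The trade-off is that the database no longer enforces field names or types, so that checking moves into the application code.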
When you work with semi-structured data, you get a flexible representation, and you often do not need configuration or code changes when the data evolves over time. Data from multiple sources with differences in notation and meaning can be collected and used. Relationships are described as references, or child objects are nested completely inside their parent objects to form a tree. Semi-structured data makes it possible to support complex queries over varied data structures and storage formats, while preserving the relationships between objects and complex schemas. Queries and reporting across many systems and data types become possible.
5. Challenges of handling semi-structured data
While semi-structured data increases flexibility, the lack of a fixed schema also creates storage and indexing challenges. Because the schema and the data are tightly coupled and interdependent, a single query may update both. Running queries is also more challenging than it is against a fixed schema. OEM and XML formats help to store and exchange semi-structured data, and can overcome some of these challenges.
As the volume of semi-structured data continues to grow, new ways to manage, collate, integrate, store and analyze it will evolve. Semi-structured data can help us to capture and process data as it really is, without forcing it into an unnatural structure. Knowledge about the nature of semi-structured data and ways to use it is extremely important considering the growing volume of this kind of data.