What makes big data ‘big’, and how does it differ from our traditional understanding of data? The difference is not really one of size. In other words, there is no clear demarcation: you can’t say that data larger than “x” size becomes big data.
Although we have been storing and processing data for decades, the rate of data generation has accelerated substantially, particularly in recent years. As new technologies enabled us to create and manage the exponential growth, availability, and usage of structured and unstructured data, we began to call it ‘big data’.
Big data is a term for data that arrives in such high volumes, in such heterogeneous formats (including unstructured ones), and grows so rapidly that traditional tools and approaches cannot handle, process, analyze, or present it. The term also covers the processes, tools, and techniques used to derive insights from such data. Ultimately, big data refers to the massive volumes of data that now enable you to solve business problems that traditional data could not tackle.
As Gartner defined it in 2001: ‘Big data is data that contains greater variety arriving in increasing volumes and with ever-higher velocity.’
Big data differs from traditionally used data in its objectives, plans, processes, and tools. Let’s now turn to the characteristics that differentiate big data from traditional data.
- Flexibility
A traditional database is based on a fixed, static schema. It can only work with structured data that fits neatly into relational tables. In reality, most data is unstructured. The extensive variety of unstructured data requires new methods to store and process it. Examples include movies and sound files, images, documents, geolocation data, text, weblogs, strings, and web content.
Big data uses a dynamic schema that can include structured as well as unstructured data. The data is stored in a raw form and the schema is applied only when accessing it.
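This “schema on read” idea can be sketched in a few lines of Python. The raw documents are stored exactly as they arrive, and a schema (here, just a set of required fields) is applied only when the data is accessed; the field names are hypothetical and chosen purely for illustration.

```python
import json

# Raw events are stored as-is, one JSON document per line (schema on read).
raw_events = [
    '{"user": "a1", "action": "click", "ts": 1700000000}',
    '{"user": "b2", "action": "view"}',  # missing "ts" is fine in raw storage
]

def read_with_schema(lines, required_fields):
    """Apply a schema only at access time: keep records that satisfy it."""
    records = []
    for line in lines:
        doc = json.loads(line)
        if all(field in doc for field in required_fields):
            records.append(doc)
    return records

# Only now, at read time, do we insist on a particular shape.
clicks = read_with_schema(raw_events, required_fields=("user", "action", "ts"))
```

Nothing is rejected at write time; a different query tomorrow can apply a different schema to the very same raw store.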
For big data analytics, datasets from diverse sources are appended, then functions such as storing, cleansing, distributing, indexing, transforming, searching, accessing, analyzing, and visualizing are performed.
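A toy version of that pipeline, with hypothetical sensor readings standing in for real sources, might look like this: datasets are appended, cleansed, transformed, and then analyzed.

```python
# Two hypothetical source datasets of temperature readings.
source_a = [{"temp_c": 21.5}, {"temp_c": None}]  # None marks a bad reading
source_b = [{"temp_c": 19.0}]

# Append datasets from diverse sources.
appended = source_a + source_b

# Cleanse: drop records with missing values.
cleansed = [r for r in appended if r["temp_c"] is not None]

# Transform: convert Celsius to Fahrenheit.
transformed = [{"temp_f": r["temp_c"] * 9 / 5 + 32} for r in cleansed]

# Analyze: compute a simple aggregate.
average_f = sum(r["temp_f"] for r in transformed) / len(transformed)
```

Real platforms perform each of these stages at scale and add indexing, search, and visualization on top, but the flow of operations is the same.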
- Real-time analytics
Traditionally, analytics always took place after the event or time period being analyzed. With big data, analytics takes place in real time, as the data is being gathered, and findings are presented practically instantaneously. This capability enables breakthroughs in domains such as medicine, safety, smart cities, manufacturing, and transportation.
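The core idea of analyzing data as it arrives can be illustrated with a small incremental aggregate: each new reading updates the result immediately, without waiting for, or storing, the full dataset. The stream here is a plain Python iterable standing in for a live feed.

```python
def running_mean(stream):
    """Yield the mean after each new reading, without storing the stream."""
    total = 0.0
    for count, value in enumerate(stream, start=1):
        total += value
        yield total / count  # an up-to-date answer after every reading

# A stand-in for a live sensor feed.
means = list(running_mean([10.0, 20.0, 30.0]))
```

After each reading the latest mean is already available, which is the essence of streaming analytics; production systems apply the same incremental pattern to far richer aggregates.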
- Distributed architecture
While traditional data is based on a centralized database architecture, big data uses a distributed architecture. Computation is distributed among several computers in a network. This makes big data far more scalable than traditional data, in addition to delivering better performance and cost benefits. The use of commodity hardware, open-source software, and cloud storage makes big data storage even more economical.
In the traditional approach, data quality checks and normalization are performed up front, and the data is then modeled so that it can be stored in a data warehouse; a big data architecture defers much of this work until the data is actually accessed.
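The distributed computation described above follows a map-and-merge pattern: each worker processes its own shard of the data, and the partial results are then combined. A single-machine sketch (with hypothetical log lines as the data) shows the shape of it:

```python
from collections import Counter
from functools import reduce

# Hypothetical log lines, split into shards as a distributed store would hold them.
shards = [
    ["error timeout", "ok"],
    ["error disk", "ok", "ok"],
]

def map_shard(lines):
    """Map step: each worker counts words in its own shard independently."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# Each shard is processed independently (in a cluster, on separate machines).
partials = [map_shard(shard) for shard in shards]

# Reduce step: merge the per-worker partial counts into a global result.
totals = reduce(lambda a, b: a + b, partials)
```

Because shards are processed independently, adding machines adds capacity, which is what makes the distributed approach scale while running on commodity hardware.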
- Multitude of sources
Traditionally, the sources of data were fairly limited. Today there is a data explosion thanks to a multitude of sources that capture data practically every moment. Readings from medical equipment, air particle counters, crowd density calculators, and embedded devices in vehicles are only a few examples that show the huge volume, as well as variety, of big data from different sources.
- Enables exploratory analysis
In the traditional approach to data analytics, users had to determine their questions at the start. Data was then structured to answer those questions, and reports were generated.
Big data, however, enables a more iterative and exploratory approach. The focus is to develop a platform for creative discovery so that users can explore what questions can be asked. In a business scenario, the traditional approach led to the creation of monthly reports, productivity analysis, customer survey findings, etc. Big data provides insights into sentiment analysis, product strategy, asset utilization, preventive maintenance of equipment, etc.
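The exploratory style can be sketched as an ad-hoc grouping over raw records: the question (group by which field?) is chosen at query time rather than when the data was stored. The records and field names below are hypothetical.

```python
from collections import defaultdict

# Raw, loosely structured feedback records (hypothetical data).
records = [
    {"product": "A", "text": "love it"},
    {"product": "B", "text": "too slow"},
    {"product": "A", "text": "great value"},
]

def explore(records, key):
    """Ad-hoc grouping: the grouping key is chosen at query time."""
    groups = defaultdict(list)
    for r in records:
        groups[r.get(key, "unknown")].append(r["text"])
    return dict(groups)

# Today's question groups by product; tomorrow's could group by any other field.
by_product = explore(records, "product")
```

The same raw store answers whatever question the analyst thinks of next, which is exactly what a fixed report schema cannot do.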
In order to reap all the benefits of big data analytics, experts and technology platforms have to overcome a variety of challenges. These pertain to scale, heterogeneity, lack of structure, error-handling, privacy, timeliness, provenance, and visualization. While there are challenges during all stages — from data acquisition to interpretation — big data analytics technologies and tools are evolving every day.