Marketing strikes again! Big Data is such a catchy, vague term that it was destined to become a buzzword. It is much like that magical place, the “cloud”, where panda bears ride unicorns that sing “It’s a Small World After All”. In my job as a Partner Technology Strategist at Microsoft, part of my responsibility is to be technically deep in Data Platforms and Advanced Analytics. I can’t seem to convince my mom what that means and that, yes, it is a real job, but I digress. I spend quite a bit of time explaining what big data is and what it really means. In this post I will cover how I define “Big Data”, walk through the data analytics pipeline, give an overview of lambda architecture, and finally share how I talk about big data.
I love talking to people about their environments and their data. The environments vary wildly in size and data type. But do they really have “Big Data”? Data is considered big data if it has at least one of the three Vs: volume, variety, or velocity.
For years, organizations have collected vast amounts of data, and the volume is growing at an exponential rate. In a presentation I recently gave for a partner, I used a few examples of scientific data collection.
- In 2000 the Sloan Digital Sky Survey collected more data in its first week than was collected in the entire history of astronomy
- By 2016 the new Large Synoptic Survey Telescope in Chile will acquire 140 terabytes in 5 days, more than Sloan acquired in 10 years
- The Large Hadron Collider at CERN generates 40 terabytes of data every second
The amount of data that is being collected can reach into the hundreds of gigabytes, terabytes, or even petabytes. I saw a statistic recently that in 2010 Twitter was generating over 1 TB of tweets daily. While these examples are meant to be extreme, I have worked with smaller organizations that have hundreds of terabytes of data, and that qualifies them as having big data.
Variety refers to the types of data that an organization collects. Any given organization may have structured data from its ERP system and unstructured data that it is collecting from social media for brand analysis. These two data sets vary not only by type but in schema as well. Organizations looking to make sense of these seemingly unrelated data types have big data. With these data types, questions like the following can be analyzed: is Twitter activity or brand sentiment affecting sales?
Velocity is self-explanatory, but in the context of big data it usually describes records that are small in size yet enter the system at a rapid rate. This is the type of data generated by sensors, IoT devices, or SCADA systems. These types of environments may generate 100,000 1 KB tuples per second.
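To put that rate in perspective, here is a quick back-of-the-envelope calculation in Python. It is only a sketch: it treats 1 KB as 1,000 bytes and assumes the stream runs at a constant rate all day, and the function and constant names are my own for illustration.

```python
# Figures taken from the velocity example: 100,000 tuples per second, 1 KB each.
TUPLES_PER_SECOND = 100_000
TUPLE_SIZE_BYTES = 1_000          # assumption: 1 KB treated as 1,000 bytes
SECONDS_PER_DAY = 24 * 60 * 60    # assumption: the stream never pauses


def bytes_per_second(tuples_per_second: int, tuple_size_bytes: int) -> int:
    """Raw ingest rate in bytes per second."""
    return tuples_per_second * tuple_size_bytes


def bytes_per_day(tuples_per_second: int, tuple_size_bytes: int) -> int:
    """Total bytes ingested over one full day at a constant rate."""
    return bytes_per_second(tuples_per_second, tuple_size_bytes) * SECONDS_PER_DAY


rate = bytes_per_second(TUPLES_PER_SECOND, TUPLE_SIZE_BYTES)
daily = bytes_per_day(TUPLES_PER_SECOND, TUPLE_SIZE_BYTES)

print(f"{rate / 1e6:.0f} MB per second")   # 100 MB per second
print(f"{daily / 1e12:.2f} TB per day")    # 8.64 TB per day
```

So even though each tuple is tiny, the stream adds up to several terabytes a day under these assumptions, which is why velocity alone can put an organization in big data territory.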
Data Analytics Pipeline and Lambda Architecture
While there is certainly debate about additional ways to define big data, what we have established in this post allows us to shift focus to how we actually process the data. The stages of the data analytics pipeline follow the logical flow of the data: ingest, processing, storage, and delivery. When we discuss the three Vs, it is clear that there are many different types of data and that the amount of data to be processed can be quite large. Enter lambda architecture.
Lambda architecture was designed to meet the challenge of handling the data analytics pipeline through two avenues: streaming and batch processing. These two data pathways merge just before delivery to create a holistic picture of the data. The streaming layer handles high-velocity data, processing it in real time. The batch layer handles large volumes of data, and batch processing can take extended periods of time. By combining the layers, the streaming data can fill in the time gap left by the batch layer. The image below illustrates this concept.
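The merge of the two layers can also be sketched in a few lines of Python. This is a toy illustration, not how any particular Azure service implements it: the event shape, the `BATCH_CUTOFF` timestamp, and the function names `batch_view`, `speed_view`, and `serve` are all hypothetical, and a simple per-sensor count stands in for whatever analysis the real pipeline would run.

```python
from collections import Counter

# Hypothetical events: each has a timestamp and the sensor that produced it.
events = [
    {"ts": 1, "sensor": "a"},
    {"ts": 2, "sensor": "b"},
    {"ts": 3, "sensor": "a"},
    {"ts": 4, "sensor": "a"},
]

# Assumption: the last batch run covered everything up to and including ts=2.
BATCH_CUTOFF = 2


def batch_view(all_events, cutoff):
    """Batch layer: recompute counts over all data up to the last batch run."""
    return Counter(e["sensor"] for e in all_events if e["ts"] <= cutoff)


def speed_view(all_events, cutoff):
    """Streaming layer: count only events that arrived after the batch run."""
    view = Counter()
    for e in all_events:
        if e["ts"] > cutoff:
            view[e["sensor"]] += 1  # updated one event at a time
    return view


def serve(batch, speed):
    """Merge both views just before delivery to get the full picture."""
    return batch + speed


batch = batch_view(events, BATCH_CUTOFF)
speed = speed_view(events, BATCH_CUTOFF)
merged = serve(batch, speed)

print(dict(merged))  # {'a': 3, 'b': 1}
```

The batch view on its own is stale (it only knows about the first two events), and the streaming view on its own is incomplete. Merged, they match a count over every event, which is the time gap being filled.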
In my role at Microsoft, I find myself having this discussion not only with partners but with internal resources as well, and I present it in a format very similar to this post. With this basic understanding, we are able to explore the more interesting topic of how Microsoft has created services on Azure to support this model and the interesting products and solutions our partners and customers are building with them. Are you using big data services on Azure? We would love to hear about it in the comments below. If you are interested in learning more about Azure, you can find more posts and information on the US Partner Blog, where I also write.