Intro to the BigData world
February 6, 2019
Nowadays “Big Data” (or “BigData”) is quite a popular term in IT, but what actually is it? How can you define this “BigData”? Where does “BigData” start? If I have, let’s say, 100 GB of data, am I working as a BigData engineer or not?
There is no strict definition of what should be considered Big Data, but there are some features which are applicable to it. So if the data you are working with satisfies them, then most probably you are a BigData Engineer 🙂
Rule of the 3 Vs
The most popular definition of Big Data is related to the 3 “V”s – Volume, Velocity, Variety.
Volume is the most obvious one – the volume of the data you deal with is big. But how big should it be? There is no strict answer to that. Under certain circumstances 10 GB could be considered BigData, while sometimes even 1 TB is not that much 🙂 The definition of Volume I like the most compares the amount of data you work on with the amount of memory you have. If the available memory is less than the volume of the data, that is, you cannot process everything in memory, then it is considered to be Big Data. Of course, there are some exceptions as well. For example, the Big Data framework Apache Spark (which we will learn about in detail later) processes everything in memory (or at least tries to). In this case our definition of Volume is not applicable, but we are still working with BigData. Confusing? Of course 😀
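To make the “data bigger than memory” idea concrete, here is a minimal sketch in plain Python: instead of loading a whole file at once, we stream it row by row, so memory usage stays constant no matter how large the file is. The file name and column names below are made up for illustration.

```python
import csv
import tempfile

def sum_column(path, column):
    """Stream the CSV row by row so memory usage stays constant,
    even if the file is far larger than the available RAM."""
    total = 0.0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += float(row[column])
    return total

# Tiny demo file standing in for a hypothetical multi-terabyte dataset.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as f:
    f.write("user_id,amount\n1,10.5\n2,4.5\n3,5.0\n")
    demo_path = f.name

print(sum_column(demo_path, "amount"))  # 20.0
```

This is exactly the trade-off Spark makes differently: it keeps as much as possible in memory across a cluster instead of streaming from disk on a single machine.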
Velocity stands for how fast new data arrives and also how fast you need / must process it. A few years ago it was quite usual to have some files arriving, for example, once an hour, and to process them once a day, when you had a batch of such files. Nowadays that is too slow, as we want to process data and get insights as soon as possible. That’s why data streaming pipelines are getting more and more popular. Hundreds, thousands or even millions of events (pieces of data) fire every second, and we need to do something with them. One of the most popular and in-demand combinations for dealing with streaming is Apache Kafka + Apache Spark, and every Data Engineer should know how to work with it. A lot of articles / manuals related to Kafka + Spark will be available in our blog, so stay tuned.
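The core streaming idea can be shown without any framework at all: events arrive one by one and we update our results immediately, instead of waiting for a daily batch. In the toy sketch below, a plain Python generator stands in for a real source such as a Kafka topic; the event fields are invented for illustration.

```python
from collections import Counter

def event_stream():
    """Stand-in for a real event source (e.g. a Kafka topic):
    yields events one at a time, as they 'arrive'."""
    for user, action in [("alice", "click"), ("bob", "view"),
                         ("alice", "view"), ("bob", "click"),
                         ("alice", "click")]:
        yield {"user": user, "action": action}

# Update running state per event - no need to wait for a batch.
counts = Counter()
for event in event_stream():
    counts[event["action"]] += 1

print(counts)  # Counter({'click': 3, 'view': 2})
```

A real Kafka + Spark pipeline does the same thing conceptually, just distributed, fault-tolerant, and at a much higher event rate.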
Variety is also quite obvious – we should be ready / able to deal with different forms of data arriving in different formats. It could be structured data arriving in CSV format, where each column is separated with “,”; it could also be completely unstructured data, like posts in social networks, and we should be able to process and interpret it as well. Because there is no point in just storing data if you cannot make use of it.
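A small sketch of what Variety means in practice: the same pipeline may need to handle a structured CSV record and an unstructured social-media post. The field names and the hashtag extraction below are assumptions made up for illustration.

```python
import csv
import io
import re

def parse_csv_record(line):
    """Structured input: fixed columns separated by commas."""
    user_id, city, amount = next(csv.reader(io.StringIO(line)))
    return {"user_id": int(user_id), "city": city, "amount": float(amount)}

def parse_post(text):
    """Unstructured input: extract whatever signals we can, e.g. hashtags."""
    return {"text": text, "hashtags": re.findall(r"#(\w+)", text)}

structured = parse_csv_record("42,Berlin,19.99")
unstructured = parse_post("Loving the new phone! #bigdata #tech")
print(structured["city"], unstructured["hashtags"])  # Berlin ['bigdata', 'tech']
```

The structured record maps cleanly onto a schema, while the unstructured one only yields partial signals, which is exactly why Variety is hard.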
The last point to discuss in this post – “Big Data” or “BigData”? 😀
There is no answer to this question either; there are a lot of articles / books where you can find one form or the other. In this post you can find both forms as well 😀
Personally, I prefer “BigData”, and in upcoming posts you will probably see it more often.
Hopefully this post was useful for you, and we would like to get some feedback as well. So please leave a comment: how do you feel about it? Was it OK, or should something be adjusted?
Next time, as promised, we’ll talk about BigData sandboxes / Virtual Machines. The post should be available next Monday, 11th of February. Stay tuned!