Overview of BigData Sandboxes
February 11, 2019
The first thing you are probably wondering is what a Sandbox is and why we need one. Nobody forces you to use a sandbox, but it will make your life much easier, because you will not need to set up anything on your own 😛
So, a Sandbox in general is a collection of pre-installed and pre-configured tools / software needed to work on a particular task / use case. In most cases Sandboxes are fully functional: everything is ready to use, and there are no restrictions on how you use it.
If we talk about Sandboxes specifically for BigData, two Sandboxes / vendors should be mentioned: Cloudera and Hortonworks. As of the beginning of 2019, the two companies have announced that they are merging into one company, so at some point in the future there will probably be only one Sandbox 🙂 For now, though, each company still offers its own separate Sandbox.
The Cloudera Sandbox can be downloaded for free here – https://www.cloudera.com/downloads/quickstart_vms/5-13.html . Images for both VirtualBox and VMWare are available. Which one to use is a matter of taste; I personally prefer VirtualBox. Setup is straightforward and does not require any BigData skills. You download the Virtual Machine (VM) archive (note that registration is required for the download), unzip it, and double-click the smaller of the extracted files. An auto-import dialog will pop up, where some import options can be changed. It is OK to import the VM “as is”, but if possible consider giving it more RAM. 8 GB is the absolute minimum; if you can, give it 16 GB or more and at least 2 physical cores. Once the import has finished, you can start the VM. The Cloudera VM has a graphical interface available directly inside the VM, which is why it is considered more user-friendly.
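If you prefer the command line to the import dialog, the same import can be sketched with VirtualBox's VBoxManage tool. The Python snippet below only builds and prints the command. It assumes VBoxManage is installed and on your PATH, and the appliance file name is a hypothetical placeholder, so replace it with the file you actually unzipped.

```python
import shlex


def build_import_command(appliance_path, memory_gb=16, cpus=2):
    """Build a VBoxManage command that imports an appliance with custom RAM and CPUs."""
    memory_mb = memory_gb * 1024  # VBoxManage expects memory in megabytes
    return [
        "VBoxManage", "import", appliance_path,
        "--vsys", "0",              # settings below apply to the first VM in the appliance
        "--memory", str(memory_mb),
        "--cpus", str(cpus),
    ]


if __name__ == "__main__":
    # Hypothetical file name, used for illustration only.
    command = build_import_command("cloudera-quickstart-vm.ovf", memory_gb=16, cpus=2)
    print(shlex.join(command))
```

Run the printed command yourself (or pass the list to subprocess.run) once you have checked the path and the values.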
The Hortonworks Sandbox (Hortonworks Data Platform – HDP) is also available as VirtualBox and VMWare images. It can be downloaded from here – https://hortonworks.com/downloads/#sandbox . The Sandbox you need is listed under the HDP section. Installation is exactly the same as for the Cloudera VM, but the HDP VM differs from the Cloudera VM in 2 ways.
The first difference is that HDP requires more memory 🙂 10 GB is the minimum, but, as with Cloudera, I would strongly recommend giving it at least 16 GB. It is suggested to dedicate 4 physical cores to this VM; you should accept that, provided, of course, that your processor has that many physical cores.
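To keep the numbers for both Sandboxes in one place, here is a small Python sketch that compares the resources you plan to give a VM with the figures mentioned in this post. The dictionary and function names are made up for illustration; only the numbers come from the text.

```python
# Figures taken from this post; the structure and names are illustrative only.
REQUIREMENTS = {
    "cloudera": {"min_ram_gb": 8, "recommended_ram_gb": 16, "cores": 2},
    "hortonworks": {"min_ram_gb": 10, "recommended_ram_gb": 16, "cores": 4},
}


def check_vm_settings(sandbox, ram_gb, cores):
    """Return a list of warnings for the planned VM settings (empty list means OK)."""
    req = REQUIREMENTS[sandbox]
    warnings = []
    if ram_gb < req["min_ram_gb"]:
        warnings.append(
            f"{ram_gb} GB RAM is below the {req['min_ram_gb']} GB minimum"
        )
    elif ram_gb < req["recommended_ram_gb"]:
        warnings.append(
            f"{ram_gb} GB RAM works, but {req['recommended_ram_gb']} GB is recommended"
        )
    if cores < req["cores"]:
        warnings.append(f"{cores} cores is fewer than the suggested {req['cores']}")
    return warnings


if __name__ == "__main__":
    for message in check_vm_settings("hortonworks", ram_gb=12, cores=2):
        print(message)
```

An empty list means the planned settings meet the recommendations from this post.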
The second difference is that the HDP VM does not offer a graphical interface inside the VM, as Cloudera does. But do not panic! 🙂 There is a graphical cluster manager, Ambari, available at http://localhost:8080 , which allows you to configure / monitor your HDP VM. For development you will need to use a separate machine, whereas with the Cloudera VM everything can be done inside the VM.
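Since there is no desktop to look at, a quick way to see whether the HDP VM has finished starting is to check that something answers on the Ambari address. The Python sketch below does that using only the standard library. The function name is made up, and it only tests that the port responds over HTTP, not that Ambari itself is healthy.

```python
import http.client
import urllib.error
import urllib.request


def is_reachable(url, timeout=3.0):
    """Return True if an HTTP server answers at the given URL, False otherwise."""
    # Skip any configured proxy: the sandbox is expected on localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered with an error status, so it is up.
        return True
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or a malformed response.
        return False


if __name__ == "__main__":
    ambari_url = "http://localhost:8080"  # address mentioned in this post
    if is_reachable(ambari_url):
        print("Ambari port is answering, open it in your browser")
    else:
        print("Nothing is answering yet, the VM may still be starting")
```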
These 2 Sandboxes are the most popular ones, come from the industry leaders, and are strongly recommended. Some other companies, such as Yandex, have their own distributions, but in most cases these distributions are not publicly available and are meant for specific tasks (performed in the most efficient manner).
Let’s get back to the question we asked at the beginning of this post: why do we need / want to use a Sandbox? Both the Cloudera and Hortonworks distributions consist of free, open-source software (mostly from Apache – apache.org ), and in theory anyone could download and set it up themselves. The catch is that you would then need to configure all these tools so that they work together. In a Sandbox everything is already pre-configured, so you do not need to worry about that 🙂
In the next post, which should be available next Monday, the 18th of February, we will take a close look at the Cloudera VM: which tools are included in it, what each tool is used for, etc. One week later, a similar post will be available for the Hortonworks VM as well.
During the weekend of the 16th-17th of February, a video guide accompanying this post should also be available on our YouTube channel. In that video we will show the setup process in detail.
All posts on this blog are supposed to have an accompanying video, so I would strongly recommend subscribing to that channel. In general, videos will be available a bit later than posts: a new post will be published each Monday, and its video will follow on Friday or Saturday of the same week.
Thanks, and stay tuned, as we have so many new and interesting things to learn together!