HDFS overview
March 1, 2019
Last week we released a video on our YouTube channel about one of the central components of the Hadoop ecosystem – HDFS. If you have not seen it yet, now is a good time to catch up.
In this post we will talk about the theory behind HDFS – what it actually is, why it is needed, and how it works.
Let’s start by looking at the reasons why HDFS is needed. As its name implies, the Hadoop Distributed FileSystem (HDFS) is a distributed filesystem, not a local one. Hadoop is meant to run on a cluster, so multiple nodes (servers) are involved, and we need a way for files to be stored and accessed by every node in that cluster. Obviously, a local filesystem is not the best place to keep files that all other nodes need access to. Technically, access to files on a local filesystem could be granted via SSH / FTP – a special “shared” folder on a node where we put every file that should be shared with all other nodes. But for that we need SSH / FTP running on each node, we need an efficient access control system, and the “share” folder is still a single point of failure. A workaround could be to copy the content to several “share” folders on several nodes, but that makes the system even more complicated. And the worst part is that such a system is completely custom – no big community to ask questions, no crowd of developers moving its functionality forward (for free), etc. So why build something custom if HDFS already exists, right? 🙂
HDFS offers everything mentioned in the previous paragraph, plus it is completely free, it is integrated into the Hadoop ecosystem, and virtually all Big Data tools are compatible with it and rely on it heavily!
So what exactly is HDFS, and what are its pros and cons? Technically speaking, HDFS is a Java application that runs on every node of the cluster and accepts commands from each of them. For us the most important thing is that it takes care of everything mentioned above – access policies, synchronization, replication of data, etc.
Each server in the cluster plays one of two possible roles with regard to HDFS – it is either a NameNode or a DataNode.
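To make “a Java application that accepts commands” a bit more concrete, here is a minimal sketch of talking to HDFS through the Hadoop FileSystem client API. The NameNode address hdfs://namenode:8020 is just a placeholder – in a real setup it would normally come from your cluster’s core-site.xml (fs.defaultFS).

    // A minimal sketch: connect to HDFS and list the root directory,
    // roughly what `hdfs dfs -ls /` does. The NameNode URI is a placeholder.
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHello {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Connect to the cluster; in practice the URI comes from fs.defaultFS.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            // List everything stored directly under the HDFS root.
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
            fs.close();
        }
    }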
The NameNode is the most crucial component of HDFS, as it manages all HDFS-related metadata. For example, the NameNode keeps a special structure that records how many blocks each file is split into, on which servers each block is stored, and so on. Based on this information, when a client wants to pull some data, the NameNode can provide everything needed for the requested file to be delivered to the client. Each cluster should have exactly one active NameNode and, optionally, a standby NameNode that takes over once the primary NameNode goes down for any reason. If no working NameNode is available in the cluster, data cannot be read from or written to HDFS at all.
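To see that metadata in action, here is a small sketch that asks the NameNode where the blocks of a file live; the path /data/events.log is hypothetical – any file on your cluster will do.

    // A sketch of asking the NameNode which DataNodes hold each block of a file.
    // The file path below is hypothetical.
    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WhereAreMyBlocks {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus file = fs.getFileStatus(new Path("/data/events.log"));

            // The NameNode answers this from its metadata: which blocks the file
            // consists of and which DataNodes hold a replica of each block.
            BlockLocation[] blocks = fs.getFileBlockLocations(file, 0, file.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                        + ", length " + block.getLength()
                        + ", hosts " + Arrays.toString(block.getHosts()));
            }
            fs.close();
        }
    }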
The cluster should also contain at least one DataNode – this is where the actual HDFS data is stored. If you want your data to be replicated, you obviously need multiple DataNodes. The replication factor parameter defines how many copies of each block of a file are kept. The default value is 3, which means that every single block of a file is stored 3 times, on 3 different DataNodes (provided, of course, that there are at least 3 DataNodes in your cluster).
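For illustration, here is a small sketch of how the replication factor can be controlled from client code; the path and the factors used are just examples.

    // A sketch of controlling the replication factor. The cluster-wide default
    // normally lives in hdfs-site.xml as dfs.replication; a client can override it.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Override the default factor for files this client creates.
            conf.set("dfs.replication", "2");
            FileSystem fs = FileSystem.get(conf);

            // Or change the factor of an existing file; HDFS will add or remove
            // block replicas in the background to match the new value.
            fs.setReplication(new Path("/data/events.log"), (short) 3);
            fs.close();
        }
    }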
A few more things to keep in mind about HDFS – all files / data written to HDFS are immutable. This means you cannot edit a file once it is saved on HDFS. There are tools, like the Hue shown in the video, which appear to let you edit files. But what actually happens under the hood is that the original file is deleted and your edited version is saved to HDFS under the original file name. So no in-place edits are possible.
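Here is a sketch of that same delete-and-rewrite pattern done by hand with the Java API; the paths and content are made up for the example.

    // A sketch of what an "edit" on HDFS actually boils down to:
    // write a brand-new file, then swap it in under the original name.
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class EditByReplace {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path original = new Path("/data/notes.txt");
            Path temp = new Path("/data/notes.txt.tmp");

            // 1. Write the "edited" content to a new temporary file.
            try (FSDataOutputStream out = fs.create(temp, true)) {
                out.write("edited content\n".getBytes(StandardCharsets.UTF_8));
            }

            // 2. Delete the original and rename the new file into its place.
            fs.delete(original, false);
            fs.rename(temp, original);
            fs.close();
        }
    }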
Another limitation of HDFS – it does not work well with small files. It is designed for large files, and its engine is optimized for that. Ideally, files should be at least as large as one block – a block being the chunk into which every file is cut (128 MB by default in Hadoop 2 and later). If you have lots of files that are much smaller than a block, the overall performance of the cluster drops significantly, because the NameNode has to keep metadata for every single file and block in memory. To deal with this, a so-called “compaction job” is usually run, which merges small files into larger ones.
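Just to illustrate the idea (this is a naive stand-in, not a production compaction job, which would typically be a MapReduce or Spark job), here is a sketch that concatenates every small file in a hypothetical directory into one large output file.

    // A naive compaction sketch: read many small files and concatenate them
    // into a single large file. Directory names are hypothetical.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class NaiveCompaction {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path smallFilesDir = new Path("/data/small-files");
            Path compacted = new Path("/data/compacted/part-0");

            try (FSDataOutputStream out = fs.create(compacted, true)) {
                for (FileStatus status : fs.listStatus(smallFilesDir)) {
                    if (!status.isFile()) {
                        continue;
                    }
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        // Append this small file's bytes to the single big output file.
                        IOUtils.copyBytes(in, out, 4096, false);
                    }
                }
            }
            fs.close();
        }
    }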
That was a short overview of HDFS. Do not forget to subscribe to our channel – a new video will be available in the next couple of days. Stay tuned!
DataGuru Team