One of the challenges in starting a big data project is identifying the cluster requirements. In this post, we will discuss how to calculate cluster size based on (application) data. Below are the important factors for the calculation.
Average compression ratio (c): The default value is 1 if we store data without any compression. If we store data with compression, c is the fraction of the raw size that remains after compression: c = Compressed Size / Uncompressed Size. (Defining it this way keeps the no-compression default at 1.)
Example: Let's assume the size of the data without compression is 10TB. After compression, the data size is reduced to 7TB. In this case, the average compression ratio is 7/10 = 0.7. We use the default value of 1 for this example.
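As a quick sanity check, here is a minimal Python sketch of the ratio. Note that the factor that later multiplies the storage formula should be the fraction of data remaining after compression (compressed/uncompressed), so that the no-compression default comes out to 1:

```python
def compression_ratio(compressed_tb, uncompressed_tb):
    """Fraction of the raw size that remains after compression (1.0 = no compression)."""
    return compressed_tb / uncompressed_tb

# The example's figures: 10TB raw reduced to 7TB on disk.
print(compression_ratio(7, 10))  # 0.7
```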
Replication factor (r): Replication is one of the core principles of HDFS storage. The default is 3.
Size of data (S): The size of the data estimated for application storage.
Intermediate factor (i): This is important, as we need to leave some space on each data node to process data locally. The recommended value is 25%.
With all the above values, and assuming S = 20TB for this example, calculate Hadoop storage H = (c * r * S) / (1 - i) = (1 * 3 * 20TB) / (1 - 0.25) = 60TB / 0.75 = 80TB.
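The storage formula above can be sketched as a small helper; the parameter names mirror the factors defined in this post:

```python
def hadoop_storage_tb(c, r, s_tb, i):
    """Raw HDFS capacity needed, in TB.

    c: average compression ratio (1 = no compression)
    r: HDFS replication factor
    s_tb: estimated application data size, in TB
    i: intermediate factor (fraction of each node reserved for local processing)
    """
    return (c * r * s_tb) / (1 - i)

# The worked example: c=1, r=3, S=20TB, i=25%.
print(hadoop_storage_tb(c=1, r=3, s_tb=20, i=0.25))  # 80.0
```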
Now calculate the data nodes required to store the data. Assume each data node provides d = 5TB of disk space. Number of data nodes D = Hadoop storage / disk space per node = H/d = 80TB / 5TB = 16. We need 16 data nodes for the entire storage.
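The node count can be computed the same way; one detail worth making explicit in code is rounding up, since a fractional node still requires a whole machine:

```python
import math

def num_datanodes(storage_tb, disk_per_node_tb):
    """Number of data nodes needed to hold the given raw HDFS storage."""
    # Round up: 81TB on 5TB nodes would need 17 nodes, not 16.2.
    return math.ceil(storage_tb / disk_per_node_tb)

print(num_datanodes(80, 5))  # 16
```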
We also need a NameNode and two secondary NameNodes, i.e. Masters (M) = 3.
Cluster size = D + M = 16 + 3 = 19
So, we need a cluster size of 19 nodes.
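Putting the whole walkthrough together, here is an end-to-end sketch of the sizing calculation (the default of 3 master nodes follows the NameNode count assumed in this post):

```python
import math

def cluster_size(c, r, s_tb, i, disk_per_node_tb, masters=3):
    """Total nodes needed: master nodes plus data nodes.

    c: compression ratio, r: replication factor, s_tb: data size in TB,
    i: intermediate factor, disk_per_node_tb: usable disk per data node in TB.
    """
    h = (c * r * s_tb) / (1 - i)           # raw HDFS storage required
    d = math.ceil(h / disk_per_node_tb)    # data nodes, rounded up to whole machines
    return masters + d

# The post's example: 20TB of data, no compression, 3x replication,
# 25% local-processing headroom, 5TB disks per data node.
print(cluster_size(c=1, r=3, s_tb=20, i=0.25, disk_per_node_tb=5))  # 19
```

Varying a single parameter (for instance, enabling compression with c = 0.7) makes it easy to see how the required cluster size shrinks or grows.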