Learning HBase

By: Shashwat Shriparv

Capacity planning


Suppose we have around 2 TB of raw data with a replication factor of 3. The replicated data alone occupies 3 * 2 = 6 TB, and roughly 2 TB of additional space is needed on top of that. So, for 2 TB of raw data, we can build a cluster of 4 to 8 DataNodes providing about 8 TB of disk storage in total.

This extra space is needed for the intermediate and temporary files generated during read/write operations and MapReduce jobs. If the input data is huge and the MapReduce code processes all of it, the temporary and intermediate result files themselves require a large amount of HDFS storage; if we do not provide enough disk space, many tasks will fail and nodes will get blacklisted. It is therefore advisable to provision 25 to 50 percent more storage than the original data size (before replication); at a minimum, allow 25 percent extra if we want MapReduce jobs to run without frequent failures.
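The sizing rule above can be sketched as a small helper. This is an illustrative calculation only, not a function from HBase or Hadoop; the name `hdfs_capacity_tb` and the `headroom` parameter (extra space as a fraction of the raw, unreplicated data size) are assumptions made for this example:

```python
def hdfs_capacity_tb(raw_tb, replication=3, headroom=0.25):
    """Rough HDFS capacity estimate: replicated data plus headroom
    for temporary/intermediate files, where headroom is a fraction
    of the raw (unreplicated) data size."""
    return raw_tb * replication + raw_tb * headroom

# 2 TB of raw data, replication factor 3:
print(hdfs_capacity_tb(2))               # 25% headroom -> 6.5 TB
print(hdfs_capacity_tb(2, headroom=0.5)) # 50% headroom -> 7.0 TB
print(hdfs_capacity_tb(2, headroom=1.0)) # the ~8 TB cluster sized above
```

In practice, you would round the result up to the disk capacity actually available across the DataNodes, which is why 2 TB of raw data ends up on a cluster with around 8 TB of disk.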

So, we can apply an approximate formula, as follows...