Sangala Shekhar Reddy: HBase cluster sizing issues.

Sunday, April 29, 2018

Is it OK for the HBase Master and Hadoop NameNode (+JobTracker) to run on the same server?

The NameNode needs memory. The HBase Master is normally not very busy: it just needs to be available when region servers check in and to maintain timely ZooKeeper heartbeats. As long as there is sufficient RAM on the combined NameNode + Master (+ JobTracker) server that the system never swaps, running them on the same server is OK.
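
As a rough illustration of the "never swaps" condition, the sketch below adds up the JVM heaps of the co-located daemons and compares the total with physical RAM. The heap sizes, the 25% allowance for non-heap JVM memory, and the 2 GB reserved for the operating system are all hypothetical placeholders, not recommendations from this post; substitute numbers measured on your own hosts.

```python
def fits_without_swapping(physical_ram_gb, daemon_heaps_gb,
                          os_reserve_gb=2.0, jvm_overhead=0.25):
    """Return (fits, needed_gb) for a set of co-located JVM daemons.

    jvm_overhead and os_reserve_gb are assumed safety margins,
    not values taken from HBase or Hadoop documentation.
    """
    # Each JVM uses more than its heap (thread stacks, metaspace, buffers).
    jvm_total = sum(heap * (1 + jvm_overhead) for heap in daemon_heaps_gb.values())
    needed_gb = jvm_total + os_reserve_gb
    return needed_gb <= physical_ram_gb, needed_gb


# Hypothetical heap sizes for a combined master server.
heaps = {"NameNode": 8, "HBase Master": 1, "JobTracker": 2}
fits, needed = fits_without_swapping(16, heaps)
print(f"needs about {needed:.2f} GB, fits in 16 GB: {fits}")
```

If the check fails, either lower the heaps or move one of the daemons to another server.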

You can consider running multiple HBase Masters to remove one single point of failure from the deployment. For a non-high-availability deployment it makes sense to run them all on one server. For a high-availability deployment, we recommend running HBase Masters alongside both the NameNode and the Secondary/Standby node, which gives you the necessary redundancy.

Is it OK for the HBase RegionServer and Hadoop DataNode (+ TaskTracker) to run on the same server?

Yes, this is advised to ensure data locality. Eventually, the HDFS data that backs the region stores is brought local through background compaction. MapReduce jobs that run against HBase after this happens access data locally, because each split corresponds to a region and the task is scheduled on the corresponding region server.

Is the HBase RegionServer a memory-hungry process?

Yes. The more RAM you can give to the region servers, the better for performance:
- Read caching (block cache) to avoid needing to hit the file system to serve frequently accessed data
- Write caching (MemStore) to ride over flushes and compactions without blocking clients
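
To see how a RegionServer heap divides between these two caches, here is a minimal sketch. It assumes the split is controlled by the `hfile.block.cache.size` and `hbase.regionserver.global.memstore.size` properties, that both default to 0.4 of the heap, and that HBase refuses to start when the two together exceed 0.8. That is my understanding of recent HBase versions, so check the defaults for the release you run. The function name is hypothetical.

```python
def split_regionserver_heap(heap_gb, block_cache_fraction=0.4, memstore_fraction=0.4):
    """Estimate how a RegionServer heap is divided (sizes in GB).

    The default fractions and the 0.8 ceiling are assumptions about
    HBase behaviour; verify them against your version's documentation.
    """
    for name, value in (("block cache", block_cache_fraction),
                        ("memstore", memstore_fraction)):
        if not 0 <= value <= 1:
            raise ValueError(f"{name} fraction must be between 0 and 1")
    if block_cache_fraction + memstore_fraction > 0.8 + 1e-9:
        raise ValueError("block cache + memstore must leave at least 20% of the heap free")
    block_cache_gb = heap_gb * block_cache_fraction
    memstore_gb = heap_gb * memstore_fraction
    return {
        "block_cache_gb": round(block_cache_gb, 2),  # read caching
        "memstore_gb": round(memstore_gb, 2),        # write caching
        "other_gb": round(heap_gb - block_cache_gb - memstore_gb, 2),
    }


print(split_regionserver_heap(16))
```

A read-heavy workload might shift the balance toward the block cache, and a write-heavy one toward the MemStore, as long as the combined ceiling is respected.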

Do I need dedicated boxes for each ZooKeeper?

It is advised to run ZooKeeper on dedicated hardware. If that is not an option, you can run ZooKeeper alongside the NameNode, JobTracker, and Standby node (Secondary NameNode). In a pinch you can co-locate ZooKeeper on DataNode/TaskTracker/RegionServer boxes, but this is not recommended. ZooKeeper does not take up a lot of resources on its own, but when starved for resources it can cause RegionServer timeouts.

ZooKeeper is a 2N+1 fault-tolerant system: an ensemble of 2N+1 servers keeps working with up to N of them down. Deploy 3 servers if you can stand to lose only one, or 5 if you want to be able to lose up to 2, and so on. There are diminishing returns after 7 or 9. Though this may seem like a lot of overhead just to run HBase, ZooKeeper provides value beyond HBase: synchronization primitives for your service or application, hosting for dynamic configuration (with watchers to get notice of changes), and management of presence and group membership.
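
The 2N+1 arithmetic can be sketched in a few lines. ZooKeeper needs a strict majority of the ensemble to be up, so the number of failures it can tolerate is whatever is left after a majority is set aside. The function names below are hypothetical helpers, not part of any ZooKeeper API.

```python
def tolerated_failures(ensemble_size):
    """Servers that can fail while a strict majority (quorum) survives."""
    if ensemble_size < 1:
        raise ValueError("ensemble must have at least one server")
    quorum = ensemble_size // 2 + 1
    return ensemble_size - quorum


def ensemble_size_for(failures):
    """Smallest ensemble that survives the given number of failures."""
    if failures < 0:
        raise ValueError("failures cannot be negative")
    return 2 * failures + 1


for size in (3, 4, 5, 7):
    print(size, "servers tolerate", tolerated_failures(size), "failure(s)")
```

Note that 4 servers tolerate no more failures than 3, which is why odd ensemble sizes are the usual choice.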

What's the minimum cluster size?

For a non-high-availability system with local disk, we recommend three combined RegionServer-TaskTracker-DataNode servers, plus one additional server each for the HBase Master-NameNode-JobTracker and for ZooKeeper, as something minimally useful. Also remember to tune HDFS for such a small cluster: set the minimum replication to 1 or 2.

For a high-availability system, we recommend the same three RegionServer-TaskTracker-DataNode servers, plus two additional servers each for the HBase Master-NameNode-JobTracker and for ZooKeeper.

What are good starting points for HBase Master and RegionServer JVM Heap sizes?

Small development clusters can start with heap sizes of 1GB for the Master and 4GB for RegionServers.

Production clusters should plan on a RegionServer heap size of at least 16GB as a starting point. The HBase Master process is not as memory intensive and will normally require less memory than the RegionServers.
