We are running an HBase (currently 0.20.4) cluster on EC2. I thought it would be useful to share some tips about running HBase on EC2.
1) Use private DNS addresses in config files such as hdfs-site.xml and hbase-site.xml. On EC2 Ubuntu instances, Java's getHost() resolves to the private DNS address.
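As a rough illustration (the hostnames below are placeholders; substitute the ip-*.ec2.internal names EC2 assigns to your own instances), the relevant entries look something like this:

<!-- core-site.xml / hdfs-site.xml: point HDFS at the namenode's private DNS name -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://ip-10-251-27-12.ec2.internal:9000</value>
</property>

<!-- hbase-site.xml: use the same private DNS names for HBase and ZooKeeper -->
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://ip-10-251-27-12.ec2.internal:9000/hbase</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>ip-10-251-27-12.ec2.internal,ip-10-251-30-57.ec2.internal,ip-10-251-42-88.ec2.internal</value>
</property>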
2) Use c1.xlarge or bigger nodes to start with. I have seen Andrew Purtell (an HBase committer) recommend this on the HBase mailing list. We tried m1.large machines, and they worked well while our traffic was small. We hit HBase in real time, and as traffic increased we started getting CPU maxouts. Currently we use c1.xlarge machines.
3) Define a security group for all of your Hadoop/HBase nodes. Hadoop, HBase, and ZooKeeper nodes need to talk to each other on many ports. It's better to assign one security group to all the nodes and give that group permission to talk to itself.
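A minimal sketch with the classic ec2-api-tools (the group name, account id, AMI id, and key name are placeholders):

# create one group for all Hadoop/HBase/ZooKeeper nodes
ec2-add-group hbase-cluster -d "Hadoop/HBase cluster"
# allow members of the group to talk to each other on any port
ec2-authorize hbase-cluster -o hbase-cluster -u 1234-5678-9012
# launch every node into that group
ec2-run-instances ami-xxxxxxxx -t c1.xlarge -g hbase-cluster -k my-keypair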
4) Use the dfs.hosts property in hdfs-site.xml. This property points to a file listing the DNS names of the nodes allowed to join the HDFS cluster; only nodes listed in the file can join. This becomes important if you spin up a QA cluster in EC2: you don't want a QA node to accidentally join your production cluster. When you add a new node, run the following command to refresh the allowed-nodes list and avoid a full cluster restart (a sample configuration follows at the end of this tip):
hadoop dfsadmin -refreshNodes
I am assuming here that you are not running MapReduce on the same cluster. If you do run MapReduce on the same cluster, do not use the similar setting in mapred-site.xml: there is no refreshNodes option available for mradmin yet; it's coming in 0.21.
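For reference, a minimal sketch of the setup (the file path and hostnames are just examples). In hdfs-site.xml:

<property>
  <name>dfs.hosts</name>
  <value>/mnt/hadoop/conf/dfs.hosts</value>
</property>

And in /mnt/hadoop/conf/dfs.hosts, one private DNS name per allowed node:

ip-10-251-27-12.ec2.internal
ip-10-251-30-57.ec2.internal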
5) Use EBS-backed instances for your Hadoop/HBase nodes. There are many benefits to using EBS volumes instead of S3-backed instances. For example, let's say you tried different JVM args on a node to test performance; if that very node goes down, you don't have to worry about losing those settings. All you need to do is take a snapshot of the volume and spawn a new instance from it. I generally find EBS-backed instances easier to deal with than S3-backed ones, and the added cost is small compared with the convenience of not losing data.
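As a rough sketch (the volume, snapshot, and instance ids are placeholders), the recovery path looks something like this:

# snapshot the node's EBS volume so configs and data survive the node
ec2-create-snapshot vol-xxxxxxxx
# later, restore a volume from the snapshot in the right availability zone
ec2-create-volume --snapshot snap-xxxxxxxx -z us-east-1a
# attach it to the replacement instance
ec2-attach-volume vol-yyyyyyyy -i i-xxxxxxxx -d /dev/sdh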
6) Prepare an AMI with Hadoop and HBase installed and all the configuration in place. This will allow you to add a node quickly when you need to.
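With EBS-backed instances, baking such an image from an already configured node can be as simple as this sketch (the instance id and AMI id are placeholders):

# create an image from a node that already has Hadoop/HBase set up
ec2-create-image i-xxxxxxxx -n "hadoop-hbase-node" -d "Hadoop/HBase with our configs"
# bring up a new node from that image later
ec2-run-instances ami-xxxxxxxx -t c1.xlarge -g hbase-cluster -k my-keypair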
7) Use Elastic MapReduce to run MapReduce jobs against your HBase. We have found this much more cost effective. Let's say you have a 7-node HBase cluster, but your MapReduce jobs need at least 10 nodes to finish on time. Elastic MapReduce lets you spawn a 10-node job flow against your HBase. The load on your HBase cluster goes down, and you don't have to pay for 3 extra nodes 24x7. Furthermore, you can use cheaper instances for your EMR job than you use for HBase (c1.xlarge) if you want to. We use c1.mediums.
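As a sketch with the elastic-mapreduce command line client (the job name, jar location, and argument are placeholders), spawning such a job flow looks something like this:

# run a transient 10-node job flow on cheaper instances
elastic-mapreduce --create --name "hbase-mapreduce" \
  --num-instances 10 --instance-type c1.medium \
  --jar s3://my-bucket/my-job.jar --arg my_table

The job still has to be told where your HBase cluster is, for example by bundling your hbase-site.xml into the jar or passing the ZooKeeper quorum as an argument.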
The HBase folks have written scripts to spawn a cluster on EC2. The scripts make it easier to bring up an entire cluster, so don't hesitate to try them out. Let me know if you have any other tips that you would like to share. I am sure many people are running their HBase clusters on EC2, and I am eager to learn from their experiences.