Set Up a Distributed HBase Cluster in Docker

Abstract

Since Hadoop first appeared, its ecosystem has kept growing, and a great deal of software has been developed around it. Apache HBase and Apache Hive are two examples.

In this experiment, in order to learn these two pieces of software, we use HBase and Hive to continue the research on wuxia novels mentioned before.

Introduction

Apache HBase is a distributed, column-oriented database modeled after Google's Bigtable. It is designed to store very large datasets and answer queries on them with low latency.

Apache HBase runs on top of HDFS, so it has to be installed on an existing HDFS cluster. We again use Docker to build the HBase image.

If you don’t want to learn how to write a Dockerfile at this time, you can skip directly to Setup the cluster.

Build Docker image

In this part, we are going to build a fully distributed HBase cluster in Docker. We will skip the process of deploying the ZooKeeper cluster and assume that there is already one consisting of three nodes: zookeeper_node1, zookeeper_node2 and zookeeper_node3.

You can find the script to set up a ZooKeeper cluster here: zookeeper.sh
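If you would rather not rely on the external script, the three ZooKeeper services can be created roughly as below. This is a sketch, not taken from zookeeper.sh: the image name zookeeper:3.4 and the ZOO_MY_ID/ZOO_SERVERS environment variables follow the official ZooKeeper Docker image and are assumptions. The helper only prints the commands (a dry run); pipe its output to `sh` to actually create the services.

```shell
# Sketch (assumption): generate the `docker service create` command for
# ZooKeeper node $1 on the same overlay network used by the Hadoop services.
# ZOO_MY_ID / ZOO_SERVERS follow the official zookeeper image's conventions.
zk_service() {
  echo "docker service create" \
    "--name zookeeper_node$1" \
    "--network swarm-net" \
    "--replicas 1" \
    "--detach=true" \
    "--env ZOO_MY_ID=$1" \
    "--env ZOO_SERVERS='server.1=zookeeper_node1:2888:3888 server.2=zookeeper_node2:2888:3888 server.3=zookeeper_node3:2888:3888'" \
    "zookeeper:3.4"
}

# Print one command per node; pipe to `sh` to deploy for real.
for i in 1 2 3; do zk_service "$i"; done
```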

Dockerfile

FROM newnius/hadoop:2.7.4

USER root

# Install Apache HBase
RUN wget -O apache-hbase.tar.gz http://mirrors.ocf.berkeley.edu/apache/hbase/1.2.6/hbase-1.2.6-bin.tar.gz && \
    tar xzvf apache-hbase.tar.gz -C /usr/local/ && rm apache-hbase.tar.gz

# Create a soft link to make future upgrade transparent
RUN ln -s /usr/local/hbase-1.2.6 /usr/local/hbase

ENV HBASE_HOME /usr/local/hbase
ENV PATH $PATH:$HBASE_HOME/bin

# Add default conf files for 1 master, 1 backup master and 3 regionservers
ADD hbase-site.xml $HBASE_HOME/conf
ADD hbase-env.sh $HBASE_HOME/conf
ADD regionservers $HBASE_HOME/conf
ADD backup-masters $HBASE_HOME/conf

WORKDIR /usr/local/hbase

CMD ["/etc/bootstrap.sh", "-d"]

Configuration files

hbase-env.sh

export JAVA_HOME=/usr/lib/jvm/java-1.8-openjdk/
export HBASE_MANAGES_ZK=false

hbase-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://hadoop-master:8020/hbase</value>
  </property>
  <!--
  <property>
    <name>hbase.master</name>
    <value>hadoop-master:60000</value>
  </property>
  -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zookeeper_node1,zookeeper_node2,zookeeper_node3</value>
  </property>
  <property>
    <name>hbase.regionserver.dns.interface</name>
    <value>eth0</value>
  </property>
</configuration>

backup-masters

hadoop-slave1

regionservers

hadoop-slave1
hadoop-slave2
hadoop-slave3
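With the Dockerfile and the four configuration files (hbase-env.sh, hbase-site.xml, regionservers, backup-masters) in one directory, the image can now be built. The tag below mirrors the one used in the service definitions; the registry push is optional and assumes every swarm node can reach your registry.

```shell
# Build the HBase image from the directory containing the Dockerfile
# and the four conf files added above.
docker build -t newnius/hbase:1.2.6 .

# Optionally push it so every node in the swarm can pull the same image.
docker push newnius/hbase:1.2.6
```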

Setup the cluster

hadoop-master

docker service create \
  --name hadoop-master \
  --network swarm-net \
  --hostname hadoop-master \
  --replicas 1 \
  --detach=true \
  --mount type=bind,source=/mnt/share/hadoop/,target=/share \
  --mount type=bind,source=/etc/localtime,target=/etc/localtime \
  --endpoint-mode vip \
  newnius/hbase:1.2.6

hadoop-slave1

docker service create \
  --name hadoop-slave1 \
  --network swarm-net \
  --hostname hadoop-slave1 \
  --replicas 1 \
  --detach=true \
  --mount type=bind,source=/etc/localtime,target=/etc/localtime \
  --endpoint-mode vip \
  newnius/hbase:1.2.6

hadoop-slave2

docker service create \
  --name hadoop-slave2 \
  --network swarm-net \
  --hostname hadoop-slave2 \
  --replicas 1 \
  --detach=true \
  --mount type=bind,source=/etc/localtime,target=/etc/localtime \
  --endpoint-mode vip \
  newnius/hbase:1.2.6

hadoop-slave3

docker service create \
  --name hadoop-slave3 \
  --network swarm-net \
  --hostname hadoop-slave3 \
  --replicas 1 \
  --detach=true \
  --mount type=bind,source=/etc/localtime,target=/etc/localtime \
  --endpoint-mode vip \
  newnius/hbase:1.2.6

Notice: the data inside the containers is not persisted, so it will be lost when a container restarts. See hbase.sh to view the full script.
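One way to keep the data across restarts is to mount a named Docker volume over the data directory. This is a hedged sketch: the container-side path /tmp/hadoop-root is an assumption about the default hadoop.tmp.dir in the base image, so check your Hadoop configuration before relying on it.

```shell
# Sketch (assumption): recreate the master service with a named volume
# mounted over the Hadoop data directory so data survives restarts.
# /tmp/hadoop-root is a guess at the default hadoop.tmp.dir in the image.
docker service create \
  --name hadoop-master \
  --network swarm-net \
  --hostname hadoop-master \
  --replicas 1 \
  --detach=true \
  --mount type=volume,source=hadoop-master-data,target=/tmp/hadoop-root \
  --mount type=bind,source=/etc/localtime,target=/etc/localtime \
  --endpoint-mode vip \
  newnius/hbase:1.2.6
```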

Start a proxy to access the HBase web UI (optional)

A Hadoop cluster involves so many hosts and ports that exposing all of them through port mappings quickly gets messy and difficult, so we use a proxy instead of port mapping. The proxy speaks the SOCKS5 protocol, which supports remote DNS, so we can visit services by hostname, e.g. http://hadoop-master:8088 to monitor Hadoop YARN.

docker service create \
  --replicas 1 \
  --name proxy_docker \
  --network swarm-net \
  -p 7001:7001 \
  newnius/docker-proxy

You then need to configure the SOCKS5 proxy in your browser.

After deployment, switch to the master node and start HBase with the command bin/start-hbase.sh. Soon after, we can see the cluster status at http://hadoop-master:16010.
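Once the UI is reachable, a quick smoke test from the HBase shell confirms the cluster works end to end. The table and column family names below are arbitrary examples; run this inside the hadoop-master container (e.g. via `docker exec`).

```shell
# Create a table, write and read a cell, then clean up.
hbase shell <<'EOF'
create 'smoke_test', 'cf'
put 'smoke_test', 'row1', 'cf:a', 'value1'
scan 'smoke_test'
disable 'smoke_test'
drop 'smoke_test'
EOF
```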

Conclusion

HBase indexes and stores its data in three levels, which makes it perform well when both storing and querying. The column-oriented model is also helpful when running analysis over particular columns.

Comparing HBase with Hive, HBase is better suited for OLTP (OnLine Transaction Processing), while Hive works better in the field of OLAP (OnLine Analytical Processing).
