Set Up a Distributed HBase Cluster in Docker

Abstract

Since Hadoop first appeared, its ecosystem has kept growing, and a great deal of software has been developed around it. Apache HBase and Apache Hive are two examples.

In this experiment, in order to learn these two pieces of software, we use HBase and Hive to continue the research on wuxia novels mentioned before.

Introduction

Apache HBase is a distributed, column-oriented database modeled after Google's Bigtable. It is designed to store very large datasets and answer queries on them with low latency.

Apache HBase runs on top of HDFS, so it has to be installed on an existing HDFS cluster. We again use Docker to build the HBase image.

If you don’t want to learn how to write a Dockerfile at this time, you can skip directly to Setup the cluster.

Build Docker image

In this part, we are going to build a fully distributed HBase cluster in Docker. We will skip the process of deploying the ZooKeeper cluster and assume that there is already one consisting of three nodes: zookeeper_node1, zookeeper_node2 and zookeeper_node3.

You can find the script to set up a ZooKeeper cluster here: zookeeper.sh
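If you would rather not rely on the external script, the three ZooKeeper services can be created roughly as below. This is a sketch, not taken from zookeeper.sh: the image name zookeeper:3.4 and the ZOO_MY_ID/ZOO_SERVERS environment variables follow the official ZooKeeper Docker image and are assumptions. The helper only prints the commands (a dry run); pipe its output to `sh` to actually create the services.

```shell
# Sketch (assumption): generate the `docker service create` command for
# ZooKeeper node $1 on the same overlay network used by the Hadoop services.
# ZOO_MY_ID / ZOO_SERVERS follow the official zookeeper image's conventions.
zk_service() {
  echo "docker service create" \
    "--name zookeeper_node$1" \
    "--network swarm-net" \
    "--replicas 1" \
    "--detach=true" \
    "--env ZOO_MY_ID=$1" \
    "--env ZOO_SERVERS='server.1=zookeeper_node1:2888:3888 server.2=zookeeper_node2:2888:3888 server.3=zookeeper_node3:2888:3888'" \
    "zookeeper:3.4"
}

# Print one command per node; pipe to `sh` to deploy for real.
for i in 1 2 3; do zk_service "$i"; done
```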

Dockerfile

FROM newnius/hadoop:2.7.4

USER root

# Install Apache HBase
RUN wget -O apache-hbase.tar.gz http://mirrors.ocf.berkeley.edu/apache/hbase/1.2.6/hbase-1.2.6-bin.tar.gz && \
    tar xzvf apache-hbase.tar.gz -C /usr/local/ && rm apache-hbase.tar.gz

# Create a soft link to make future upgrade transparent
RUN ln -s /usr/local/hbase-1.2.6 /usr/local/hbase

ENV HBASE_HOME /usr/local/hbase
ENV PATH $PATH:$HBASE_HOME/bin

# Add default conf files for 1 master, 1 backup master and 3 regionservers
ADD hbase-site.xml $HBASE_HOME/conf
ADD hbase-env.sh $HBASE_HOME/conf
ADD regionservers $HBASE_HOME/conf
ADD backup-masters $HBASE_HOME/conf

WORKDIR /usr/local/hbase

CMD ["/etc/bootstrap.sh", "-d"]

Configuration files

hbase-env.sh

export JAVA_HOME=/usr/lib/jvm/java-1.8-openjdk/
export HBASE_MANAGES_ZK=false

hbase-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://hadoop-master:8020/hbase</value>
  </property>
  <!--
  <property>
    <name>hbase.master</name>
    <value>hadoop-master:60000</value>
  </property>
  -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zookeeper_node1,zookeeper_node2,zookeeper_node3</value>
  </property>
  <property>
    <name>hbase.regionserver.dns.interface</name>
    <value>eth0</value>
  </property>
</configuration>

backup-masters

hadoop-slave1

regionservers

hadoop-slave1
hadoop-slave2
hadoop-slave3
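With the Dockerfile and the four configuration files (hbase-env.sh, hbase-site.xml, regionservers, backup-masters) in one directory, the image can now be built. The tag below mirrors the one used in the service definitions; the registry push is optional and assumes every swarm node can reach your registry.

```shell
# Build the HBase image from the directory containing the Dockerfile
# and the four conf files added above.
docker build -t newnius/hbase:1.2.6 .

# Optionally push it so every node in the swarm can pull the same image.
docker push newnius/hbase:1.2.6
```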

Setup the cluster

hadoop-master

docker service create \
  --name hadoop-master \
  --network swarm-net \
  --hostname hadoop-master \
  --replicas 1 \
  --detach=true \
  --mount type=bind,source=/mnt/share/hadoop/,target=/share \
  --mount type=bind,source=/etc/localtime,target=/etc/localtime \
  --endpoint-mode vip \
  newnius/hbase:1.2.6

hadoop-slave1

docker service create \
  --name hadoop-slave1 \
  --network swarm-net \
  --hostname hadoop-slave1 \
  --replicas 1 \
  --detach=true \
  --mount type=bind,source=/etc/localtime,target=/etc/localtime \
  --endpoint-mode vip \
  newnius/hbase:1.2.6

hadoop-slave2

docker service create \
  --name hadoop-slave2 \
  --network swarm-net \
  --hostname hadoop-slave2 \
  --replicas 1 \
  --detach=true \
  --mount type=bind,source=/etc/localtime,target=/etc/localtime \
  --endpoint-mode vip \
  newnius/hbase:1.2.6

hadoop-slave3

docker service create \
  --name hadoop-slave3 \
  --network swarm-net \
  --hostname hadoop-slave3 \
  --replicas 1 \
  --detach=true \
  --mount type=bind,source=/etc/localtime,target=/etc/localtime \
  --endpoint-mode vip \
  newnius/hbase:1.2.6

Notice: the data inside the containers is not persisted, so it will be lost when a container restarts. See hbase.sh to view the full script.
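One way to keep the data across restarts is to mount a named Docker volume over the data directory. This is a hedged sketch: the container-side path /tmp/hadoop-root is an assumption about the default hadoop.tmp.dir in the base image, so check your Hadoop configuration before relying on it.

```shell
# Sketch (assumption): recreate the master service with a named volume
# mounted over the Hadoop data directory so data survives restarts.
# /tmp/hadoop-root is a guess at the default hadoop.tmp.dir in the image.
docker service create \
  --name hadoop-master \
  --network swarm-net \
  --hostname hadoop-master \
  --replicas 1 \
  --detach=true \
  --mount type=volume,source=hadoop-master-data,target=/tmp/hadoop-root \
  --mount type=bind,source=/etc/localtime,target=/etc/localtime \
  --endpoint-mode vip \
  newnius/hbase:1.2.6
```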

Start a proxy to access the HBase web UI (optional)

A Hadoop cluster involves so many hosts and ports that exposing all of them through port mappings quickly gets messy and difficult, so we use a proxy instead of port mapping. The proxy speaks the SOCKS5 protocol, which supports remote DNS, so we can visit services by hostname, e.g. http://hadoop-master:8088 to monitor Hadoop YARN.

docker service create \
  --replicas 1 \
  --name proxy_docker \
  --network swarm-net \
  -p 7001:7001 \
  newnius/docker-proxy

You then need to configure the SOCKS5 proxy in your browser.

After deployment, switch to the master node and start HBase with the command bin/start-hbase.sh. Soon after, we can see the cluster status at http://hadoop-master:16010.
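Once the UI is reachable, a quick smoke test from the HBase shell confirms the cluster works end to end. The table and column family names below are arbitrary examples; run this inside the hadoop-master container (e.g. via `docker exec`).

```shell
# Create a table, write and read a cell, then clean up.
hbase shell <<'EOF'
create 'smoke_test', 'cf'
put 'smoke_test', 'row1', 'cf:a', 'value1'
scan 'smoke_test'
disable 'smoke_test'
drop 'smoke_test'
EOF
```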

Conclusion

HBase indexes and stores its data in three levels, which makes it perform well when both storing and querying. The column-oriented model is also helpful when running analysis over particular columns.

Comparing HBase with Hive, HBase is better suited for OLTP (OnLine Transaction Processing), while Hive works better in the field of OLAP (OnLine Analytical Processing).
