Abstract
Since Hadoop emerged, its ecosystem has kept growing, and a great deal of software has been developed around it. Apache HBase and Apache Hive are two examples.
In this experiment, in order to learn these two systems, we use HBase and Hive to continue our research on wuxia novels mentioned before.
Introduction
Apache HBase is a distributed, column-oriented database modeled after Google's Bigtable. It is designed to store big data and answer queries over it with low latency.
Apache HBase runs on top of HDFS, so we have to install it on an existing HDFS cluster. We still use Docker to build the HBase image.
Zookeeper cluster
In this part, we are going to build a fully distributed HBase cluster in Docker. We will skip the process of deploying the ZooKeeper cluster and assume that there is already a cluster of 3 nodes: zookeeper-node1, zookeeper-node2 and zookeeper-node3.
You can find the scripts to set up a ZooKeeper cluster in scripts/zookeeper.
If you are not familiar with HBase, you'd better follow the guide to set up the necessary ZooKeeper cluster & HDFS, so you don't have to care about the configuration files at this time.
Setup HDFS cluster
To quickly set up the HDFS cluster, you can read the post How to quickly setup a Hadoop cluster in Docker, or, for a production environment, Setup a distributed Hadoop/HDFS cluster with docker.
Alternatively, you can just run the scripts in scripts/hadoop to create the cluster.
Build Docker image
If you don’t want to learn how to write a Dockerfile at this time, you can turn directly to Setup the cluster.
Dockerfile
```dockerfile
FROM alpine:3.8
```
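Only the first line of the original Dockerfile survives above. A minimal sketch of what such an image might look like follows; the HBase version, download URL and paths are assumptions, not taken from the original:

```dockerfile
FROM alpine:3.8

# HBase needs a JRE; its scripts need bash
RUN apk add --no-cache bash openjdk8-jre

# Version and mirror are assumptions -- adjust to your environment
ENV HBASE_VERSION=2.1.0 \
    JAVA_HOME=/usr/lib/jvm/java-1.8-openjdk

RUN wget -qO- https://archive.apache.org/dist/hbase/2.1.0/hbase-2.1.0-bin.tar.gz \
      | tar xz -C /opt \
 && ln -s /opt/hbase-2.1.0 /opt/hbase

WORKDIR /opt/hbase
# Copy in the configuration files discussed below
COPY conf/ /opt/hbase/conf/
```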
Configuration files
hbase-env.sh
```shell
# Add these two lines to the default env file
```
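The block above is truncated after its comment. In a setup with an external ZooKeeper cluster like ours, the two added lines are most likely these (the JAVA_HOME path is an assumption for an Alpine-based image):

```shell
# Point HBase at the JRE inside the container (path assumed for Alpine's OpenJDK 8)
export JAVA_HOME=/usr/lib/jvm/java-1.8-openjdk
# Do not let HBase manage ZooKeeper; we use the external zookeeper-node1..3 cluster
export HBASE_MANAGES_ZK=false
```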
hbase-site.xml
```xml
<?xml version="1.0"?>
```
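The full hbase-site.xml is cut off above. For a fully distributed cluster using the HDFS and ZooKeeper nodes named earlier, the key properties would look roughly like this (a sketch; the namenode host/port hdfs://hadoop-master:9000 is an assumption):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Store HBase data in HDFS; namenode host and port are assumptions -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://hadoop-master:9000/hbase</value>
  </property>
  <!-- Run in fully distributed mode -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <!-- The external ZooKeeper ensemble from the previous section -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zookeeper-node1,zookeeper-node2,zookeeper-node3</value>
  </property>
</configuration>
```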
Do not update the HOSTNAME value, as it will be used later.
backup-masters
```
hbase-slave1
```
regionservers

```
hbase-slave1
```
Setup the cluster
hbase-master
```shell
docker service create \
```
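The service definitions in this section are all truncated to their first line. A sketch of what the hbase-master service might look like follows; the overlay-network name, image tag and keep-alive command are assumptions, and the slave services would follow the same pattern with their own hostnames:

```shell
docker service create \
  --name hbase-master \
  --hostname hbase-master \
  --network hadoop-net \
  --replicas 1 \
  hbase:latest \
  tail -f /dev/null   # keep the container alive; HBase itself is started manually later
```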
hbase-slave1
```shell
docker service create \
```
hbase-slave2
```shell
docker service create \
```
hbase-slave3
```shell
docker service create \
```
Notice: data inside the containers is not persisted, so it will be lost when a container restarts. See hbase.sh for the full script.
Then enter the container by
```shell
docker exec -it hbase-master.1.$(docker service ps \
```
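The command above is cut off mid-line. It presumably resolves the task ID of the running replica to build the container name, along these lines (a sketch; the filter and the choice of bash are assumptions):

```shell
docker exec -it \
  hbase-master.1.$(docker service ps -q -f desired-state=running hbase-master) \
  bash
```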
If you have multiple nodes, you have to execute docker exec -it hbase-master<TAB> on the node where the container is running.
```shell
bin/start-hbase.sh
```
Validation
Inside the hbase-master container, execute
```shell
bin/hbase shell
```
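Inside the shell, a few basic commands are enough to validate that the cluster works end to end (the table and column-family names here are arbitrary examples, not from the original):

```
status
create 'test', 'cf'
put 'test', 'row1', 'cf:a', 'value1'
scan 'test'
```

status should report one active master, the backup master and the region servers; scan should print back the row just written.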
Start a proxy to access the HBase web UI (optional)
There are so many hosts and ports in a Hadoop cluster that it is messy and difficult to expose all of them, so we use a proxy instead of port mappings. The proxy speaks the SOCKS5 protocol, which supports remote DNS, so we can visit services by hostname, such as http://hadoop-master:8088 to monitor Hadoop YARN.
```shell
docker service create \
```
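The proxy service definition is truncated. It presumably joins the cluster's overlay network and publishes a SOCKS5 port to the outside, roughly like this (the image serjs/go-socks5-proxy, the port and the network name are all assumptions):

```shell
docker service create \
  --name socks5-proxy \
  --network hadoop-net \
  --publish 1080:1080 \
  serjs/go-socks5-proxy
```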
You then need to configure the SOCKS5 proxy in your browser.
After deployment, switch to the master node and start HBase with the command bin/start-hbase.sh; soon after, we can see the cluster status at http://hadoop-master:16010.
Access HBase outside the swarm
Start another service to expose the Thrift port 9090.
```shell
docker service create \
```
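A sketch of such a service follows (image, network and options are assumptions); it runs the HBase Thrift server in the foreground and publishes port 9090 outside the swarm:

```shell
docker service create \
  --name hbase-thrift \
  --network hadoop-net \
  --publish 9090:9090 \
  hbase:latest \
  bin/hbase thrift start
```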
```python
import happybase
```
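The Python snippet is cut off after the import. With the happybase client it would presumably look something like this (host, table and column-family names are assumptions; the transport and protocol options must match the server-side Thrift settings noted below):

```python
import happybase

# Connect through the Thrift service exposed by the swarm.
# 'framed'/'compact' must match hbase.regionserver.thrift.framed/.compact
connection = happybase.Connection(
    host='hbase-thrift', port=9090,
    transport='framed', protocol='compact',
)

table = connection.table('test')
table.put(b'row1', {b'cf:a': b'value1'})
print(table.row(b'row1'))
```

Note that happybase keys and values are bytes, and columns are addressed as family:qualifier.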
Notice: by default in this setup, hbase.regionserver.thrift.framed and hbase.regionserver.thrift.compact are set to true for security reasons.
Conclusion
HBase uses a three-level index to locate and store data, which makes it perform well when storing and querying. The column-oriented model is also helpful when doing analysis on certain columns.
Comparing HBase with Hive, HBase is better suited for OLTP (OnLine Transaction Processing), while Hive works better in the field of OLAP (OnLine Analytical Processing).
References
Apache HBase ™ Reference Guide
Running MapReduce on HBase gives Zookeeper error
Get all values of all rows in Hbase using Java
Problem configuring a JDBC source in an XML file: The reference to entity “characterEncoding” must end with the ‘;’ delimiter