Setup Apache Hive in Docker

Abstract

Since the time Hadoop came up, the Hadoop ecosystem is getting larger and larger. There are so many softwares being developed around Hadoop. Apache HBase and Apache Hive are two of them.

In this expriment, for the purpose of learning these two softwares, we use HBase and Hive to continue our reseach on wuxia novels mentioned before.

A short video for this post

Introduction

Apache Hive is a data warehouse. Before Hive, if we want to analyze data in HDFS, we have to design many mapreduces to do that. Even if data is stored in a structured format, there is no straight ways to operate on it.

Apache Hive intergrates many mapreduces and encapsulates them as many operations. On top of these, a SQL-like query language called HiveQL is used to make the operations. This is rather easy to learn for those who works on relational data warehouses. We can use HiveQL to query data from HDFS (Hadoop Distributed File System) just like we are using RDBMS with SQL.

Apache Hive runs on the Hadoop cluster, so we have to install it on an existing Hdoop cluster. We still use Docker to create the image of Hive.

Build Docker image

If you don’t want to learn how to write a Dockerfile currently, you can directlly turn to Initialization & Start.

Dockerfile

# Based on Hadoop image
FROM newnius/hadoop:2.7.4

USER root

# Download Apche Hive-2.1.1
RUN wget -O apache-hive.tar.gz http://www-eu.apache.org/dist/hive/hive-2.1.1/apache-hive-2.1.1-bin.tar.gz && \
tar -xzf apache-hive.tar.gz -C /usr/local/ && rm apache-hive.tar.gz

# Create a soft link to make futher upgrade transparent
RUN ln -s /usr/local/apache-hive-2.1.1-bin /usr/local/hive

# Set environment of hive
ENV HIVE_HOME /usr/local/apache-hive-2.1.1-bin
ENV PATH $PATH:$HIVE_HOME/bin

# Add mysql driver and default config file
ADD mysql-connector-java-5.1.44-bin.jar $HIVE_HOME/lib
ADD hive-site.xml $HIVE_HOME/conf

# Rename bootstrap script in origin image
RUN mv /etc/bootstrap.sh /etc/start_hadoop.sh

ADD bootstrap.sh /etc/bootstrap.sh
ADD init_hive.sh /etc/init_hive.sh

CMD ["/etc/bootstrap.sh", "-d"]

Bootstrap script

#!/bin/bash

# Run hadoop first of all
bash /etc/start_hadoop.sh -bash

# start metastore
hive --service metastore

if [[ $1 == "-d" ]]; then
  while true; do sleep 1000; done
fi

if [[ $1 == "-bash" ]]; then
  /bin/bash
fi

Configuration File

Apache Hive stores the meta data of managed tables such as table definition in metastore based on derby (local mode) or mysql (distributed mode). In local mode, users can only run HiveQL in the node which Hive metastore is installed. To better use Hive, we choose to use mysql as backend store of metastore.

The following configuration files defines the which database to use and the connection information of it.

<configuration>
	<property>
		<name>javax.jdo.option.ConnectionURL</name>
		<value>jdbc:mysql://mysql:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
		<description>JDBC connect string for a JDBC metastore</description>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionDriverName</name>
		<value>com.mysql.jdbc.Driver</value>
		<description>Driver class name for a JDBC metastore</description>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionUserName</name>
		<value>root</value>
		<description>username to use against metastore database</description>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionPassword</name>
		<value>123456</value>
		<description>password to use against metastore database</description>
	</property>
	<property>
		<name>hive.metastore.uris</name>
		<value>thrift://hadoop-master:9083</value>
	</property>
	<property>
		<name>hive.metastore.schema.verification</name>
		<value>false</value>
	</property>
</configuration>

build docker image

1	docker build --tag newnius/hive:2.1.1 .

Initialization & Start Hive

We configured Hive to use Mysql as the metastore, so start mysql first.

docker service create \
	--name mysql \
	--replicas 1 \
	--network swarm-net \
	--endpoint-mode dnsrr \
	--detach=true \
	-e MYSQL_ROOT_PASSWORD=123456 \
	-e MYSQL_DATABASE=hive \
	mysql:5.7

Start a Hadoop cluster.

Follow How to quickly setup a Hadoop cluster in Docker to start a Hadoop Cluster.
If you already have one, you have to change the conf files located at /config/hadoop by mounting.

Start Hive.

Instead of running Hive with Hadoop nodes, we choose to run Hive in a seperate container thus it can be added/removed easily.

docker service create \
	--name hive \
	--hostname hive \
	--network swarm-net \
	--replicas 1 \
	--constraint node.role==manager \
	--endpoint-mode dnsrr \
	newnius/hive:2.1.1

Notice: all the data in containers are not persisted, so they will lose when restarts. see swarm_start_hive.sh to view full script

Before using Hive, we have to create a directory in HDFS for Hive managed data and initialize the metastore.

On HDFS namenode

hdfs dfs -mkdir /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /user/hive/warehouse

On Hive

schematool --dbType mysql --initSchema

# Run metastore after init
hive --service metastore

Insert data and Query

press hive to enter hive shell

Load data to Hive

CREATE TABLE Wuxia (
    word STRING,
    count DOUBLE
) row format delimited fields terminated by '\t';

1	LOAD DATA LOCAL INPATH '/tmp/word_occurrence.txt' INTO TABLE Wuxia;

you can download the file word_occurrence.txt

Make queries

Query words occured more than 300 times.

1	SELECT `word` from Wuxia WHERE `count` > 300;

Query the top 100 hottest words.

1	SELECT `word`,`count` from Wuxia WHERE `count` > 300 ORDER BY `count` DESC;

Conclusion

Querying data in HDFS with Apache Hive is very easy and clear. Although the performance of Hive is not remarkable due to limitations of Mapreduce, The ability of being able to process big volume of data can not be ignored.

Compare HBase with Hive, HBase is better for OLTP(OnLine Transaction Processing) while Hive works better in the field of OLAP(OnLine Analysis Processing).