How to quickly setup a Hadoop cluster in Docker

Foreword

This post mainly aims to show you how to build a docker image of Hadoop and how to setup a distributed Hadoop cluster (only) for experiment use (even on single machine).

If you want to deploy a large scale cluster in production, you can read Setup a distributed Hadoop cluster with docker for more information.

A video for this post

Background

Althought Hadoop can be installed in Single-Node mode, the best way to learn Hadoop is setup a distributed cluster as it is in production environment. It is not easy to setup a Hadoop cluster in distributed mode in single personal computer, so docker comes out.

Docker is a virtualization technique based on linux namespace and cgroup, but it comsumes less extra resources compared with other virtualization techniques such as VirtualBox.

Docker provides the ability to package and run an application in a loosely isolated environment called a container. The isolation and security allow you to run many containers simultaneously on a given host. Containers are lightweight because they don’t need the extra load of a hypervisor, but run directly within the host machine’s kernel. This means you can run more containers on a given hardware combination than if you were using virtual machines. You can even run Docker containers within host machines that are actually virtual machines!

With the help of swarm, it is rather convenient to setup a cluster which has the ability of service discovery, load balance, retry on failure, fast deploy and so on.

Here is how to quickly setup a distributed Hadoop cluster in docker.

Install Docker (Requires 1.13 or later)

1
2
3
4
5
6
7
8
# install newest version
curl -fsSL https://get.docker.com | bash

# add current user to docker group so that it can execute docker commands
sudo usermod -aG docker $USER

# Start docker
sudo systemctl start docker

logout current session and re-login to take effect.

1
2
3
4
5
# init a docker swarm cluster and listens on localhost
docker swarm init --advertise-addr 127.0.0.1

# create an overlay network
docker network create --driver overlay swarm-net

Create Dockerfile

If you don’t want to learn how to write a Dockerfile currently, you can directlly turn to Start Hadoop Cluster.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
FROM alpine:3.6

MAINTAINER Newnius <newnius.cn@gmail.com>

# use root directlly as there is no security issues in containers (use root or not)
USER root

# install required packages
RUN apk add --no-cache openssh openssl openjdk8-jre rsync bash procps nss

# set JAVA_HOME
ENV JAVA_HOME /usr/lib/jvm/java-1.8-openjdk
ENV PATH $PATH:$JAVA_HOME/bin

# configure passwordless SSH
RUN ssh-keygen -q -N "" -t dsa -f /etc/ssh/ssh_host_dsa_key
RUN ssh-keygen -q -N "" -t rsa -f /etc/ssh/ssh_host_rsa_key
RUN ssh-keygen -q -N "" -t rsa -f /root/.ssh/id_rsa
RUN cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys

ADD ssh_config /root/.ssh/config
RUN chmod 600 /root/.ssh/config
RUN chown root:root /root/.ssh/config

RUN echo "Port 2122" >> /etc/ssh/sshd_config

RUN passwd -u root

# install Hadoop
RUN wget -O hadoop.tar.gz https://archive.apache.org/dist/hadoop/common/hadoop-2.7.4/hadoop-2.7.4.tar.gz && \
tar -xzf hadoop.tar.gz -C /usr/local/ && rm hadoop.tar.gz

# create a soft link to make it transparent when upgrade Hadoop
RUN ln -s /usr/local/hadoop-2.7.4 /usr/local/hadoop

# set Hadoop enviroments
ENV HADOOP_HOME /usr/local/hadoop
ENV PATH $PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

ENV HADOOP_PREFIX $HADOOP_HOME
ENV HADOOP_COMMON_HOME $HADOOP_HOME
ENV HADOOP_HDFS_HOME $HADOOP_HOME
ENV HADOOP_MAPRED_HOME $HADOOP_HOME
ENV HADOOP_YARN_HOME $HADOOP_HOME
ENV HADOOP_CONF_DIR $HADOOP_HOME/etc/hadoop
ENV YARN_CONF_DIR $HADOOP_PREFIX/etc/hadoop

# add default config files which has one master and three slaves
ADD core-site.xml $HADOOP_HOME/etc/hadoop/core-site.xml
ADD hdfs-site.xml $HADOOP_HOME/etc/hadoop/hdfs-site.xml
ADD mapred-site.xml $HADOOP_HOME/etc/hadoop/mapred-site.xml
ADD yarn-site.xml $HADOOP_HOME/etc/hadoop/yarn-site.xml
ADD slaves $HADOOP_HOME/etc/hadoop/slaves

# update JAVA_HOME and HADOOP_CONF_DIR in hadoop-env.sh
RUN sed -i "/^export JAVA_HOME/ s:.*:export JAVA_HOME=${JAVA_HOME}\nexport HADOOP_HOME=${HADOOP_HOME}\nexport HADOOP_PREFIX=${HADOOP_PREFIX}:" ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh
RUN sed -i '/^export HADOOP_CONF_DIR/ s:.*:export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop/:' $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh

WORKDIR $HADOOP_HOME

ADD bootstrap.sh /etc/bootstrap.sh

CMD ["/etc/bootstrap.sh", "-d"]

The required files are located in Hadoop

Create bootstrap script and Config files

bootstrap.sh

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
#!/bin/bash

: ${HADOOP_PREFIX:=/usr/local/hadoop}

$HADOOP_PREFIX/etc/hadoop/hadoop-env.sh

rm /tmp/*.pid

# installing libraries if any - (resource urls added comma separated to the ACP system variable)
cd $HADOOP_PREFIX/share/hadoop/common ; for cp in ${ACP//,/ }; do echo == $cp; curl -LO $cp ; done; cd -

# replace config files if provided
cp /mnt/hadoop-config/* $HADOOP_PREFIX/etc/hadoop/

# start sshd
/usr/sbin/sshd

if [[ $1 == "-d" ]]; then
while true; do sleep 1000; done
fi

if [[ $1 == "-bash" ]]; then
/bin/bash
fi

core-site.xml

1
2
3
4
5
6
7
8
9
10
11
12
13
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-master:8020</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-master:8020</value>
</property>
</configuration>

hdfs-site.xml

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>

<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop-slave1:50090</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>hadoop-master:50070</value>
</property>

<property>
<name>dfs.datanode.max.transfer.threads</name>
<value>8192</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>

mapred-site.xml

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop-master:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop-master:19888</value>
</property>
</configuration>

yarn-site.xml

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
<?xml version="1.0"?>
<!-- Site specific YARN configuration properties -->
<configuration>
<property>
<name>yarn.application.classpath</name>
<value>/usr/local/hadoop/etc/hadoop, /usr/local/hadoop/share/hadoop/common/*, /usr/local/hadoop/share/hadoop/common/lib/*, /usr/local/hadoop/share/hadoop/hdfs/*, /usr/local/hadoop/share/hadoop/hdfs/lib/*, /usr/local/hadoop/share/hadoop/mapreduce/*, /usr/local/hadoop/share/hadoop/mapreduce/lib/*, /usr/local/hadoop/share/hadoop/yarn/*, /usr/local/hadoop/share/hadoop/yarn/lib/*</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>2</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
</configuration>

slaves

1
2
3
hadoop-slave1
hadoop-slave2
hadoop-slave3

build docker image

1
docker build --tag newnius/hadoop:2.7.4 .

Start Hadoop Cluster

Start master node

1
2
3
4
5
6
7
8
docker service create \
--name hadoop-master \
--network swarm-net \
--hostname hadoop-master \
--constraint node.role==manager \
--replicas 1 \
--endpoint-mode dnsrr \
newnius/hadoop:2.7.4

Start 3 slave nodes

1
2
3
4
5
6
7
docker service create \
--name hadoop-slave1 \
--network swarm-net \
--hostname hadoop-slave1 \
--replicas 1 \
--endpoint-mode dnsrr \
newnius/hadoop:2.7.4
1
2
3
4
5
6
7
docker service create \
--name hadoop-slave2 \
--network swarm-net \
--hostname hadoop-slave2 \
--replicas 1 \
--endpoint-mode dnsrr \
newnius/hadoop:2.7.4
1
2
3
4
5
6
7
docker service create \
--name hadoop-slave3 \
--network swarm-net \
--hostname hadoop-slave3 \
--replicas 1 \
--endpoint-mode dnsrr \
newnius/hadoop:2.7.4

Notice: all the data in containers are not persisted, so they will lose when restarts. browse start_hadoop.sh for full script

start a proxy to access Hadoop web ui

There are so many hosts and ports in Hadoop cluster that it is messy and difficult to expose all of of them. So using a proxy to replace port mapping. The proxy uses protocal of socks5 which supports remote dns so that we can visit by the hostname such as http://hadoop-master:8088 to monitor Hadoop yarn.

It is required to set socks5 proxy in browser.

1
2
3
4
5
6
docker service create \
--replicas 1 \
--name proxy_docker \
--network swarm-net \
-p 7001:7001 \
newnius/docker-proxy

Run WordCount

How to enter containers

__Notice: All the following commands are run in containers.

Execute following command to enter the master nodes

1
2
docker exec -it hadoop-master.1.$(docker service ps \
hadoop-master --no-trunc | tail -n 1 | awk '{print $1}' ) bash

format namenode

In the first time we run Hadoop cluster, it is required to format hdfs in namenode.

1
2
3
4
5
6
7
8
9
10
# stop all Hadoop processes
sbin/stop-yarn.sh
sbin/stop-dfs.sh

# format namenode
bin/hadoop namenode -format

# start yarn and dfs nodes
sbin/start-dfs.sh
sbin/start-yarn.sh

Run a test

1
2
3
# prepare input files to hdfs:///user/root/input
bin/hdfs dfs -mkdir -p /user/root/input
bin/hdfs dfs -put etc/hadoop/* /user/root/input
1
2
3
# Run WordCount
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar \
wordcount input output

Results

d37911bd8eff36c6c21de51163243658.md.png

45b2676f739576b759d66a8757a51c5a.md.png

8bd35d14cb4aceaf254a7f2a71e48cc0.md.png

2992c1dbcaf94f799e49319a188f4eae.md.png

118b272f53a4722cd01e71b50422a394.md.png

c944731613f850cd5ac9c532a936d92d.md.png

References

Hadoop Cluster Setup

Get Docker CE for Debian | Docker Documentation

Swarm mode overview | Docker Documentation