How to quickly setup a Hadoop cluster in Docker

Althought Hadoop can be installed in Single-Node mode, the best way to learn Hadoop is setup a distributed cluster as it is in production environment. It is not easy to setup a Hadoop cluster in distributed mode in single personal computer, so docker comes out.

Docker is a virtualization technique based on linux namespace and cgroup, but it comsumes less extra resources compared with other virtualization techniques such as VirtualBox.

Docker provides the ability to package and run an application in a loosely isolated environment called a container. The isolation and security allow you to run many containers simultaneously on a given host. Containers are lightweight because they don’t need the extra load of a hypervisor, but run directly within the host machine’s kernel. This means you can run more containers on a given hardware combination than if you were using virtual machines. You can even run Docker containers within host machines that are actually virtual machines!

With the help of swarm, it is rather convenient to setup a cluster which has the ability of service discovery, load balance, retry on failure, fast deploy and so on.

Here is how to quickly setup a distributed Hadoop cluster in docker.

Install Docker (Requires 1.13 or later)

1
2
3
4
5
6
7
8
# install newest version
curl -fsSL https://get.docker.com | bash

# add current user to docker group so that it can execute docker commands
sudo usermod -aG docker $USER

# install bash-completion to support bash completion (optional)
sudo apt install bash-completion

log out current session (log out desktop if GUI) and re-login

1
2
3
4
5
# init a docker swarm cluster and listens on localhost
docker swarm init --advertise-addr 127.0.0.1

# create an overlay network
docker network create --driver overlay swarm-net

Create Dockerfile

If you don’t want to learn how to write a Dockerfile currently, you can directlly turn to Start Hadoop Cluster.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
FROM alpine:3.6

MAINTAINER Newnius <newnius.cn@gmail.com>

# use root directlly as there is no security issues in containers (use root or not)
USER root

# install required packages
RUN apk add --no-cache openssh openssl openjdk8-jre rsync bash procps

# set JAVA_HOME
ENV JAVA_HOME /usr/lib/jvm/java-1.8-openjdk
ENV PATH $PATH:$JAVA_HOME/bin

# configure passwordless SSH
RUN ssh-keygen -q -N "" -t dsa -f /etc/ssh/ssh_host_dsa_key
RUN ssh-keygen -q -N "" -t rsa -f /etc/ssh/ssh_host_rsa_key
RUN ssh-keygen -q -N "" -t rsa -f /root/.ssh/id_rsa
RUN cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys

ADD ssh_config /root/.ssh/config
RUN chmod 600 /root/.ssh/config
RUN chown root:root /root/.ssh/config

RUN echo "Port 2122" >> /etc/ssh/sshd_config

# install Hadoop
RUN wget -O hadoop.tar.gz https://archive.apache.org/dist/hadoop/common/hadoop-2.7.4/hadoop-2.7.4.tar.gz && \
tar -xzf hadoop.tar.gz -C /usr/local/ && rm hadoop.tar.gz

# create a soft link to make it transparent when upgrade Hadoop
RUN ln -s /usr/local/hadoop-2.7.4 /usr/local/hadoop

# set Hadoop enviroments
ENV HADOOP_HOME /usr/local/hadoop
ENV PATH $PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

ENV HADOOP_PREFIX $HADOOP_HOME
ENV HADOOP_COMMON_HOME $HADOOP_HOME
ENV HADOOP_HDFS_HOME $HADOOP_HOME
ENV HADOOP_MAPRED_HOME $HADOOP_HOME
ENV HADOOP_YARN_HOME $HADOOP_HOME
ENV HADOOP_CONF_DIR $HADOOP_HOME/etc/hadoop
ENV YARN_CONF_DIR $HADOOP_PREFIX/etc/hadoop

# add default config files which has one master and three slaves
ADD core-site.xml $HADOOP_HOME/etc/hadoop/core-site.xml
ADD hdfs-site.xml $HADOOP_HOME/etc/hadoop/hdfs-site.xml
ADD mapred-site.xml $HADOOP_HOME/etc/hadoop/mapred-site.xml
ADD yarn-site.xml $HADOOP_HOME/etc/hadoop/yarn-site.xml
ADD slaves $HADOOP_HOME/etc/hadoop/slaves

# update JAVA_HOME and HADOOP_CONF_DIR in hadoop-env.sh
RUN sed -i "/^export JAVA_HOME/ s:.*:export JAVA_HOME=${JAVA_HOME}\nexport HADOOP_HOME=${HADOOP_HOME}\nexport HADOOP_PREFIX=${HADOOP_PREFIX}:" ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh
RUN sed -i '/^export HADOOP_CONF_DIR/ s:.*:export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop/:' $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh

WORKDIR $HADOOP_HOME

ADD bootstrap.sh /etc/bootstrap.sh

CMD ["/etc/bootstrap.sh", "-d"]

Create bootstrap script and Config files

bootstrap.sh

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
#!/bin/bash

: ${HADOOP_PREFIX:=/usr/local/hadoop}

$HADOOP_PREFIX/etc/hadoop/hadoop-env.sh

rm /tmp/*.pid

# installing libraries if any - (resource urls added comma separated to the ACP system variable)
cd $HADOOP_PREFIX/share/hadoop/common ; for cp in ${ACP//,/ }; do echo == $cp; curl -LO $cp ; done; cd -

# replace config files if provided
cp /mnt/hadoop-config/* $HADOOP_PREFIX/etc/hadoop/

# start sshd
/usr/sbin/sshd

## stop all in case master starts far behind
$HADOOP_PREFIX/sbin/stop-yarn.sh
$HADOOP_PREFIX/sbin/stop-dfs.sh

$HADOOP_PREFIX/sbin/start-dfs.sh
$HADOOP_PREFIX/sbin/start-yarn.sh
$HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh start historyserver

if [[ $1 == "-d" ]]; then
while true; do sleep 1000; done
fi

if [[ $1 == "-bash" ]]; then
/bin/bash
fi

core-site.xml

1
2
3
4
5
6
7
8
9
10
11
12
13
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-master:8020</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-master:8020</value>
</property>
</configuration>

hdfs-site.xml

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>

<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop-slave1:50090</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>hadoop-master:50070</value>
</property>

<property>
<name>dfs.datanode.max.transfer.threads</name>
<value>8192</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>

mapred-site.xml

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop-master:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop-master:19888</value>
</property>
</configuration>

yarn-site.xml

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
<?xml version="1.0"?>
<!-- Site specific YARN configuration properties -->
<configuration>
<property>
<name>yarn.application.classpath</name>
<value>/usr/local/hadoop/etc/hadoop, /usr/local/hadoop/share/hadoop/common/*, /usr/local/hadoop/share/hadoop/common/lib/*, /usr/local/hadoop/share/hadoop/hdfs/*, /usr/local/hadoop/share/hadoop/hdfs/lib/*, /usr/local/hadoop/share/hadoop/mapreduce/*, /usr/local/hadoop/share/hadoop/mapreduce/lib/*, /usr/local/hadoop/share/hadoop/yarn/*, /usr/local/hadoop/share/hadoop/yarn/lib/*</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>2</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
</configuration>

slaves

1
2
3
hadoop-slave1
hadoop-slave2
hadoop-slave3

build docker image

1
docker build --tag newnius/hadoop:2.7.4 .

Start Hadoop Cluster

Start master node

1
2
3
4
5
6
7
docker service create \
--name hadoop-master \
--network swarm-net \
--hostname hadoop-master \
--replicas 1 \
--endpoint-mode dnsrr \
newnius/hadoop:2.7.4

Start 3 slave nodes

1
2
3
4
5
6
7
docker service create \
--name hadoop-slave1 \
--network swarm-net \
--hostname hadoop-slave1 \
--replicas 1 \
--endpoint-mode dnsrr \
newnius/hadoop:2.7.4
1
2
3
4
5
6
7
docker service create \
--name hadoop-slave2 \
--network swarm-net \
--hostname hadoop-slave2 \
--replicas 1 \
--endpoint-mode dnsrr \
newnius/hadoop:2.7.4
1
2
3
4
5
6
7
docker service create \
--name hadoop-slave3 \
--network swarm-net \
--hostname hadoop-slave3 \
--replicas 1 \
--endpoint-mode dnsrr \
newnius/hadoop:2.7.4

Notice: all the data in containers are not persisted, so they will lose when restarts. see hadoop.sh to view full script

start a proxy to access Hadoop web ui

There are so many hosts and ports in Hadoop cluster that it is messy and difficult to expose all of of them. So using a proxy to replace port mapping. The proxy uses protocal of socks5 which supports remote dns so that we can visit by the hostname such as http://hadoop-master:8088 to monitor Hadoop yarn.

It is required to set socks5 proxy in browser.

1
2
3
4
5
6
docker service create \
--replicas 1 \
--name proxy_docker \
--network swarm-net \
-p 7001:7001 \
newnius/docker-proxy

Run WordCount

How to enter containers

Notice: All the following commands are run in containers. Use docker exec -it service-name.x.xxxxx bash to enter.

For example, if we want to enter hadoop-master

1
docker exec -it hadoop-master /* no space here, and then press <tab>, the full name will appear */ bash

format namenode

In the first time we run Hadoop cluster, it is required to format hdfs in namenode.

1
2
3
4
5
6
7
8
9
10
11
12
13
# stop all Hadoop processes
sbin/stop-yarn.sh
sbin/stop-dfs.sh

# remove old files of hdfs in host filesystem in all nodes
rm -rf /tmp

# format namenode
bin/hadoop namenode -format

# start yarn and dfs nodes
sbin/start-dfs.sh
sbin/start-yarn.sh

Run a test

1
2
3
# prepare input files to hdfs:///user/root/input
bin/hdfs dfs -mkdir -p /user/root/input
bin/hdfs dfs -put etc/hadoop/* /user/root/input
1
2
# Run WordCount
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar wordcount input output

Results

  • Nodes
    Nodes

  • HDFS
    HDFS

  • Run WordCount
    Run WordCount

  • View Results
    View Results

  • Applications
    Applications

  • Detail
    Detail

References

Hadoop Cluster Setup

sequenceiq/hadoop-docker - Docker Hub

Get Docker CE for Debian | Docker Documentation

Swarm mode overview | Docker Documentation