Setup a distributed Hadoop/HDFS cluster with docker

Foreword

In this post, you will learn how to quickly steup a distributed Hadoop cluster in docker swarm, and hwo to expose the Web UI to users, how to access HDFS outside the swarm.

You can find the reources (scripts, slides, configuration files etc.) mentioned in this post here. Besides, I have recorded a video on Youtube for this post.

Environment

In this experiment, we use 5 nodes to deploy our Hadoop cluster. The operation system of them is CentOS Linux release 7.6.1810 (Core) and the docker version is Docker version 18.09.1, build 4c52b90.

hostname	alias	ip	role
XL-slave021	hadoop-node021	192.168.1.21	Name Node, Resource Manager
XL-slave022	hadoop-node022	192.168.1.22	Data Node, Secondary NameNode, Node Manager
XL-slave023	hadoop-node023	192.168.1.23	Data Node, Node Manager
XL-slave024	hadoop-node024	192.168.1.24	Data Node, Node Manager
XL-slave025	hadoop-node025	192.168.1.25	Data Node, Node Manager

The alias means the hostname in the Hadoop cluster Due the limitation that we cannot have same (host)name in docker swarm and we may want to deploy other services on the same node, so we’d better choose another name.

Install Docker

If you haven’t installed docker, we have to install docker.

Install docker on all the nodes.

# install latest docker
curl -fsSL https://get.docker.com | sh

# Add current user to docker group to interact with docker
sudo usermod -aG docker $USER

# Start docker
sudo systemctl start docker

Create a docker swarm

Execute on XL-slave021 to create a swarm

1	docker swarm init --listen-addr 192.168.1.21

Create a docker overlay network called hadoop-net

1	docker network create --driver overlay hadoop-net

Execute on the others to add them to the newly created swarm

1	docker swarm join --token SWMTKN-1—THE-LONG-STRING 192.168.1.21:2377

Note: Change the IPs and token to your own

Prepare the environment

Execute on all the nodes to download (ahead) the Hadoop docker image.

1	docker pull newnius/hadoop:2.7.4

Create dir /data if you don’t have it or not writeable

1 2	sudo mkdir -p /data sudo chmod 777 /data

And then execute on all the nodes to create dir for data persist.

1 2	mkdir -p /data/hadoop/hdfs/ mkdir -p /data/hadoop/logs/

Configure Hadoop

Create a dir named config (the name must be config)

1	mkdir config

There are five configuration files to be placed in this dir.

config/
├── core-site.xml
├── hdfs-site.xml
├── mapred-site.xml
├── slaves
└── yarn-site.xml

Refer to this link to access the sample files and update the alias to your own.

Then distribute the configuration files to all the nodes.

scp -r config/ XL-slave021:/data/hadoop/
scp -r config/ XL-slave022:/data/hadoop/
scp -r config/ XL-slave023:/data/hadoop/
scp -r config/ XL-slave024:/data/hadoop/
scp -r config/ XL-slave025:/data/hadoop/

Bring up the nodes of Hadoop

Execute on XL-slave021 to bring up all the nodes

docker service create \
        --name {{ Alias }} \
        --hostname {{ Alias }} \
        --constraint node.hostname=={{ Hostname }} \
        --network hadoop-net \
        --endpoint-mode dnsrr \
        --mount type=bind,src=/data/hadoop/config,dst=/config/hadoop \
        --mount type=bind,src=/data/hadoop/hdfs,dst=/tmp/hadoop-root \
        --mount type=bind,src=/data/hadoop/logs,dst=/usr/local/hadoop/logs \
        newnius/hadoop:2.7.4

Note: Change {\{ Alias }} to hadoop-node021 - hadoop-node025

Note: Change {\{ Hostname }} to XL-slave021 - XL-slave025

Execute 5 time, one for each

Start Hadoop services

Execute on XL-slave021 to enter master node

1 2	docker exec -it hadoop-node021.1.$(docker service ps \ hadoop-node021 --no-trunc \| grep Running \| awk '{print $1}' ) bash

In the container, execute

# Stop all Hadoop processes
sbin/stop-yarn.sh
sbin/stop-dfs.sh

# Format Name Node
bin/hadoop namenode -format

# Start yarn and hdfs service
sbin/start-dfs.sh
sbin/start-yarn.sh

Validate

Let’s validate the setup by running a mapreduce job.

In the container, execute

# Prepare input files to hdfs:///user/root/input
bin/hdfs dfs -mkdir -p /user/root/input
bin/hdfs dfs -put etc/hadoop/* /user/root/input

# Run build-in WordCount
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar wordcount input output

# View output
bin/hdfs dfs -ls output
bin/hdfs dfs -cat output/part-r-00000 | tail

Cheers !

Now, you have a real distributed Hadoop cluster!

But …

How to access the WebUI ?

How to access HDFS outside the swarm ?

If you don’t care about the problems above, you can just stop here.

Expose the HDFS

We cannot simply publish the ports as it conflicts with endpoint-mode(dnsrr) and there will be some problems in vip mode.

But we can start another container to publish and forward the traffic.

To access HDFS outside the swarm, we have to expose port 50070 of Name Node and 50075 of Data Nodes.

Just execute on XL-slave021 for 5 times to start 5 forwarders.

docker service create \
        --name {{ alias }}-forwarder \
        --constraint node.hostname=={{ hostname }} \
        --network hadoop-net \
        --env REMOTE_HOST={{ alias }} \
        --env REMOTE_PORT={{ port }} \
        --env LOCAL_PORT={{ port }} \
        --publish mode=host,published={{ port }},target={{ port }} \
        newnius/port-forward

Note: Change {\{ Alias }} to hadoop-node021 - hadoop-node025

Note: Change {\{ Hostname }} to XL-slave021 - XL-slave025

Note: {\{ port }} of Name Node is 50070, and port of Data Node is 50075

192.168.1.21  hadoop-node021
192.168.1.22  hadoop-node022
192.168.1.23  hadoop-node023
192.168.1.24  hadoop-node024
192.168.1.25  hadoop-node025

Open a python shell to read data from HDFS

from hdfs import *

client = Client("http://hadoop-node021:50070", root="/")
client.list("/user/root/output")
client.download("/user/root/output","/tmp")

import os
os.system("ls /tmp/")

Expose the WebUI

There are so many nodes and ports to expose, it is not a good idea to expose the ports directly.

Instead, we can provide a proxy so that we can access the WebUI.

Execute on XL-slave021 to start one socks5 proxy

docker service create \
	--name hadoop-proxy \
	--hostname hadoop-proxy \
	--network hadoop-net \
	--replicas 1 \
	--detach=true \
	--publish 7001:7001 \
	newnius/docker-proxy

And set your browser to use this socks5 proxy.

Refer to this page for how to configuring

Open your broswer, and visit http://hadoop-node021:8088 and you shall see the WebUI.

That’s it !

Things you have to know if you use this for production use.

Take care of ssh keys

Generate the ssh files yourself and mount them as them may change when the docker image is rebuilt.

Modify the configuration files

Customize the configuration files in your case, for the default are for presentation.

Just put them under /data/hadoop/config/ and it will replace the origin files if exists.