Foreword
In this post, you will learn how to quickly set up a distributed Hadoop cluster in Docker Swarm, how to expose the Web UI to users, and how to access HDFS from outside the swarm.
You can find the resources (scripts, slides, configuration files, etc.) mentioned in this post here. Besides, I have recorded a video on YouTube for this post.
Environment
In this experiment, we use 5 nodes to deploy our Hadoop cluster. Their operating system is CentOS Linux release 7.6.1810 (Core), and the Docker version is Docker version 18.09.1, build 4c52b90.
| hostname | alias | IP | role |
| --- | --- | --- | --- |
| XL-slave021 | hadoop-node021 | 192.168.1.21 | Name Node, Resource Manager |
| XL-slave022 | hadoop-node022 | 192.168.1.22 | Data Node, Secondary NameNode, Node Manager |
| XL-slave023 | hadoop-node023 | 192.168.1.23 | Data Node, Node Manager |
| XL-slave024 | hadoop-node024 | 192.168.1.24 | Data Node, Node Manager |
| XL-slave025 | hadoop-node025 | 192.168.1.25 | Data Node, Node Manager |
- The alias is the hostname used inside the Hadoop cluster. Since we cannot reuse the same (host)name within a Docker Swarm, and we may want to deploy other services on the same nodes, it is better to choose a different name.
Install Docker
If you haven't installed Docker yet, install it on all the nodes first.
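One quick way to install the latest Docker on CentOS 7 is the official convenience script (a sketch; your environment may call for a pinned version from the yum repo instead):

```bash
# install latest docker via the official convenience script
curl -fsSL https://get.docker.com | sh
# start docker and enable it at boot
sudo systemctl start docker
sudo systemctl enable docker
```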
Create a docker swarm
Execute on XL-slave021 to create a swarm:

```bash
docker swarm init --listen-addr 192.168.1.21
```
Create a docker overlay network called hadoop-net
```bash
docker network create --driver overlay hadoop-net
```
Execute on the other nodes to add them to the newly created swarm:

```bash
docker swarm join --token SWMTKN-1-THE-LONG-STRING 192.168.1.21:2377
```
Note: Change the IPs and token to your own
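If you did not note down the token when initializing the swarm, you can print the full join command on the manager node:

```bash
# run on XL-slave021 to print the join command for workers
docker swarm join-token worker
```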
Prepare the environment
Execute on all the nodes to pull the Hadoop docker image ahead of time.

```bash
docker pull newnius/hadoop:2.7.4
```
Create dir /data if you don't have it or it is not writeable:

```bash
sudo mkdir -p /data
```
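If /data already exists but is not writeable for your user, adjusting the ownership is one option (a sketch; adapt it to your own permission policy):

```bash
# make the current user the owner of /data
sudo chown -R $(whoami) /data
```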
And then execute on all the nodes to create the dir for data persistence.

```bash
mkdir -p /data/hadoop/hdfs/
```
Configure Hadoop
Create a dir named config (the name must be config).

```bash
mkdir config
```
There are five configuration files to be placed in this dir.
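Presumably these are the standard Hadoop 2.x set; the linked sample repository is authoritative:

```
config/
├── core-site.xml
├── hdfs-site.xml
├── mapred-site.xml
├── slaves
└── yarn-site.xml
```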
Refer to this link to access the sample files, and update the alias values to your own.
Then distribute the configuration files to all the nodes.
```bash
scp -r config/ XL-slave021:/data/hadoop/
```
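The line above only shows the copy to the first node; a small loop, assuming you have (passwordless) SSH access to all five hosts, covers the rest:

```bash
# copy the config dir to /data/hadoop/ on every node
for host in XL-slave021 XL-slave022 XL-slave023 XL-slave024 XL-slave025; do
  scp -r config/ ${host}:/data/hadoop/
done
```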
Bring up the nodes of Hadoop
Execute `docker service create` on XL-slave021 once per node to bring up all the Hadoop nodes.
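A minimal sketch of the service definition, in which the two bind-mount targets are assumptions about the newnius/hadoop:2.7.4 image rather than verified paths:

```bash
# pin each Hadoop node to its physical host and attach it to hadoop-net;
# dnsrr endpoint mode is required, as noted in the HDFS section below
docker service create \
  --name {{ Alias }} \
  --hostname {{ Alias }} \
  --network hadoop-net \
  --endpoint-mode dnsrr \
  --constraint node.hostname=={{ Hostname }} \
  --mount type=bind,source=/data/hadoop/config,target=/config/hadoop \
  --mount type=bind,source=/data/hadoop/hdfs,target=/tmp/hadoop-root \
  newnius/hadoop:2.7.4
```

(Both mount targets are guesses; check the image documentation for the real paths.)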
Note: Change {{ Alias }} to hadoop-node021 through hadoop-node025.
Note: Change {{ Hostname }} to XL-slave021 through XL-slave025.
Execute the command 5 times, once for each node.
Start Hadoop services
Execute on XL-slave021 to enter the master node's container.
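The container name of a swarm task has the form `<service>.<slot>.<task-id>`; a sketch of the full command, assuming the master service is named hadoop-node021:

```bash
# resolve the full task id of replica 1 and exec into its container
docker exec -it hadoop-node021.1.$(docker service ps \
    -q --no-trunc hadoop-node021 | head -n 1) /bin/bash
```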
In the container, stop any running Hadoop processes and then start the cluster.
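A sketch of the usual Hadoop 2.7 sequence, assuming the bin/sbin scripts are on the PATH inside the image; note that formatting the NameNode is only needed for the first start and erases existing HDFS metadata:

```bash
# stop all Hadoop processes, in case any are running
stop-yarn.sh
stop-dfs.sh
# format the NameNode (first start only; this wipes HDFS metadata)
hdfs namenode -format
# start HDFS and YARN
start-dfs.sh
start-yarn.sh
```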
Validate
Let's validate the setup by running a MapReduce job.
In the container, prepare some input files under hdfs:///user/root/input and run one of the bundled example jobs.
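A sketch based on the standard example from the Hadoop docs; the choice of input files and the $HADOOP_HOME paths are assumptions:

```bash
# prepare input files to hdfs:///user/root/input
hadoop fs -mkdir -p /user/root/input
hadoop fs -put $HADOOP_HOME/etc/hadoop/*.xml /user/root/input
# run the bundled wordcount example
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar \
    wordcount /user/root/input /user/root/output
# inspect the result
hadoop fs -cat /user/root/output/part-r-00000
```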
Cheers!
Now, you have a real distributed Hadoop cluster!
But …
How do we access the WebUI?
How do we access HDFS from outside the swarm?
If you don't care about these problems, you can stop here.
Expose the HDFS
We cannot simply publish the ports, because publishing conflicts with endpoint-mode dnsrr, and there are problems in VIP mode as well.
But we can start another container to publish and forward the traffic.
To access HDFS from outside the swarm, we have to expose port 50070 of the Name Node and port 50075 of the Data Nodes.
Execute `docker service create` on XL-slave021 5 times to start the 5 forwarders.
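A sketch of one forwarder; the socat image and its argument syntax are illustrative assumptions, not necessarily the exact service definition used here:

```bash
# publish {{ port }} on the host and forward traffic into the overlay network
docker service create \
  --name {{ Alias }}-forward \
  --network hadoop-net \
  --constraint node.hostname=={{ Hostname }} \
  --publish mode=host,published={{ port }},target={{ port }} \
  alpine/socat \
  tcp-listen:{{ port }},fork,reuseaddr tcp-connect:{{ Alias }}:{{ port }}
```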
Note: Change {{ Alias }} to hadoop-node021 through hadoop-node025.
Note: Change {{ Hostname }} to XL-slave021 through XL-slave025.
Note: {{ port }} is 50070 for the Name Node and 50075 for the Data Nodes.
Log in to another node outside the swarm, and add the following lines to /etc/hosts:

```
192.168.1.21 hadoop-node021
192.168.1.22 hadoop-node022
192.168.1.23 hadoop-node023
192.168.1.24 hadoop-node024
192.168.1.25 hadoop-node025
```
Open a Python shell to read data from HDFS.
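A sketch using the HdfsCLI package (`pip install hdfs`), whose top-level import is `from hdfs import *`; the file read at the end is only an example path:

```python
from hdfs import InsecureClient

# 50070 is the WebHDFS port of the Name Node, exposed by the forwarder above
client = InsecureClient('http://hadoop-node021:50070', user='root')

# list the validation job's input dir and read one file (example path)
print(client.list('/user/root/input'))
with client.read('/user/root/input/core-site.xml') as reader:
    print(reader.read()[:200])
```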
Expose the WebUI
There are so many nodes and ports that exposing them all directly is not a good idea.
Instead, we can provide a proxy so that we can access the WebUI.
Execute on XL-slave021 to start a socks5 proxy.
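A sketch with a publicly available SOCKS5 image; the image name and port 1080 are assumptions, and any SOCKS5 proxy attached to hadoop-net will do:

```bash
# the proxy joins hadoop-net, so it can resolve the hadoop-node* aliases
docker service create \
  --name hadoop-proxy \
  --network hadoop-net \
  --constraint node.hostname==XL-slave021 \
  --publish mode=host,published=1080,target=1080 \
  serjs/go-socks5-proxy
```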
Then set your browser to use this socks5 proxy.
Refer to this page for how to configure it.
Open your browser and visit http://hadoop-node021:8088, and you shall see the WebUI.
That's it!
Things you have to know if you use this in production:
Take care of SSH keys
Generate the SSH key files yourself and mount them into the containers, as they may change when the Docker image is rebuilt.
Modify the configuration files
Customize the configuration files for your own setup; the defaults are only for demonstration.
Just put them under /data/hadoop/config/ and they will replace the original files if present.