This post shows how to build a Docker image of Hadoop and how to set up a distributed Hadoop cluster for experimental use only, even on a single machine.
Although Hadoop can be installed in single-node mode, the best way to learn Hadoop is to set up a distributed cluster, the way it runs in a production environment. Setting up a Hadoop cluster in distributed mode on a single personal computer is not easy, and that is where Docker comes in.
Docker is a virtualization technique based on Linux namespaces and cgroups, and it consumes fewer extra resources than other virtualization techniques such as VirtualBox.
Docker provides the ability to package and run an application in a loosely isolated environment called a container. The isolation and security allow you to run many containers simultaneously on a given host. Containers are lightweight because they don’t need the extra load of a hypervisor, but run directly within the host machine’s kernel. This means you can run more containers on a given hardware combination than if you were using virtual machines. You can even run Docker containers within host machines that are actually virtual machines!
With the help of Docker Swarm, it is quite convenient to set up a cluster with service discovery, load balancing, retry on failure, fast deployment, and so on.
Here is how to quickly set up a distributed Hadoop cluster in Docker.
Install Docker (Requires 1.13 or later)
# install newest version
curl -fsSL https://get.docker.com | bash

# add current user to docker group so that it can execute docker commands
sudo usermod -aG docker $USER

# start docker
sudo systemctl start docker
Log out of the current session and log back in for the group change to take effect.
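To confirm that Docker is installed and that the group change took effect, a quick sanity check (all standard Docker commands):

# check the installed version (swarm mode needs 1.13 or later)
docker version

# confirm the current user is now in the docker group
id -nG | grep docker

# run a throwaway container without sudo
docker run --rm hello-world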
# init a docker swarm cluster that listens on localhost
docker swarm init --advertise-addr 127.0.0.1
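To verify that swarm mode is active, and to give the Hadoop containers a network on which they can reach each other by hostname, something like the following works; the network name hadoop-net is only an illustrative assumption and must match whatever the start script uses:

# verify that this node is a swarm manager
docker info | grep -i swarm
docker node ls

# create an attachable overlay network for the Hadoop containers
# ("hadoop-net" is an assumed name used for illustration)
docker network create --driver overlay --attachable hadoop-net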
# add default config files which assume one master and three slaves
ADD core-site.xml $HADOOP_HOME/etc/hadoop/core-site.xml
ADD hdfs-site.xml $HADOOP_HOME/etc/hadoop/hdfs-site.xml
ADD mapred-site.xml $HADOOP_HOME/etc/hadoop/mapred-site.xml
ADD yarn-site.xml $HADOOP_HOME/etc/hadoop/yarn-site.xml
ADD slaves $HADOOP_HOME/etc/hadoop/slaves
# update JAVA_HOME and HADOOP_CONF_DIR in hadoop-env.sh
RUN sed -i "/^export JAVA_HOME/ s:.*:export JAVA_HOME=${JAVA_HOME}\nexport HADOOP_HOME=${HADOOP_HOME}\nexport HADOOP_PREFIX=${HADOOP_PREFIX}:" ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh
RUN sed -i '/^export HADOOP_CONF_DIR/ s:.*:export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop/:' $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh
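The slaves file added above is simply a list of worker hostnames, one per line. Assuming the default layout of one master and three slaves, it could be generated like this (the hadoop-slaveN hostnames are assumptions and must match the names the containers are started with):

# generate a slaves file listing the three assumed worker hostnames
cat > slaves <<EOF
hadoop-slave1
hadoop-slave2
hadoop-slave3
EOF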
# installing libraries if any - (resource urls added comma separated to the ACP system variable)
cd $HADOOP_PREFIX/share/hadoop/common ; for cp in ${ACP//,/ }; do echo == $cp; curl -LO $cp ; done; cd -

# replace config files if provided
cp /mnt/hadoop-config/* $HADOOP_PREFIX/etc/hadoop/

# start sshd
/usr/sbin/sshd

if [[ $1 == "-d" ]]; then
  while true; do sleep 1000; done
fi

if [[ $1 == "-bash" ]]; then
  /bin/bash
fi
Do not forget to run chmod +x bootstrap.sh to make it executable.
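Judging from the flags handled at the end of the script, bootstrap.sh is meant to be invoked roughly as follows inside the container (the path is an assumption):

# keep the container alive after starting sshd (typical default command)
/etc/bootstrap.sh -d

# or drop into an interactive shell for debugging
/etc/bootstrap.sh -bash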
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-master:8020</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-master:8020</value>
  </property>
</configuration>
Notice: data inside the containers is not persisted, so it will be lost when they restart. Browse start_hadoop.sh for the full script.
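Because bootstrap.sh copies everything under /mnt/hadoop-config into the Hadoop configuration directory, custom config files can be supplied at run time with a bind mount. A minimal sketch, assuming the image is named hadoop-cluster, its entrypoint is bootstrap.sh, and the overlay network from earlier is called hadoop-net (all of these names are assumptions):

# start a master container with local config files mounted over the defaults;
# the trailing -d is passed to bootstrap.sh to keep the container running
docker run -d --name hadoop-master --hostname hadoop-master \
    --network hadoop-net \
    -v "$(pwd)/hadoop-config:/mnt/hadoop-config" \
    hadoop-cluster -d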
Start a proxy to access the Hadoop web UI
A Hadoop cluster exposes so many hosts and ports that it is messy and difficult to map all of them, so a proxy is used instead of port mapping. The proxy speaks the SOCKS5 protocol, which supports remote DNS, so we can visit a hostname such as http://hadoop-master:8088 to monitor Hadoop YARN.
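One way to get such a proxy, as a sketch: run a SOCKS5 server container on the same network as the Hadoop containers and publish its port on the host (serjs/go-socks5-proxy is just one publicly available image, and hadoop-net is the assumed network name from earlier):

# SOCKS5 proxy attached to the Hadoop network, reachable at localhost:1080
docker run -d --name socks5-proxy --network hadoop-net -p 1080:1080 serjs/go-socks5-proxy

Then point the browser at SOCKS5 proxy 127.0.0.1:1080 with remote DNS enabled (in Firefox, "Proxy DNS when using SOCKS v5"), and hostnames such as hadoop-master will be resolved by Docker's embedded DNS inside the network.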