Since Hadoop first appeared, its ecosystem has kept growing, and a great deal of software has been developed around it. Apache HBase and Apache Hive are two examples.
In this experiment, in order to learn these two tools, we use HBase and Hive to continue our research on the wuxia novels mentioned before.
Apache Hive is a data warehouse. Before Hive, analyzing data in HDFS meant designing many MapReduce jobs. Even if the data was stored in a structured format, there was no straightforward way to operate on it.
Apache Hive encapsulates common MapReduce jobs as a set of operations. On top of these, a SQL-like query language called HiveQL is used to express the operations. It is rather easy to learn for those who work with relational data warehouses. We can use HiveQL to query data in HDFS (Hadoop Distributed File System) just as we would use SQL with an RDBMS.
Apache Hive runs on a Hadoop cluster, so we have to install it on an existing Hadoop cluster. We again use Docker to build the Hive image.
If you don't want to learn how to write a Dockerfile right now, you can skip directly to Initialization & Start.
# Based on Hadoop image
Apache Hive stores the metadata of managed tables, such as table definitions, in a metastore backed by Derby (local mode) or MySQL (distributed mode). In local mode, users can only run HiveQL on the node where the Hive metastore is installed. To make better use of Hive, we choose MySQL as the backend store of the metastore.
The following configuration files define which database to use and its connection information.
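As a rough illustration of what such a configuration contains, the sketch below writes a minimal `hive-site.xml` with the four standard `javax.jdo.option.*` metastore properties. The host name `mysql`, the database name `hive`, the user, the password and the output directory are placeholders assumed for this example, not the values used in the image.

```shell
#!/bin/sh
# Sketch: generate a minimal hive-site.xml that points the metastore at MySQL.
# Every value below (host, database, user, password, output dir) is an
# assumption made for illustration; replace them with your own settings.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
MYSQL_HOST="${MYSQL_HOST:-mysql}"
METASTORE_DB="${METASTORE_DB:-hive}"
METASTORE_USER="${METASTORE_USER:-hive}"
METASTORE_PASSWORD="${METASTORE_PASSWORD:-change-me}"

cat > "$CONF_DIR/hive-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <!-- JDBC URL of the database that backs the metastore -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://${MYSQL_HOST}:3306/${METASTORE_DB}?createDatabaseIfNotExist=true</value>
  </property>
  <!-- JDBC driver class (MySQL Connector/J 5.x) -->
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <!-- Credentials used by Hive to connect to the database -->
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>${METASTORE_USER}</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>${METASTORE_PASSWORD}</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/hive-site.xml"
```

The generated file would then be placed in Hive's configuration directory, and the MySQL JDBC driver jar has to be on Hive's classpath for the connection to work.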
docker build --tag newnius/hive:2.1.1 .
We configured Hive to use MySQL as the metastore, so start MySQL first.
docker service create \
Start a Hadoop cluster.
Follow How to quickly setup a Hadoop cluster in Docker to start a Hadoop cluster.
If you already have one, you have to replace the configuration files located at /config/hadoop by mounting your own.
Instead of running Hive on the Hadoop nodes, we choose to run Hive in a separate container so that it can be added or removed easily.
docker service create \
Notice: data in the containers is not persisted, so it will be lost when they restart. See swarm_start_hive.sh for the full script.
Before using Hive, we have to create a directory in HDFS for Hive managed data and initialize the metastore.
On the HDFS namenode:
hdfs dfs -mkdir /tmp
schematool --dbType mysql --initSchema
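The HDFS preparation step can be sketched as a small script. Besides `/tmp`, it also creates `/user/hive/warehouse`, which is Hive's default warehouse directory; whether this image keeps that default is an assumption. The `run` and `prepare_hive_dirs` helpers and the `DRY_RUN` switch are names invented for this sketch.

```shell
#!/bin/sh
# Sketch: prepare the HDFS directories Hive expects.
# DRY_RUN defaults to 1, so the commands are only printed. Set DRY_RUN=0 on a
# node where the `hdfs` client is configured to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

# Print a command, and execute it only when DRY_RUN is 0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

prepare_hive_dirs() {
  # Scratch space and the (assumed default) warehouse directory
  run hdfs dfs -mkdir -p /tmp
  run hdfs dfs -mkdir -p /user/hive/warehouse
  # Hive needs group write access on both directories
  run hdfs dfs -chmod g+w /tmp
  run hdfs dfs -chmod g+w /user/hive/warehouse
}

prepare_hive_dirs
```

Once the directories exist, `schematool` is run once in the Hive container to create the metastore tables in MySQL.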
Run hive to enter the Hive shell.
CREATE TABLE Wuxia (
LOAD DATA LOCAL INPATH '/tmp/word_occurrence.txt' INTO TABLE Wuxia;
You can download the file word_occurrence.txt.
Query the words that occurred more than 300 times.
SELECT `word` from Wuxia WHERE `count` > 300;
Query the top 100 hottest words.
SELECT `word`,`count` FROM Wuxia ORDER BY `count` DESC LIMIT 100;
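To sanity-check query results on a small sample without a cluster, the same filter and top-N logic can be reproduced with standard shell tools. This is only a sketch: it assumes each line of the input is `<word><TAB><count>`, the sample rows are made up, and `frequent_words` and `top_words` are helper names invented here.

```shell
#!/bin/sh
# Sketch: reproduce the two Hive queries locally on a tiny sample.
# Assumption: each line is "<word><TAB><count>"; the rows below are made up.
SAMPLE="$(mktemp)"
printf '%s\t%s\n' \
  sword 520 \
  master 310 \
  horse 120 \
  poison 45 > "$SAMPLE"

# Counterpart of: SELECT word FROM Wuxia WHERE count > 300;
frequent_words() {
  awk -F '\t' '$2 > 300 { print $1 }' "$1"
}

# Counterpart of a top-N query: sort by count, descending, keep the first n
# rows (n defaults to 100).
top_words() {
  sort -t "$(printf '\t')" -k2,2nr "$1" | head -n "${2:-100}"
}

frequent_words "$SAMPLE"
top_words "$SAMPLE" 2
```

On real data Hive does the same work as distributed jobs, so a local check like this is only practical for files that fit comfortably on one machine.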
Querying data in HDFS with Apache Hive is easy and clear. Although Hive's performance is not remarkable because of the limitations of MapReduce, its ability to process large volumes of data cannot be ignored.
Comparing HBase with Hive, HBase is better suited to OLTP (Online Transaction Processing), while Hive works better for OLAP (Online Analytical Processing).