Abstract
Since Hadoop emerged, the Hadoop ecosystem has kept growing, and a great deal of software has been developed around it. Apache HBase and Apache Hive are two examples.
In this experiment, in order to learn these two tools, we use HBase and Hive to continue our research on wuxia novels mentioned before.
Introduction
Apache Hive is a data warehouse built on top of Hadoop. Before Hive, analyzing data in HDFS required writing many MapReduce jobs by hand; even when the data was stored in a structured format, there was no straightforward way to operate on it.
Apache Hive encapsulates these common MapReduce patterns as ready-made operations and exposes them through a SQL-like query language called HiveQL. HiveQL is easy to pick up for anyone who works with relational data warehouses: we can query data in HDFS (Hadoop Distributed File System) just as we would query an RDBMS with SQL.
Apache Hive runs on a Hadoop cluster, so we have to install it on an existing Hadoop cluster. As before, we use Docker to build the Hive image.
Build Docker image
If you don’t want to learn how to write a Dockerfile right now, you can skip directly to Initialization & Start Hive.
Dockerfile
# Based on Hadoop image
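The rest of the Dockerfile is cut off above. Below is a minimal sketch of what it plausibly contains, assuming a newnius/hadoop base image, the Apache archive download URL, and a bootstrap.sh entrypoint; all of these names and versions are assumptions, not taken from the original.

# Based on Hadoop image (assumption: the Hadoop image built in the earlier post)
FROM newnius/hadoop:2.7.4

# Download and unpack Hive 2.1.1 (mirror URL is an assumption)
RUN wget -q https://archive.apache.org/dist/hive/hive-2.1.1/apache-hive-2.1.1-bin.tar.gz && \
    tar -xzf apache-hive-2.1.1-bin.tar.gz -C /usr/local/ && \
    rm apache-hive-2.1.1-bin.tar.gz

ENV HIVE_HOME /usr/local/apache-hive-2.1.1-bin
ENV PATH $PATH:$HIVE_HOME/bin

# Hive needs the MySQL JDBC driver to reach the metastore
RUN wget -q https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.44/mysql-connector-java-5.1.44.jar \
    -O $HIVE_HOME/lib/mysql-connector-java.jar

# Site configuration and bootstrap script (see the next two sections)
ADD hive-site.xml $HIVE_HOME/conf/
ADD bootstrap.sh /etc/bootstrap.sh
RUN chmod +x /etc/bootstrap.sh

CMD ["/etc/bootstrap.sh", "-d"]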
Bootstrap script
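The bootstrap script did not survive extraction. A plausible sketch, assuming its only jobs are to start the SSH daemon and keep the container alive in daemon mode; the real script may differ:

#!/bin/bash

# Start sshd so other nodes and tools can reach the container (assumption)
service ssh start

# -d: daemon mode, keep the container running
if [[ $1 == "-d" ]]; then
    while true; do sleep 1000; done
fi

# -bash: interactive mode, drop into a shell
if [[ $1 == "-bash" ]]; then
    /bin/bash
fi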
Configuration File
Apache Hive stores the metadata of managed tables, such as table definitions, in a metastore backed by Derby (local mode) or MySQL (remote mode). In local mode, users can only run HiveQL on the node where the Hive metastore is installed, so to make better use of Hive we choose MySQL as the backing store of the metastore.
The following configuration file defines which database to use and the connection information for it.
<configuration>
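The rest of the file is cut off above. A sketch of the four standard JDBC properties it most likely sets, assuming the MySQL service is reachable under the hostname mysql and the database, user, and password are all hive (these values are assumptions). Note that & inside a JDBC URL must be escaped as &amp; in XML, which is exactly the characterEncoding pitfall listed in the references:

<configuration>
    <!-- JDBC URL of the metastore database; '&' must be written as '&amp;' in XML -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://mysql:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hive</value>
    </property>
</configuration>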
Build Docker image
docker build --tag newnius/hive:2.1.1 .
Initialization & Start Hive
We configured Hive to use MySQL as the metastore backend, so start MySQL first.
docker service create \
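The full command is truncated above. A sketch of what it plausibly looks like, assuming a Docker swarm overlay network named swarm-net and credentials matching hive-site.xml (service name, network, and passwords are assumptions):

# Service name, network, and credentials below are assumptions
docker service create \
    --name mysql \
    --network swarm-net \
    --replicas 1 \
    --env MYSQL_ROOT_PASSWORD=123456 \
    --env MYSQL_DATABASE=hive \
    --env MYSQL_USER=hive \
    --env MYSQL_PASSWORD=hive \
    mysql:5.7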
Start a Hadoop cluster.
Follow How to quickly setup a Hadoop cluster in Docker to start a Hadoop cluster.
If you already have one, you have to override the configuration files located at /config/hadoop by mounting your own into the container (see the --mount flag in the sketch below).
Start Hive.
Instead of running Hive on the Hadoop nodes, we choose to run Hive in a separate container so that it can be added or removed easily.
docker service create \
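Again the command is truncated. A sketch under the same assumptions (the image tag comes from the build step above; the network name is an assumption):

# Network name is an assumption; the image tag comes from the build step above.
# The --mount line is only needed when overriding /config/hadoop; the host path is a placeholder.
docker service create \
    --name hive \
    --network swarm-net \
    --replicas 1 \
    --mount type=bind,source=/path/to/your/hadoop-config,target=/config/hadoop \
    newnius/hive:2.1.1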
Notice: data inside the containers is not persisted, so it will be lost on restart. See swarm_start_hive.sh for the full script.
Before using Hive, we have to create directories in HDFS for Hive-managed data and initialize the metastore.
On the HDFS NameNode
hdfs dfs -mkdir /tmp
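Only the first command survived extraction. The remaining steps, following the standard Hive setup (the warehouse path is Hive's default hive.metastore.warehouse.dir):

# Remaining setup steps (assumed; per the standard Hive getting-started guide)
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /user/hive/warehouse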
On the Hive node
schematool -dbType mysql -initSchema
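To check that the schema was created, schematool can also report the metastore connection info and schema version (a standard schematool option):

# Prints the metastore connection info and schema version
schematool -dbType mysql -info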
Insert data and Query
Type hive to enter the Hive shell.
Load data into Hive
CREATE TABLE Wuxia (
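The DDL is truncated above. A sketch consistent with the queries below, assuming word_occurrence.txt holds one word and its count per line, separated by a tab (the delimiter is an assumption):

-- Column names match the queries below; the tab delimiter is an assumption
CREATE TABLE Wuxia (
    `word` STRING,
    `count` INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';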
LOAD DATA LOCAL INPATH '/tmp/word_occurrence.txt' INTO TABLE Wuxia;
You can download the file word_occurrence.txt.
Make queries
Query words that occur more than 300 times.
SELECT `word` FROM Wuxia WHERE `count` > 300;
Query the top 100 hottest words.
SELECT `word`, `count` FROM Wuxia ORDER BY `count` DESC LIMIT 100;
Conclusion
Querying data in HDFS with Apache Hive is easy and clear. Although Hive's performance is not remarkable due to the limitations of MapReduce, its ability to process large volumes of data cannot be ignored.
Comparing HBase with Hive, HBase is better suited to OLTP (OnLine Transaction Processing), while Hive works better for OLAP (OnLine Analytical Processing).
References
Hive installation and configuration guide (including a detailed explanation of the three Hive Metastore modes)
Several points to note when deploying Hive, and how to resolve the "Version information not found" error
Problem configuring a JDBC source in an XML file: The reference to entity "characterEncoding" must end with the ';' delimiter
AuthorizationException: User not allowed to impersonate User