This post introduces the concept of partitioning in a big data application and walks through how to do it with Apache Hive.
Introduction to Partitioning in a Big Data Application
Apache Hadoop is a data framework that supports processing big data.
In a big data application, data partitioning is important because it divides a huge data set into many related sub data sets based on some criterion, usually a column of the data such as the collect date, country, or city. With partitioning, the data is better organized for querying and analysis, and query performance improves.
In this topic, I will introduce how to partition data in Hive, focusing on an example that partitions the data of a big data application by day, so that each day's processed data is always loaded into a partitioned table.
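For example, once a table is partitioned by a collect-date column, a query that filters on that column only reads the matching partition folder instead of scanning the whole data set (a sketch, using the partitionTable that is created later in this post):

```sql
-- Only the folder collectdate=20160730000000 is scanned, not the full table
SELECT name, ticketPrice
FROM partitionTable
WHERE collectdate = '20160730000000';
```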
Environment
Java: JDK 1.7
Cloudera version: CDH4.6.0
Initial steps
- We need to prepare an input data file. Open a new file in a Linux terminal:
vi file1
Enter some input data in the format: id;name;ticketPrice;collectdate
1;Join;2000;20160730000000
2;Jean;3000;20160731000000
3;Jang;5000;20160730000000
4;Jane;1000;20160731000000
5;Jin;6000;20160730000000
- We need to put the local file into the Hadoop Distributed File System (HDFS) with these commands:
hadoop fs -mkdir -p /data/mysample/mappingTable
hadoop fs -put file1 /data/mysample/mappingTable/
Code walk through
We will create a Hive script which loads the data into an external mapping table and then moves it into a partitioned output table, so that the data can be queried for analytics according to the business purpose.
-- Create database
create database cloudera_dataPartitionExample;
-- Use database
use cloudera_dataPartitionExample;
-- Drop the partitionTable if it already exists
-- Create the partitionTable to contain the real data after partitioning
-- Fields are DELIMITED by ';' and the data is stored at '/data/mysample/partitionTable'
DROP TABLE IF EXISTS partitionTable;
CREATE EXTERNAL TABLE partitionTable
(
id string,
name string,
ticketPrice string
)
PARTITIONED BY (collectdate string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
STORED AS TEXTFILE
LOCATION '/data/mysample/partitionTable';
-- Drop the mappingTable if it already exists
-- Create the mappingTable to map the raw data which has not been partitioned yet
-- Fields are DELIMITED by ';' and the data is stored at '/data/mysample/mappingTable'
DROP TABLE IF EXISTS mappingTable;
CREATE EXTERNAL TABLE mappingTable
(
id string,
name string,
ticketPrice string,
collectdate string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
STORED AS TEXTFILE
LOCATION '/data/mysample/mappingTable';
-- Enable dynamic partitioning and raise the partition limits for this session
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
-- Move the data from mappingTable to partitionTable, using "collectdate" as the dynamic partition column
INSERT INTO TABLE partitionTable PARTITION (collectdate) SELECT * FROM mappingTable;
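When the partition value is known up front (for example, in a fixed daily load job), a static partition insert is an alternative that does not need the dynamic-partition settings above (a sketch, assuming the same table layout):

```sql
-- Load one day's records into a single, explicitly named partition
INSERT INTO TABLE partitionTable PARTITION (collectdate='20160730000000')
SELECT id, name, ticketPrice
FROM mappingTable
WHERE collectdate = '20160730000000';
```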
-- Register any partition folders found under the table location in the Hive metastore
MSCK REPAIR TABLE partitionTable;
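If only a few partition folders were added directly on HDFS, registering them one by one with ALTER TABLE is a lighter-weight alternative to MSCK REPAIR (a sketch for a single partition):

```sql
ALTER TABLE partitionTable ADD IF NOT EXISTS
PARTITION (collectdate='20160730000000')
LOCATION '/data/mysample/partitionTable/collectdate=20160730000000';
```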
-- Drop the mapping table after partitioning (it is external, so only the metadata is removed)
DROP TABLE IF EXISTS mappingTable;
Verify the result
After running the script above, we will check HDFS and the Hive table to make sure the data has been partitioned correctly.
- Use this command to show the data on HDFS
hadoop fs -ls /data/mysample/partitionTable/
We will see two partition folders:
/data/mysample/partitionTable/collectdate=20160730000000
/data/mysample/partitionTable/collectdate=20160731000000
- View the data in each partition folder:
hadoop fs -text /data/mysample/partitionTable/collectdate=20160731000000/* | head -n 10
We will see two records in this folder, because they were collected on 20160731:
2;Jean;3000
4;Jane;1000
hadoop fs -text /data/mysample/partitionTable/collectdate=20160730000000/* | head -n 10
We will see three records in this folder, because they were collected on 20160730:
3;Jang;5000
1;Join;2000
5;Jin;6000
- We can open the Hive client terminal to query the data in the table partitionTable of the database cloudera_dataPartitionExample:
select * from cloudera_dataPartitionExample.partitionTable;
show create table cloudera_dataPartitionExample.partitionTable;
Note that the collectdate column no longer appears in the data files, because the data is partitioned by that column; Hive derives its value from the partition folder name. We can now query the data easily per partition, based on the collectdate column.
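To take advantage of partition pruning, filter on the partition column; SHOW PARTITIONS lists the partitions Hive knows about (a sketch against the table built above):

```sql
-- Reads only the 20160731 partition folder
SELECT * FROM cloudera_dataPartitionExample.partitionTable
WHERE collectdate = '20160731000000';

-- List all registered partitions of the table
SHOW PARTITIONS cloudera_dataPartitionExample.partitionTable;
```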
I hope you now understand how important partitioning is and how to apply it to your own big data application with Apache Hive.