Monday 17 June 2019

Data Partitioning in Big Data Application with Apache Hive

This post introduces the concept of partitioning in a big data application and walks through how to partition data in such an application using Apache Hive.

Introduction to Partitioning in Big Data Applications

Apache Hadoop is a framework for the distributed storage and processing of big data.

In a big data application, partitioning is very important because it divides a huge data set into many related subsets based on criteria derived from data columns, such as collection date, country, or city. With partitioning, the data is better organized for querying and analysis, and query performance improves because only the relevant partitions need to be read.
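To make the idea concrete before we touch Hive, here is a small Python sketch (an illustration of the concept, not Hive itself) that groups records by a partition column, the way Hive keeps one directory per distinct partition value. The record layout matches the sample data used later in this post:

```python
from collections import defaultdict

# Sample records in the format id;name;ticketPrice;collectdate
rows = [
    "1;Join;2000;20160730000000",
    "2;Jean;3000;20160731000000",
    "3;Jang;5000;20160730000000",
    "4;Jane;1000;20160731000000",
    "5;Jin;6000;20160730000000",
]

def partition_by_collectdate(rows):
    """Group rows into partitions keyed by the collectdate column,
    the way Hive keeps one directory per distinct partition value."""
    partitions = defaultdict(list)
    for row in rows:
        record_id, name, ticket_price, collectdate = row.split(";")
        # The partition column becomes the partition key only; it is not
        # stored inside the record itself, mirroring Hive's layout.
        partitions[collectdate].append((record_id, name, ticket_price))
    return dict(partitions)

for key, records in sorted(partition_by_collectdate(rows).items()):
    print(f"collectdate={key}: {len(records)} records")
# collectdate=20160730000000: 3 records
# collectdate=20160731000000: 2 records
```

Each distinct collectdate value becomes its own bucket, which is exactly the organization Hive will give us on HDFS below.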


In this post, I will show how to partition data in Hive, focusing on an example that partitions the data of a big data application by day, so that each day's processed data is always loaded into the partitioned table.

Environment

Java: JDK 1.7

Cloudera version:  CDH4.6.0

Initial steps

  1. We need to prepare an input data file. Open a new file in a Linux terminal:

vi file1

Enter some input data in the format id;name;ticketPrice;collectdate:
1;Join;2000;20160730000000
2;Jean;3000;20160731000000
3;Jang;5000;20160730000000
4;Jane;1000;20160731000000
5;Jin;6000;20160730000000
  2. We need to put the local file onto the Hadoop Distributed File System (HDFS); use these commands:
hadoop fs -mkdir -p /data/mysample/mappingTable
hadoop fs -put file1 /data/mysample/mappingTable/

Code walk through

We will create a Hive script that loads the data into an external mapping table and then moves it into a partitioned output table, which supports querying the data for analytics by business purpose.

-- Create the database

create database cloudera_dataPartitionExample; 

-- Use the database

use cloudera_dataPartitionExample; 

-- Drop the partitionTable if it already exists

-- Create the partitionTable to hold the real data after partitioning,

-- DELIMITED by ';' as in our data and stored at the location '/data/mysample/partitionTable'

DROP TABLE IF EXISTS partitionTable;
CREATE EXTERNAL TABLE partitionTable
(
id string, 
name string, 
ticketPrice string
)
PARTITIONED BY (collectdate string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
STORED AS SEQUENCEFILE
LOCATION '/data/mysample/partitionTable'; 

-- Drop the mappingTable if it already exists

-- Create the mappingTable to map to the data that has not yet been partitioned,

-- DELIMITED by ';' as in our data and stored at the location '/data/mysample/mappingTable'

DROP TABLE IF EXISTS mappingTable;
CREATE EXTERNAL TABLE mappingTable
(
id string, 
name string, 
ticketPrice string, 
collectdate string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
-- The raw input file is plain text, so the mapping table is stored as TEXTFILE
STORED AS TEXTFILE
LOCATION '/data/mysample/mappingTable';
 
SET hive.exec.max.dynamic.partitions.pernode=100000;
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true; 

-- Move the data from mappingTable to partitionTable. This query always moves the data into partitionTable, using "collectdate" as the dynamic partition column

INSERT INTO TABLE partitionTable PARTITION (collectdate) SELECT * FROM mappingTable; 
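To make what this dynamic-partition INSERT does more concrete, here is a rough Python sketch of the resulting file layout (an illustration only, not Hive itself). Each row is routed into a collectdate=<value> directory, and the partition column is dropped from the stored record, as Hive does:

```python
import os
import tempfile

# Rows from the mapping table, format id;name;ticketPrice;collectdate
rows = [
    "1;Join;2000;20160730000000",
    "2;Jean;3000;20160731000000",
    "3;Jang;5000;20160730000000",
    "4;Jane;1000;20160731000000",
    "5;Jin;6000;20160730000000",
]

def dynamic_partition_insert(rows, table_location):
    """Write each row into a collectdate=<value> subdirectory, dropping
    the partition column from the stored record, as Hive does."""
    for row in rows:
        *fields, collectdate = row.split(";")
        part_dir = os.path.join(table_location, f"collectdate={collectdate}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part-00000"), "a") as out:
            out.write(";".join(fields) + "\n")

table_location = tempfile.mkdtemp()
dynamic_partition_insert(rows, table_location)
print(sorted(os.listdir(table_location)))
# ['collectdate=20160730000000', 'collectdate=20160731000000']
```

The two resulting directories match what we will see in the verification step on HDFS.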

-- Register all partition folders found at this location as partitions of the Hive table

MSCK REPAIR TABLE partitionTable; 
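Intuitively, MSCK REPAIR TABLE performs partition discovery: it scans the table location for partition-style directory names and registers each one in the metastore. A hedged Python sketch of that idea (a simplified stand-in, not Hive's actual implementation):

```python
import os
import re
import tempfile

# Create a table location containing two partition-style directories.
table_location = tempfile.mkdtemp()
for value in ("20160730000000", "20160731000000"):
    os.makedirs(os.path.join(table_location, f"collectdate={value}"))

def discover_partitions(table_location):
    """Scan for collectdate=<value> directories and return a dict that
    plays the role of the metastore's partition list."""
    pattern = re.compile(r"^collectdate=(\d+)$")
    registered = {}
    for entry in sorted(os.listdir(table_location)):
        match = pattern.match(entry)
        if match and os.path.isdir(os.path.join(table_location, entry)):
            registered[match.group(1)] = os.path.join(table_location, entry)
    return registered

print(list(discover_partitions(table_location)))
# ['20160730000000', '20160731000000']
```

After this step, Hive knows about both partitions and can answer queries against them.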

-- Drop the mapping table after partitioning

DROP TABLE IF EXISTS mappingTable;

Verify the result

After running the script above, we will check HDFS and the Hive table to verify that the data has been partitioned correctly.

  1. Use this command to show the data on HDFS:
hadoop fs -ls /data/mysample/partitionTable/

We will see two partition folders:

/data/mysample/partitionTable/collectdate=20160730000000
/data/mysample/partitionTable/collectdate=20160731000000
  2. View the data in each partition folder:
hadoop fs -text /data/mysample/partitionTable/collectdate=20160731000000/* | head -n 10

We will see two records in this folder because they were collected on 20160731:

2;Jean;3000

4;Jane;1000

hadoop fs -text /data/mysample/partitionTable/collectdate=20160730000000/* | head -n 10

We will see three records in this folder because they were collected on 20160730:

3;Jang;5000

1;Join;2000

5;Jin;6000

  3. We can access the Hive client terminal to query the data in the table partitionTable in the database cloudera_dataPartitionExample:
select * from cloudera_dataPartitionExample.partitionTable;
show create table cloudera_dataPartitionExample.partitionTable;

Notice that the stored data no longer contains the collect date column, because we have partitioned the data by that column; Hive keeps its value in the partition directory name instead. Now we can easily query the data of each partition based on the collect date column.
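This is where partitioning pays off at query time: a filter on the partition column only has to read the matching directory, not the whole data set (partition pruning). A hedged Python sketch of the idea, recreating the same directory layout and sample values shown in the verification step (illustrative only, not Hive's execution engine):

```python
import os
import tempfile

# Recreate the partitioned layout from the verification step:
# <table>/collectdate=<value>/part-00000, with the partition column
# stored only in the directory name.
table_location = tempfile.mkdtemp()
partitions = {
    "20160730000000": ["3;Jang;5000", "1;Join;2000", "5;Jin;6000"],
    "20160731000000": ["2;Jean;3000", "4;Jane;1000"],
}
for value, lines in partitions.items():
    part_dir = os.path.join(table_location, f"collectdate={value}")
    os.makedirs(part_dir)
    with open(os.path.join(part_dir, "part-00000"), "w") as out:
        out.write("\n".join(lines) + "\n")

def query_partition(table_location, collectdate):
    """Partition pruning: read only the directory whose name matches the
    requested collectdate, skipping every other partition entirely."""
    part_dir = os.path.join(table_location, f"collectdate={collectdate}")
    records = []
    for name in sorted(os.listdir(part_dir)):
        with open(os.path.join(part_dir, name)) as f:
            records.extend(line.rstrip("\n") for line in f if line.strip())
    return records

print(query_partition(table_location, "20160731000000"))
# ['2;Jean;3000', '4;Jane;1000']
```

A query filtering on collectdate touches one directory instead of all of them, which is the performance benefit partitioning was built for.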

I hope this post has helped you understand how important partitioning is and how to apply it in your big data applications. For further doubts and queries, you may contact a good big data consulting company and get related information from experts.



Vijay is a compulsive blogger who likes to educate like-minded people on various new technologies and trends. He works with Aegis SoftTech as a software developer and has been developing software for years. Stay Connected to him on Facebook and Google+.