This post introduces the concept of partitioning in a big data application and walks through how to do it with Apache Hive.
Introduction to Partitioning in a Big Data Application
Apache Hadoop is a data framework that supports processing big data.
In a big data application, data partitioning is important because it divides a huge data set into many related sub data sets based on some criterion, usually a column of the data such as the collect date, country, or city. With partitioning, the data is better organized for querying and analysis, and query performance improves.
In this topic, I will introduce how to partition data in Hive, focusing on an example that partitions the data of a big data application by day, so that each day's processed data is always loaded into a partitioned table.
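For example, once a table is partitioned by a collect-date column, a query that filters on that column only reads the matching partition folder instead of scanning the whole data set (a sketch, using the partitionTable that is created later in this post):

```sql
-- Only the folder collectdate=20160730000000 is scanned, not the full table
SELECT name, ticketPrice
FROM partitionTable
WHERE collectdate = '20160730000000';
```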
Environment
Java: JDK 1.7
Cloudera version: CDH4.6.0
Initial steps
- We need to prepare an input data file. Open a new file in a Linux terminal:
vi file1
Enter some input data in the format: id;name;ticketPrice;collectdate
1;Join;2000;20160730000000
2;Jean;3000;20160731000000
3;Jang;5000;20160730000000
4;Jane;1000;20160731000000
5;Jin;6000;20160730000000
- We need to put the local file into the Hadoop Distributed File System (HDFS) with these commands:
hadoop fs -mkdir -p /data/mysample/mappingTable
hadoop fs -put file1 /data/mysample/mappingTable/
Code walk through
We will create a Hive script which loads the data into an external mapping table and then moves it into a partitioned output table, so that the data can be queried for analytics according to the business purpose.
-- Create database
create database cloudera_dataPartitionExample;
-- Use database
use cloudera_dataPartitionExample;
-- Drop the partitionTable if it already exists
-- Create the partitionTable to contain the real data after partitioning
-- Fields are DELIMITED by ';' and the data is stored at '/data/mysample/partitionTable'
DROP TABLE IF EXISTS partitionTable;
CREATE EXTERNAL TABLE partitionTable
(
id string,
name string,
ticketPrice string
)
PARTITIONED BY (collectdate string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
STORED AS TEXTFILE
LOCATION '/data/mysample/partitionTable';
-- Drop the mappingTable if it already exists
-- Create the mappingTable to map the raw data which has not been partitioned yet
-- Fields are DELIMITED by ';' and the data is stored at '/data/mysample/mappingTable'
DROP TABLE IF EXISTS mappingTable;
CREATE EXTERNAL TABLE mappingTable
(
id string,
name string,
ticketPrice string,
collectdate string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
STORED AS TEXTFILE
LOCATION '/data/mysample/mappingTable';
-- Enable dynamic partitioning and raise the partition limits for this session
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
-- Move the data from mappingTable to partitionTable, using "collectdate" as the dynamic partition column
INSERT INTO TABLE partitionTable PARTITION (collectdate) SELECT * FROM mappingTable;
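When the partition value is known up front (for example, in a fixed daily load job), a static partition insert is an alternative that does not need the dynamic-partition settings above (a sketch, assuming the same table layout):

```sql
-- Load one day's records into a single, explicitly named partition
INSERT INTO TABLE partitionTable PARTITION (collectdate='20160730000000')
SELECT id, name, ticketPrice
FROM mappingTable
WHERE collectdate = '20160730000000';
```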
-- Register any partition folders found under the table location in the Hive metastore
MSCK REPAIR TABLE partitionTable;
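If only a few partition folders were added directly on HDFS, registering them one by one with ALTER TABLE is a lighter-weight alternative to MSCK REPAIR (a sketch for a single partition):

```sql
ALTER TABLE partitionTable ADD IF NOT EXISTS
PARTITION (collectdate='20160730000000')
LOCATION '/data/mysample/partitionTable/collectdate=20160730000000';
```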
-- Drop the mapping table after partitioning (it is external, so only the metadata is removed)
DROP TABLE IF EXISTS mappingTable;
Verify the result
After running the script above, we will check HDFS and the Hive table to make sure the data has been partitioned correctly.
- Use this command to show the data on HDFS
hadoop fs -ls /data/mysample/partitionTable/
We will see two partition folders:
/data/mysample/partitionTable/collectdate=20160730000000
/data/mysample/partitionTable/collectdate=20160731000000
- View the data in each partition folder:
hadoop fs -text /data/mysample/partitionTable/collectdate=20160731000000/* | head -n 10
We will see two records in this folder, because they were collected on 20160731:
2;Jean;3000
4;Jane;1000
hadoop fs -text /data/mysample/partitionTable/collectdate=20160730000000/* | head -n 10
We will see three records in this folder, because they were collected on 20160730:
3;Jang;5000
1;Join;2000
5;Jin;6000
- We can open the Hive client terminal to query the data in the table partitionTable of the database cloudera_dataPartitionExample:
select * from cloudera_dataPartitionExample.partitionTable;
show create table cloudera_dataPartitionExample.partitionTable;
Note that the collectdate column no longer appears in the data files, because the data is partitioned by that column; Hive derives its value from the partition folder name. We can now query the data easily per partition, based on the collectdate column.
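To take advantage of partition pruning, filter on the partition column; SHOW PARTITIONS lists the partitions Hive knows about (a sketch against the table built above):

```sql
-- Reads only the 20160731 partition folder
SELECT * FROM cloudera_dataPartitionExample.partitionTable
WHERE collectdate = '20160731000000';

-- List all registered partitions of the table
SHOW PARTITIONS cloudera_dataPartitionExample.partitionTable;
```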
I hope you now understand how important partitioning is and how to apply it to your own big data application with Apache Hive.