This blog post from a big data services vendor explains how to handle the header row in Apache Pig. You will also learn the basics of the Apache Pig technology along the way. Read on to see what the experts have to say about Pig.
Whenever we talk about Big Data, the first impression that comes to mind is that it can absorb all types of data, be it structured, semi-structured, or unstructured. Unstructured data refers to data that does not follow any fixed schema and is not organized in a predefined way.
In the late 1990s, Merrill Lynch stated that unstructured data would contribute around 80-90% of an organization's important and useful information. Moreover, data is growing at such a pace that IDC predicted it would reach approximately 40 ZB by 2020. Yet until 2006 there was no tool or framework that could handle such varied data and extract useful or meaningful information from it.
Then Hadoop came into the picture. Hadoop was created by Doug Cutting and his team to handle Big Data in all its variety, velocity, volume, and veracity, and it has the capability to process all of them. An important ecosystem component developed at Yahoo to process, transform, profile, and clean data was Pig.
Apache Pig is built to handle almost any kind of data and is commonly used to refine or clean it. For example, Pig can be used for ETL (Extract, Transform, Load) operations, such as finding the rows that meet a specific condition or joining two different datasets on a key. Pig Latin is a dataflow language that describes how the data is to be transformed, and it can tackle data with an unorganized schema.
Even though Apache Pig is now a mature ecosystem component of the Hadoop framework, it still has a few limitations. In particular, it does not provide a built-in option to handle the header row of an input file.
Below is a screenshot of a sample CSV file that contains a header row; we will use this CSV file for our use case.
Transfer the file from Windows to the Linux server using WinSCP.
Verify that the file has been loaded successfully.
Copy the file from the local file system to HDFS using the command below.
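As a sketch, assuming the file is named sample.csv and the HDFS destination is /user/hadoop/pig_demo (both the file name and the path are placeholders; adjust them for your environment):

```shell
# Copy the local CSV file into HDFS (paths are examples)
hdfs dfs -mkdir -p /user/hadoop/pig_demo
hdfs dfs -put sample.csv /user/hadoop/pig_demo/

# Confirm the file landed in HDFS
hdfs dfs -ls /user/hadoop/pig_demo/
```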
Start the Pig grunt shell in HDFS (MapReduce) mode and load the CSV file from HDFS using a comma as the delimiter.
Verify the output with the help of the DUMP command.
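A minimal sketch of the load and dump, assuming the file sits at /user/hadoop/pig_demo/sample.csv and has columns id, name, and city (the path, column names, and types are illustrative assumptions):

```pig
-- Load the CSV using PigStorage with a comma delimiter
input_file = LOAD '/user/hadoop/pig_demo/sample.csv'
             USING PigStorage(',')
             AS (id:chararray, name:chararray, city:chararray);

-- DUMP prints every tuple, including the header row
DUMP input_file;
```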
In the screenshot below, we can clearly see that the output also includes the header row, which is the first row of our sample CSV file.
After some research, we concluded that there is no option to skip the header row at the time of loading data into a Pig relation.
However, for CSV files there is an alternative solution, explained below.
Register the piggybank.jar file in Pig grunt shell.
Load the same CSV file using org.apache.pig.piggybank.storage.CSVExcelStorage() with SKIP_INPUT_HEADER as an option.
Now view the output using the DUMP command.
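A sketch of the piggybank approach, assuming piggybank.jar is available at the path shown and the same placeholder file and columns as before (adjust both for your installation):

```pig
-- Register the piggybank jar so its storage functions are available
REGISTER '/usr/lib/pig/piggybank.jar';

-- CSVExcelStorage can skip the header row at load time
input_file = LOAD '/user/hadoop/pig_demo/sample.csv'
             USING org.apache.pig.piggybank.storage.CSVExcelStorage(
                 ',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
             AS (id:chararray, name:chararray, city:chararray);

DUMP input_file;  -- the header row is no longer present
```

The four constructor arguments are the field delimiter, the multiline treatment, the line-ending style, and the header treatment; SKIP_INPUT_HEADER is what drops the first row on load.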
The screenshot below shows that the header row has been skipped using the above method.
However, the above method is only useful when the data is in CSV format. For text files or other input formats, we have to implement a few steps manually, as described below.
Below are the steps one needs to follow to skip the header row.
- Load the input file into the relation 'input_file' in the traditional way.
- Rank the first relation, which adds rank_input_file (a serial number) as a column.
- Filter the ranked relation, keeping only the rows where the rank_input_file column is greater than 1.
- Generate all the columns of the resulting relation.
Verify with the DUMP command that the header row has been skipped.
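The steps above can be sketched in Pig Latin as follows (the file path, relation names, and columns are illustrative assumptions):

```pig
-- Step 1: load the file the traditional way (the header is still the first row)
input_file = LOAD '/user/hadoop/pig_demo/sample.txt'
             USING PigStorage(',')
             AS (id:chararray, name:chararray, city:chararray);

-- Step 2: RANK prepends a serial-number column named rank_input_file
ranked = RANK input_file;

-- Step 3: keep only the rows after the first one (the header)
no_header = FILTER ranked BY rank_input_file > 1;

-- Step 4: project away the rank column, keeping the original fields
result = FOREACH no_header GENERATE id, name, city;

DUMP result;
```

Note that RANK names the generated column rank_<alias>, so for a relation called input_file the column to filter on is rank_input_file.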
There is currently no option or clause in Pig to skip the header row at runtime, so we have to implement these steps manually.
The big data services vendor has shared this post to show you how to handle the header row in Pig. If you find anything missing from the post, let us know in the comments.