In this post, a Hadoop big data services vendor shares their experience dealing with one of the bugs left behind by the Big Data ecosystem: a bug in Hue affecting the Oozie-Sqoop workflow. Read on for the details.
The Big Data era has given rise to many different applications in the technology world, and the latest among them is Hadoop, for capturing, managing, and analyzing large sets of structured and unstructured data. Research shows that the volume of data is growing rapidly and will reach close to 4 ZB (zettabytes) by the end of this year.
Hadoop has by now evolved into a mature framework, and many organizations use it for their Big Data strategic planning, covering event, social, web, spatial, and sensor data.
Hadoop is built around two main components, one to store the data and one to process it:
- Hadoop Distributed File System (HDFS) is the file system that enables large amounts of structured and unstructured data to be stored and quickly accessed across large server clusters.
- MapReduce is the data processing model, built on Java.
As the Big Data ecosystem is transforming at a fast pace to cope with changing requirements, some bugs get left behind. We encountered one such bug in the Hue – Oozie – Sqoop workflow and thought of sharing it with you.
If you want to import data from Oracle into HDFS through Sqoop, you would use a Sqoop command from the command prompt, as shown below.
sqoop import \
  --connect jdbc:oracle:thin:@ipaddress:portnumber/dbname \
  --username ***** \
  --password ***** \
  --query "select dt_dim_id, calendar_dt, calendar_yr, calendar_mth, calendar_mth_desc from date_dim where \$CONDITIONS" \
  -m 1 \
  --target-dir "hdfs://ipaddress:8020/user/sqoop/sqoop_result"
But what if you want to load the data into HDFS at a regular time interval? Then you need to schedule this Sqoop command through Oozie.
The best way to set up an Oozie job is through Hue, a web interface for interacting with different Hadoop ecosystem components such as Hive, Pig, Oozie, etc.
When executing the Oozie-Sqoop job through Hue, the job got killed with the error message shown in the screenshots below.
A more detailed description was captured in the logs shown below.
If you are not able to see the error message, please find the attached log file and search for “[main] ERROR”.
We can say this is a limitation of Hue. The “--query” statement (enclosed in quotes) in the Sqoop command causes each portion of the query to be interpreted as an unrecognized argument, as shown in the error message above: Hue does not preserve the quotes in the command, so it cannot recognize the words after SELECT as a single quoted argument.
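To see why the quoting matters, here is a small shell sketch. The query text is abridged from the command above, and `count_args` is a throwaway helper for illustration only, not part of Sqoop or Hue:

```shell
#!/bin/sh
# Throwaway helper: reports how many arguments it receives.
count_args() { echo $#; }

# Literal query text; single quotes keep $CONDITIONS unexpanded.
query='select dt_dim_id, calendar_dt from date_dim where $CONDITIONS'

# Quoted, as on the CLI: the query travels as ONE argument.
count_args --query "$query" -m 1     # prints 4

# Re-split on whitespace, roughly what happens inside Hue:
# every word of the query becomes a separate, unrecognized argument.
count_args --query $query -m 1       # prints 10
```

In the second call, Sqoop would see tokens like “dt_dim_id,” and “from” as stray arguments after “--query”, which matches the error in the logs.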
Because this is a limitation of Hue, go with the CLI option instead. Prepare a workflow.xml (as shown in the screenshot below) to schedule the job through Oozie, breaking the whole command into arguments: start with “import” as the first argument, then enter each part of the command as a separate argument, as in the attached document.
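In case the screenshot does not render, here is a minimal sketch of such a workflow.xml. The workflow-app name, the ${jobTracker}/${nameNode} properties, and the connection details are placeholders mirroring the command above, not a verbatim copy of the original file:

```xml
<workflow-app name="sqoop-import-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="sqoop-import"/>
    <action name="sqoop-import">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- Each part of the Sqoop command is a separate <arg>,
                 so the query survives as a single argument. -->
            <arg>import</arg>
            <arg>--connect</arg>
            <arg>jdbc:oracle:thin:@ipaddress:portnumber/dbname</arg>
            <arg>--username</arg>
            <arg>*****</arg>
            <arg>--password</arg>
            <arg>*****</arg>
            <arg>--query</arg>
            <arg>select dt_dim_id, calendar_dt, calendar_yr, calendar_mth, calendar_mth_desc from date_dim where $CONDITIONS</arg>
            <arg>-m</arg>
            <arg>1</arg>
            <arg>--target-dir</arg>
            <arg>hdfs://ipaddress:8020/user/sqoop/sqoop_result</arg>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <fail name="fail">
        <message>Sqoop import failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </fail>
    <end name="end"/>
</workflow-app>
```

Note that the query in the `<arg>` element needs no surrounding quotes at all, since each `<arg>` is passed to Sqoop as exactly one argument.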
If you are not able to see the above XML, please find the attached file for the content.
Submit the Oozie job with the following command:
$ oozie job -oozie http://localhost:11000/oozie -config examples/apps/map-reduce/job.properties -run
Note: here, “examples/apps/map-reduce/job.properties” is the path to the job.properties file; change it according to your own location.
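For reference, a minimal job.properties for this kind of workflow might look like the following. The host, ports, and HDFS path are placeholders; adjust them to your cluster:

```
nameNode=hdfs://ipaddress:8020
jobTracker=ipaddress:8032
oozie.use.system.libpath=true
# HDFS directory that contains the workflow.xml
oozie.wf.application.path=${nameNode}/user/sqoop/apps/sqoop-import-wf
```

Setting oozie.use.system.libpath=true lets the workflow pick up the shared Sqoop libraries (including the JDBC driver, if it is placed there) from the Oozie sharelib.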
To sum up, we have to keep a few limitations of Hue in mind while preparing an Oozie workflow through it: you will get an error if you schedule a Sqoop command through Oozie that uses the query option.
It is better to go with the CLI option and prepare a workflow.xml file with ‘<arg>’ tags in it.
This article covered one of the bugs in the Big Data ecosystem, the Oozie-Sqoop workflow issue in Hue, and the Hadoop big data vendor’s experience dealing with it. If you have any questions, feel free to ask.