Big data and Hadoop experts introduce the Avro data format and its use in Hadoop development and big data applications. Read through the article to learn the Avro data format as explained by professionals.
Introduction to the Avro Data Format
Avro is a data serialization system. It provides a fast, compact binary data format based on a schema. When Avro data is read, the schema that was used to write it is always available, which keeps serialization fast and the output small.
Because Avro input data has a defined schema, other developers can rely on the input format. When data is stored with Avro, the schema file is stored at the same time, so the files can later be processed by other programs from other developers.
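To see why the binary format is so compact, here is a minimal hand-rolled sketch of two of Avro's binary encoding rules (zig-zag varints for longs, length-prefixed UTF-8 for strings), applied to one record from this article. This is for illustration only; real code should use an Avro library rather than encoding by hand.

```python
# Minimal sketch of Avro's binary encoding rules (zig-zag varint for
# longs, length-prefixed UTF-8 for strings), per the Avro specification.

def encode_long(n):
    """Zig-zag encode n, then emit it as a little-endian base-128 varint."""
    n = (n << 1) ^ (n >> 63)          # zig-zag: small magnitudes -> small codes
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # continuation bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_string(s):
    """Avro string: long length prefix followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

# A record is just its fields encoded back to back, in schema order
# (id, name, timestamp) -- no field names are stored in the data itself.
record = encode_string("1") + encode_string("Jean") + encode_long(1366123856)
print(len(record))  # the whole record fits in 12 bytes
```

Because the schema travels alongside the data, the encoded record carries no field names or type tags at all, which is where the size savings come from.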
Environment
Java: JDK 1.7
Avro tool: http://www.us.apache.org/dist/avro/avro-1.7.4/java/avro-tools-1.7.4.jar
Cloudera version: CDH4.6.0
Initial steps
- Download the Avro tool from the link above.
- Prepare some input data in JSON format to convert to Avro binary data, and to convert from Avro binary data back to JSON. schemaFile.avsc is the schema file for Avro; this is a sample avsc file:
{
"type" : "record",
"name" : "twitter_schema",
"namespace" : "com.hn.avro",
"fields" : [ {
"name" : "id",
"type" : "string",
"doc" : "id of user"
}, {
"name" : "name",
"type" : "string",
"doc" : "name of user"
}, {
"name" : "timestamp",
"type" : "long",
"doc" : "Unix epoch time in seconds"
} ],
"doc:" : "Basic information of user"
}
basicInfor.json is the data file that will be converted to Avro with the Avro tool:
{"id":"1","name":"Jean","timestamp": 1366123856 }
{"id":"2","name":"Jang","timestamp": 1366489544 }
- Put the files into HDFS with the following commands:
hadoop fs -mkdir -p /data/mysample/avroTable /data/mysample/avroTableFormat
hadoop fs -put basicInfor.json /data/mysample/avroTable
hadoop fs -put schemaFile.avsc /data/mysample/avroTableFormat
Code walk through
This command converts the data from JSON to Avro format:
java -jar avro-tools-1.7.4.jar fromjson --schema-file schemaFile.avsc basicInfor.json > basicInfor.avro
These commands convert Avro data back to JSON file format, for both compressed and uncompressed data:
java -jar avro-tools-1.7.4.jar tojson basicInfor.avro > basicInfor.json
java -jar avro-tools-1.7.4.jar tojson basicInfor.snappy.avro > basicInfor.json
These commands extract the schema from an Avro file into an avsc file, for both compressed and uncompressed data:
java -jar avro-tools-1.7.4.jar getschema basicInfor.avro > schemaFile.avsc
java -jar avro-tools-1.7.4.jar getschema basicInfor.snappy.avro > schemaFile2.avsc
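A quick way to sanity-check which of the files produced above are Avro binaries and which are plain JSON: per the Avro specification, every Avro object container file starts with the 4-byte magic "Obj" followed by 0x01. The sketch below checks those leading bytes; the file names in the comment are the ones used in this article.

```python
# Avro object container files begin with the magic bytes "Obj" + 0x01,
# as defined by the Avro specification; JSON output is plain text.
AVRO_MAGIC = b"Obj\x01"

def looks_like_avro(first_bytes):
    """Return True if the given leading bytes carry the Avro container magic."""
    return first_bytes[:4] == AVRO_MAGIC

# In practice you would read the header from disk, e.g.:
# with open("basicInfor.avro", "rb") as f:
#     print(looks_like_avro(f.read(4)))

print(looks_like_avro(b"Obj\x01\x02..."))            # True
print(looks_like_avro(b'{"id":"1","name":"Jean"}'))  # False
```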
This command creates a mapping table in Hive for the Avro data in HDFS. After running it, we can query the Avro data:
CREATE EXTERNAL TABLE avroHive
COMMENT "Mapping avro table in Hive"
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/mysample/avroTable'
TBLPROPERTIES ('avro.schema.url'='/data/mysample/avroTableFormat/schemaFile.avsc');
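Once the external table exists, the Avro records can be queried with ordinary HiveQL. For example, a query along these lines (the table and column names come from the schema defined above) would return the rows loaded earlier:

```sql
-- Fetch one user's fields from the Avro-backed table
SELECT id, name, timestamp FROM avroHive WHERE id = '1';
```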
Verify the result
We can check the local files to see the data: the avsc schema file, the JSON data file, and the Avro binary file:
cat filename
We can also check the table after starting the Hive client:
hive
show tables;
Hopefully this blog helps you understand how to work with Avro files in Hadoop and in big data applications.
Big data and Hadoop professionals have introduced the Avro data format in this post. If you want to know more about the format or Hadoop development, feel free to ask in the comments.