Friday 23 July 2021

Avro Data Format In Hadoop For Big Data Application

This article, written by Hadoop practitioners, introduces the Avro data format and its use in Hadoop development and big data applications.

Introduction To Avro data format

Avro is a data serialization system. It provides a fast, compact binary data format that relies on a schema. When Avro data is read, the schema that was used to write it is always present. Because of this, no per-value type information has to be written, which keeps serialization fast and the output small.

Input data for Avro has a defined format and defined data types, which helps other developers know exactly what the input looks like. When we store data with Avro, the schema is stored along with it, so the files can later be processed by programs written by other developers.

Apache Avro


Java: JDK 1.7

Avro tool:

Cloudera version: CDH4.6.0

Initial steps

  1. Download the Avro tools JAR from the download link above.
  2. We need to prepare some input data in JSON format so that we can convert it to Avro binary data and convert Avro binary data back to JSON. schemaFile.avsc is the schema file for Avro; this is a sample avsc file:
     {
     "type" : "record",
     "name" : "twitter_schema",
     "namespace" : "",
     "fields" : [ {
     "name" : "id",
     "type" : "string",
     "doc" : "id of user"
     }, {
     "name" : "name",
     "type" : "string",
     "doc" : "name of user"
     }, {
     "name" : "timestamp",
     "type" : "long",
     "doc" : "Unix epoch time in seconds"
     } ],
     "doc" : "Basic information of user"
     }

    basicInfor.json is the data file that will be converted to Avro with the Avro tool. It holds one JSON record per line:

     {"id":"1","name":"Jean","timestamp": 1366123856 }
     {"id":"2","name":"Jang","timestamp": 1366489544 }

  3. Put the files into HDFS with these commands:

hadoop fs -mkdir -p /data/mysample/avroTable
hadoop fs -put basicInfor.json /data/mysample/avroTable
hadoop fs -put schemaFile.avsc /data/mysample/avroTableFormat
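
Before converting the data, it can help to check that every record in basicInfor.json matches schemaFile.avsc. The following is a minimal sketch in plain Python, not part of the Avro tool. It inlines the sample schema (without the doc entries) and the two sample records, and it only understands the "string" and "long" types used here; the names PRIMITIVE_CHECKS and validate_record are made up for this illustration.

```python
import json

# Checks for the two primitive Avro types used in the sample schema.
# This is an assumption-laden sketch, not a full Avro validator: any
# other type name raises KeyError.
PRIMITIVE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "long": lambda v: isinstance(v, int) and not isinstance(v, bool)
    and -(2 ** 63) <= v < 2 ** 63,
}


def validate_record(schema, record):
    """Return a list of problems found in one JSON record (empty if none)."""
    problems = []
    for field in schema["fields"]:
        name, avro_type = field["name"], field["type"]
        if name not in record:
            problems.append("missing field: " + name)
        elif not PRIMITIVE_CHECKS[avro_type](record[name]):
            problems.append("field %s is not a valid %s" % (name, avro_type))
    return problems


# Sample schema from schemaFile.avsc, with the "doc" entries left out.
SCHEMA = json.loads("""
{"type": "record", "name": "twitter_schema", "namespace": "",
 "fields": [{"name": "id", "type": "string"},
            {"name": "name", "type": "string"},
            {"name": "timestamp", "type": "long"}]}
""")

# Sample lines from basicInfor.json.
LINES = [
    '{"id":"1","name":"Jean","timestamp": 1366123856 }',
    '{"id":"2","name":"Jang","timestamp": 1366489544 }',
]

for line in LINES:
    print(validate_record(SCHEMA, json.loads(line)))
```

An empty list means the record has every field the schema declares, with the expected type. For real files, the schema and the lines would be read from schemaFile.avsc and basicInfor.json instead of being inlined.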

Code walk through

This command converts the JSON data to Avro format:

java -jar avro-tools-1.7.4.jar fromjson --schema-file schemaFile.avsc basicInfor.json > basicInfor.avro
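
To give a feel for why the binary form is small, here is a rough sketch of how a single record from the sample schema is laid out in Avro's binary encoding: a long is written as a zig-zag, variable-length integer, a string as its byte length followed by its UTF-8 bytes, and a record as its field values in schema order with no field names. This is only the per-record encoding written by hand in Python; the file produced by fromjson also wraps the records in a container with a header and data blocks, which the sketch does not reproduce. The helper names are hypothetical.

```python
def encode_long(n):
    """Encode an integer as an Avro long: zig-zag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small negatives become small positives
    out = bytearray()
    while z >= 0x80:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)  # last byte, continuation bit clear
    return bytes(out)


def encode_string(s):
    """Encode a string as a long byte count followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


ENCODERS = {"string": encode_string, "long": encode_long}

# Field order and types taken from the sample schema.
FIELDS = [("id", "string"), ("name", "string"), ("timestamp", "long")]


def encode_record(fields, record):
    """Concatenate the encoded field values in schema order (no field names)."""
    return b"".join(ENCODERS[avro_type](record[name]) for name, avro_type in fields)


record = {"id": "1", "name": "Jean", "timestamp": 1366123856}
encoded = encode_record(FIELDS, record)
print(len(encoded), encoded)
```

With this encoding the first sample record takes 12 bytes, because the field names and types live in the schema rather than in every record.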

These commands convert Avro data back to JSON, for both an uncompressed file and a Snappy-compressed file:

 java -jar avro-tools-1.7.4.jar tojson basicInfor.avro > basicInfor.json
 java -jar avro-tools-1.7.4.jar tojson basicInfor.snappy.avro > basicInfor.json

These commands extract the schema from an Avro file into an avsc file, for both an uncompressed file and a Snappy-compressed file:

 java -jar avro-tools-1.7.4.jar getschema basicInfor.avro > schemaFile.avsc
 java -jar avro-tools-1.7.4.jar getschema basicInfor.snappy.avro > schemaFile2.avsc
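
The schema can be recovered from either file because an Avro container file starts with an uncompressed header: the four bytes "Obj" plus 1, then a metadata map whose avro.schema entry holds the schema JSON and whose avro.codec entry names the compression codec, then a 16-byte sync marker. The sketch below reads that header in plain Python, as an illustration of the file layout rather than of how the Avro tool is implemented. The function names are hypothetical, and the demo parses a tiny hand-built header in memory instead of a real file.

```python
import io

MAGIC = b"Obj\x01"


def read_long(stream):
    """Decode one Avro long: base-128 varint, then undo the zig-zag."""
    shift, acc = 0, 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("unexpected end of data while reading a long")
        byte = raw[0]
        acc |= (byte & 0x7F) << shift
        if not byte & 0x80:  # continuation bit clear: last byte
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1)


def read_bytes(stream):
    """Read a length-prefixed byte string (also used for string keys)."""
    return stream.read(read_long(stream))


def read_header_metadata(stream):
    """Return the header metadata map of an Avro container file."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count ends the map
            break
        if count < 0:  # negative count is followed by a block size in bytes
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


# Hand-built header for the demo: two metadata entries, end of map, sync marker.
demo_header = (
    MAGIC
    + b"\x04"                          # map block with 2 entries
    + b"\x16avro.schema" + b'\x10"string"'
    + b"\x14avro.codec" + b"\x08null"
    + b"\x00"                          # end of map
    + bytes(16)                        # placeholder sync marker
)

meta = read_header_metadata(io.BytesIO(demo_header))
print(meta["avro.schema"].decode("utf-8"), meta["avro.codec"].decode("utf-8"))
```

For a real file, the same function could be called on open("basicInfor.avro", "rb"); the avro.codec value would then show whether the data blocks are compressed.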

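This Hive statement creates a table that maps onto the Avro data in HDFS. After running it, we can query the Avro data from Hive. Only the Avro-specific clauses are shown here; the complete statement also needs a CREATE EXTERNAL TABLE line with a table name, the Avro input and output formats, and the schema location (the avro.schema.url or avro.schema.literal table property):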

 COMMENT "Mapping avro table in Hive"
 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
 LOCATION '/data/mysample/avroTable'

Verify the result

We can inspect the local files to check the result: the avsc file holds the schema, the json file holds the JSON data, and the avro file holds the data in Avro format.

cat filename

We can also check that the table exists by opening the Hive client and running:

show tables;

We hope this post helps you understand how to work with Avro files in Hadoop and in big data applications.

If you want to know more about the Avro data format or about Hadoop development, feel free to ask in the comments.

Vijay is a compulsive blogger who likes to educate like-minded people on various new technologies and trends. He works with Aegis SoftTech as a software developer and has been developing software for years. Stay Connected to him on Facebook and Google+.