AVRO Schemas & APIs
AVRO By Generating a Class
AVRO Using Parsers Library
AVRO Useful Resources
Selected Reading
- Who is Who
- Computer Glossary
- HR Interview Questions
- Effective Resume Writing
- Questions and Answers
- UPSC IAS Exams Notes
AVRO - Overview
To transfer data over a network or to store it persistently, you need to serialize the data. In addition to the serialization APIs provided by Java and Hadoop, we have a special utility called Avro, a schema-based serialization technique.
This tutorial teaches you how to serialize and deserialize data using Avro. Avro provides libraries for various programming languages. In this tutorial, we demonstrate the examples using the Java library.
What is Avro?
Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting, the father of Hadoop. Since Hadoop writable classes lack language portability, Avro becomes quite helpful, as it deals with data formats that can be processed by multiple languages. Avro is a preferred tool to serialize data in Hadoop.
Avro has a schema-based system. A language-independent schema is associated with its read and write operations. Avro serializes data together with its schema into a compact binary format, which can be deserialized by any application.
Avro uses JSON format to declare the data structures. Presently, it supports languages such as Java, C, C++, C#, Python, and Ruby.
Avro Schemas
Avro depends heavily on its schema. It allows data to be written with no prior knowledge of the schema; serialization is fast, and the resulting serialized data is small. The schema is stored along with the Avro data in a file for any further processing.
In RPC, the client and the server exchange schemas during the connection. This exchange helps resolve differences between same-named fields, missing fields, extra fields, and so on.
Avro schemas are defined in JSON, which simplifies their implementation in languages that already have JSON libraries.
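For illustration, a minimal Avro schema describing an employee record might look like the following (the record name, namespace, and fields here are purely illustrative):

{
   "type" : "record",
   "name" : "Employee",
   "namespace" : "tutorial",
   "fields" : [
      { "name" : "name", "type" : "string" },
      { "name" : "id", "type" : "int" }
   ]
}

Such a schema is conventionally saved in a file with the .avsc extension.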
Like Avro, there are other serialization mechanisms in Hadoop such as Sequence Files, Protocol Buffers, and Thrift.
Comparison with Thrift and Protocol Buffers
Thrift and Protocol Buffers are the libraries most comparable to Avro. Avro differs from these frameworks in the following ways −
Avro supports both dynamic and static types as per the requirement. Protocol Buffers and Thrift use Interface Definition Languages (IDLs) to specify schemas and their types. These IDLs are used to generate code for serialization and deserialization.
Avro is built into the Hadoop ecosystem, while Thrift and Protocol Buffers are not.
Unlike Thrift and Protocol Buffers, Avro's schema definition is in JSON and not in any proprietary IDL.
| Property | Avro | Thrift & Protocol Buffers |
|---|---|---|
| Dynamic schema | Yes | No |
| Built into Hadoop | Yes | No |
| Schema in JSON | Yes | No |
| No need to compile | Yes | No |
| No need to declare IDs | Yes | No |
| Bleeding edge | Yes | No |
Features of Avro
Listed below are some of the prominent features of Avro −
Avro is a language-neutral data serialization system.
It can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby).
Avro creates a binary structured format that is both compressible and splittable. Hence it can be efficiently used as the input to Hadoop MapReduce jobs.
Avro provides rich data structures. For example, you can create a record that contains an array, an enumerated type, and a sub-record. These datatypes can be created in any language, can be processed in Hadoop, and the results can be fed to a third language.
Avro schemas, defined in JSON, facilitate implementation in languages that already have JSON libraries.
Avro creates a self-describing file, called an Avro data file, in which it stores data along with its schema in the metadata section (see the sketch after this list).
Avro is also used in Remote Procedure Calls (RPCs). During RPC, the client and the server exchange schemas in the connection handshake.
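As a minimal sketch of the self-describing data file mentioned above, the following Java program writes one record to an Avro data file; the schema, file name, and field values are illustrative assumptions, not part of this tutorial's later examples.

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class DataFileSketch {
   public static void main(String[] args) throws IOException {
      // Illustrative schema: a record with a name field and an id field.
      Schema schema = new Schema.Parser().parse(
         "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
         + "{\"name\":\"name\",\"type\":\"string\"},"
         + "{\"name\":\"id\",\"type\":\"int\"}]}");

      GenericRecord record = new GenericData.Record(schema);
      record.put("name", "Ramu");
      record.put("id", 1);

      // DataFileWriter embeds the schema in the file's metadata,
      // which is what makes employees.avro self-describing.
      try (DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
         writer.create(schema, new File("employees.avro"));
         writer.append(record);
      }
   }
}

Any Avro reader can later open employees.avro without being told the schema in advance, because the schema travels inside the file.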
General Working of Avro
To use Avro, you need to follow the given workflow −
Step 1 − Create schemas. Here you need to design an Avro schema according to your data.
Step 2 − Read the schemas into your program. It is done in two ways −
By Generating a Class Corresponding to Schema − Compile the schema using Avro. This generates a class file corresponding to the schema (an example command follows below).
By Using Parsers Library − You can directly read the schema using the parsers library.
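For the first option, the Avro distribution ships an avro-tools jar whose compile command generates the Java classes; a typical invocation (the jar version and the schema file name emp.avsc are illustrative) is −

java -jar avro-tools-1.11.1.jar compile schema emp.avsc .

This writes the generated .java source files into the given destination directory (here, the current directory).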
Step 3 − Serialize the data using the serialization API provided for Avro, which is found in the package org.apache.avro.specific (the generic counterpart lives in org.apache.avro.generic).
Step 4 − Deserialize the data using the deserialization API provided for Avro, which is also found in the package org.apache.avro.specific.
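Putting the steps together, here is a minimal sketch of the workflow using the parsers-library option; it relies on the generic API (org.apache.avro.generic) rather than generated classes, and the schema and field values are illustrative assumptions.

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroWorkflowSketch {
   public static void main(String[] args) throws IOException {
      // Steps 1 and 2: define a schema and read it with the parsers library.
      Schema schema = new Schema.Parser().parse(
         "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
         + "{\"name\":\"name\",\"type\":\"string\"},"
         + "{\"name\":\"id\",\"type\":\"int\"}]}");

      // Step 3: serialize a record into Avro's compact binary format.
      GenericRecord employee = new GenericData.Record(schema);
      employee.put("name", "Ramu");
      employee.put("id", 1);

      ByteArrayOutputStream out = new ByteArrayOutputStream();
      BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
      new GenericDatumWriter<GenericRecord>(schema).write(employee, encoder);
      encoder.flush();

      // Step 4: deserialize the bytes back into a record.
      BinaryDecoder decoder =
         DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
      GenericRecord result =
         new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
      System.out.println(result);   // prints {"name": "Ramu", "id": 1}
   }
}

With generated classes, SpecificDatumWriter and SpecificDatumReader from org.apache.avro.specific play the same roles as the generic writer and reader above.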