AVRO - Serialization

Data is serialized for two objectives −

    For persistent storage

    To transport the data over network

What is Serialization?

Serialization is the process of translating data structures or object state into a binary or textual form, either to transport the data over a network or to store it on some persistent storage. Once the data is transported over the network or retrieved from persistent storage, it needs to be deserialized again. Serialization is termed marshalling and deserialization is termed unmarshalling.

Serialization in Java

Java provides a mechanism, called object serialization, where an object can be represented as a sequence of bytes that includes the object's data as well as information about the object's type and the types of data stored in the object.

After a serialized object is written into a file, it can be read from the file and deserialized. That is, the type information and the bytes that represent the object and its data can be used to recreate the object in memory.

In Java, the ObjectOutputStream and ObjectInputStream classes are used to serialize and deserialize an object, respectively.
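For instance, the following minimal sketch serializes a hypothetical Person object (used only for illustration; it is not part of Hadoop or Avro) to a file and reads it back −

import java.io.*;

//A hypothetical class used only to illustrate Java object serialization
class Person implements Serializable {
   private static final long serialVersionUID = 1L;
   String name;
   int age;

   Person(String name, int age) {
      this.name = name;
      this.age = age;
   }
}

public class JavaSerializationExample {
   public static void main(String[] args) throws IOException, ClassNotFoundException {

      //Serializing the object into a file
      try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("person.ser"))) {
         out.writeObject(new Person("Alice", 30));
      }

      //Deserializing the object back from the file
      try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("person.ser"))) {
         Person person = (Person) in.readObject();
         System.out.println(person.name + " " + person.age);
      }
   }
}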

Serialization in Hadoop

Generally, in distributed systems like Hadoop, the concept of serialization is used for Interprocess Communication and Persistent Storage.

Interprocess Communication

    To establish interprocess communication between the nodes connected in a network, the RPC technique is used.

    RPC uses internal serialization to convert the message into binary format before sending it to the remote node over the network. At the other end, the remote system deserializes the binary stream into the original message.

    The RPC serialization format is required to be as follows −

      Compact − To make the best use of network bandwidth, which is the most scarce resource in a data center.

      Fast − Since communication between the nodes is crucial in distributed systems, the serialization and deserialization process should be quick and produce as little overhead as possible.

      Extensible − Protocols change over time to meet new requirements, so it should be straightforward to evolve the protocol in a controlled manner for clients and servers.

      Interoperable − The message format should support nodes that are written in different languages.

Persistent Storage

Persistent storage is digital storage that does not lose its data when the power supply is lost. Files, folders, and databases are examples of persistent storage.

Writable Interface

This is the interface in Hadoop which provides methods for serialization and deserialization. The following table describes the methods −

S.No. Methods and Description
1

void readFields(DataInput in)

This method is used to deserialize the fields of the given object.

2

void write(DataOutput out)

This method is used to serialize the fields of the given object.
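For instance, a custom Writable implements these two methods as shown below. The IntPair class is a hypothetical illustration, not a class shipped with Hadoop −

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

//A hypothetical Writable that holds two int fields (illustration only)
public class IntPair implements Writable {
   private int first;
   private int second;

   public void set(int first, int second) {
      this.first = first;
      this.second = second;
   }

   @Override
   public void write(DataOutput out) throws IOException {
      //Serializing the fields in a fixed order
      out.writeInt(first);
      out.writeInt(second);
   }

   @Override
   public void readFields(DataInput in) throws IOException {
      //Deserializing the fields in the same order they were written
      first = in.readInt();
      second = in.readInt();
   }
}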

WritableComparable Interface

It is the combination of the Writable and Comparable interfaces. This interface inherits the Writable interface of Hadoop as well as the Comparable interface of Java, and therefore provides methods for data serialization, deserialization, and comparison.

S.No. Methods and Description
1

int compareTo(T obj)

This method compares the current object with the given object obj.
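A custom key type usually implements WritableComparable so that it can be both serialized and sorted. The MyKey class below is a hypothetical sketch, not part of Hadoop −

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

//A hypothetical key type combining serialization and comparison (illustration only)
public class MyKey implements WritableComparable<MyKey> {
   private int id;

   public void set(int id) {
      this.id = id;
   }

   @Override
   public void write(DataOutput out) throws IOException {
      out.writeInt(id);
   }

   @Override
   public void readFields(DataInput in) throws IOException {
      id = in.readInt();
   }

   @Override
   public int compareTo(MyKey other) {
      //Sorting keys in ascending order of id
      return Integer.compare(this.id, other.id);
   }
}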

In addition to these interfaces, Hadoop provides a number of wrapper classes that implement the WritableComparable interface. Each class wraps a Java primitive type. The class hierarchy of Hadoop serialization is given below −

Hadoop Serialization Hierarchy

These classes are useful to serialize various types of data in Hadoop. For instance, let us consider the IntWritable class and see how it is used to serialize and deserialize data in Hadoop.

IntWritable Class

This class implements the Writable, Comparable, and WritableComparable interfaces. It wraps a Java int value and provides methods to serialize and deserialize integer data.

Constructors

S.No. Summary
1 IntWritable()
2 IntWritable(int value)

Methods

S.No. Summary
1

int get()

Using this method you can get the integer value present in the current object.

2

void readFields(DataInput in)

This method is used to deserialize the data in the given DataInput object.

3

void set(int value)

This method is used to set the value of the current IntWritable object.

4

void write(DataOutput out)

This method is used to serialize the data in the current object to the given DataOutput object.
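The following short sketch shows how these methods are typically used; the full serialization and deserialization examples are given in the next sections −

import org.apache.hadoop.io.IntWritable;

public class IntWritableBasics {
   public static void main(String[] args) {

      //Wrapping an int value using the one-argument constructor
      IntWritable writable = new IntWritable(12);
      System.out.println(writable.get());   //prints 12

      //Reusing the same object with a new value
      writable.set(25);
      System.out.println(writable.get());   //prints 25
   }
}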

Serializing the Data in Hadoop

The procedure to serialize the integer type of data is discussed below.

    Instantiate the IntWritable class by wrapping an integer value in it.

    Instantiate the ByteArrayOutputStream class.

    Instantiate the DataOutputStream class and pass the ByteArrayOutputStream object to it.

    Serialize the integer value in the IntWritable object using the write() method. This method needs an object of the DataOutputStream class.

    The serialized data will be stored in the byte array object which was passed as a parameter to the DataOutputStream class at the time of instantiation. Convert the data in that stream to a byte array using the toByteArray() method.

Example

The following example shows how to serialize data of integer type in Hadoop −

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

public class Serialization {
   public byte[] serialize() throws IOException {

      //Instantiating the IntWritable object
      IntWritable intwritable = new IntWritable(12);

      //Instantiating the ByteArrayOutputStream object
      ByteArrayOutputStream byteoutputStream = new ByteArrayOutputStream();

      //Instantiating the DataOutputStream object
      DataOutputStream dataOutputStream = new DataOutputStream(byteoutputStream);

      //Serializing the data
      intwritable.write(dataOutputStream);

      //Storing the serialized object in the byte array
      byte[] byteArray = byteoutputStream.toByteArray();

      //Closing the OutputStream
      dataOutputStream.close();
      return byteArray;
   }

   public static void main(String[] args) throws IOException {
      Serialization serialization = new Serialization();
      byte[] byteArray = serialization.serialize();
      System.out.println("Serialized " + byteArray.length + " bytes");
   }
}

Deserializing the Data in Hadoop

The procedure to deserialize the integer type of data is discussed below −

    Instantiate the IntWritable class. Since the value will be read from the stream, the no-argument constructor is sufficient.

    Instantiate the ByteArrayInputStream class by passing the serialized byte array to it.

    Instantiate the DataInputStream class and pass the ByteArrayInputStream object to it.

    Deserialize the data from the DataInputStream object using the readFields() method of the IntWritable class.

    The deserialized data will be stored in the IntWritable object. You can retrieve this data using the get() method of that class.

Example

The following example shows how to deserialize the data of integer type in Hadoop −

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;

import org.apache.hadoop.io.IntWritable;

public class Deserialization {

   public void deserialize(byte[] byteArray) throws Exception {

      //Instantiating the IntWritable class
      IntWritable intwritable = new IntWritable();

      //Instantiating the ByteArrayInputStream object
      ByteArrayInputStream inputStream = new ByteArrayInputStream(byteArray);

      //Instantiating the DataInputStream object
      DataInputStream datainputstream = new DataInputStream(inputStream);

      //Deserializing the data from the DataInputStream
      intwritable.readFields(datainputstream);

      //Printing the deserialized data
      System.out.println(intwritable.get());
   }

   public static void main(String[] args) throws Exception {
      Deserialization dese = new Deserialization();
      dese.deserialize(new Serialization().serialize());
   }
}

Advantage of Hadoop over Java Serialization

Hadoop’s Writable-based serialization can reduce the object-creation overhead by reusing the Writable objects, which is not possible with Java’s native serialization framework.
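As a sketch of this reuse, a single IntWritable instance can be refilled with set() for every value instead of allocating a new object each time −

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

public class WritableReuse {
   public static void main(String[] args) throws IOException {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);

      //One IntWritable instance is reused for every value;
      //only its contents change between write() calls
      IntWritable writable = new IntWritable();
      for (int value = 0; value < 1000; value++) {
         writable.set(value);
         writable.write(out);
      }
      out.close();

      System.out.println("Serialized " + bytes.size() + " bytes with one Writable object");
   }
}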

Disadvantages of Hadoop Serialization

There are two ways to serialize Hadoop data −

    You can use the Writable classes provided by Hadoop’s native library.

    You can also use Sequence Files, which store the data in binary format (a brief sketch of writing one follows this list).
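The following sketch writes a small SequenceFile. It assumes the Hadoop 2.x SequenceFile.createWriter(Configuration, Writer.Option...) API, and the file name numbers.seq is only an example −

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
   public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Path path = new Path("numbers.seq");   //example output path

      //Writing a few key-value pairs into a binary SequenceFile
      try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(path),
            SequenceFile.Writer.keyClass(IntWritable.class),
            SequenceFile.Writer.valueClass(Text.class))) {

         IntWritable key = new IntWritable();
         Text value = new Text();
         for (int i = 1; i <= 3; i++) {
            key.set(i);
            value.set("record-" + i);
            writer.append(key, value);
         }
      }
   }
}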

The main drawback of these two mechanisms is that Writables and SequenceFiles have only a Java API and they cannot be written or read in any other language.

Therefore, files created in Hadoop with the above two mechanisms cannot be read by programs written in any other language, which makes Hadoop a closed, Java-only box. To address this drawback, Doug Cutting created Avro, a language-independent data serialization format.
