Hadoop decided to abandon Java serialization and instead implemented its own serialization mechanism using the Writable and WritableComparable interfaces. If you were the lead architect of Hadoop, would you have taken the same approach? Why or why not?
No, Hadoop does not use the Java serialization process, and for good reason.
Explanation:
Java comes with its own serialization mechanism, called Java Object Serialization (Java serialization for short), which is tightly integrated with the language.
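For context, here is a minimal sketch of Java Object Serialization (the Point class and other names here are illustrative): a class opts in simply by implementing java.io.Serializable, and the runtime handles writing and reading the object graph.

```java
import java.io.*;

// A class opts into Java serialization just by implementing Serializable;
// the JVM writes and reads the object graph automatically.
class Point implements Serializable {
    int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

public class JavaSerializationDemo {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new Point(3, 4));   // serialize
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            Point p = (Point) ois.readObject(); // deserialize
            System.out.println(p.x + "," + p.y); // prints 3,4
        }
    }
}
```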
In big data processing scenarios like Hadoop's, we need precise control over exactly how objects (which may represent any data the application handles) are written and read.
Java serialization gives you some control, but you run into many problems (discussed below) that are central to Hadoop. The same hurdles come up with RMI (Remote Method Invocation), the Java API that allows an object to invoke a method on an object that exists in another address space, whether on the same machine or a remote one. RMI creates a public remote server object that enables client-server communication through simple method calls on that server object.
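As an aside, a minimal RMI sketch looks like this (the Greeter interface and GreeterServer class are hypothetical names); note that RMI relies on Java serialization to move arguments and return values between address spaces:

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// The remote interface: clients invoke this method as if it were local,
// while RMI serializes arguments and return values under the hood.
interface Greeter extends Remote {
    String greet(String name) throws RemoteException;
}

public class GreeterServer implements Greeter {
    public String greet(String name) throws RemoteException {
        return "Hello, " + name;
    }

    public static void main(String[] args) throws Exception {
        // Export the server object and publish its stub in the registry
        // so remote clients can look it up by name.
        Greeter stub = (Greeter) UnicastRemoteObject.exportObject(new GreeterServer(), 0);
        Registry registry = LocateRegistry.createRegistry(1099);
        registry.rebind("Greeter", stub);
    }
}
```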
Problems with Java Serialization:
1. Java serialization does not meet the criteria for a good serialization format: compact, fast, extensible, and interoperable.
2. Java serialization is not compact: it writes the class name of each object being written to the stream. This is true of classes that implement java.io.Serializable or java.io.Externalizable (see the sketch after this list).
3. In other words, Java serialization writes metadata about each object, including the class name, field names and types, and the superclass. ObjectOutputStream and ObjectInputStream optimize this with back-references (handles), but object sequences written with handles cannot be accessed randomly, since they rely on stream state, and this further complicates operations such as sorting.
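To make the compactness problem concrete, this sketch (class name illustrative) compares the size of a single long written with Java serialization against the raw 8 bytes a Writable-style encoding uses; the exact byte count varies by JDK, but the metadata overhead dominates:

```java
import java.io.*;

public class SerializationOverhead {
    public static void main(String[] args) throws IOException {
        // Java serialization: a stream header plus full class descriptors
        // for java.lang.Long and its superclass come along with the value.
        ByteArrayOutputStream javaBytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(javaBytes)) {
            oos.writeObject(Long.valueOf(42L));
        }
        System.out.println("Java-serialized Long: " + javaBytes.size() + " bytes");

        // Writable-style encoding: just the 8 bytes of the value itself,
        // which is exactly what Hadoop's LongWritable.write() emits.
        ByteArrayOutputStream rawBytes = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(rawBytes)) {
            dos.writeLong(42L);
        }
        System.out.println("Raw long: " + rawBytes.size() + " bytes");
    }
}
```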
In Hadoop's serialization mechanism, by contrast, the application already knows the expected class when defining a Writable, so Writables do not store their type in the serialized representation; the reader supplies the type at deserialization time.
For example, if the input key is a LongWritable instance, an empty LongWritable instance is asked to populate itself from the input data stream.
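A sketch of a custom key type under this scheme (PageVisitKey is a hypothetical example, not part of Hadoop) shows the pattern: write() emits only raw field values, readFields() populates an empty instance from the stream, and compareTo() from WritableComparable supports sorting:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type: only the raw field values are written, with no
// class name or field metadata, because both sides already know the type.
public class PageVisitKey implements WritableComparable<PageVisitKey> {
    private long pageId;
    private long timestamp;

    public PageVisitKey() { }                  // required no-arg constructor

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(pageId);                 // 8 bytes
        out.writeLong(timestamp);              // 8 bytes; 16 bytes total
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        pageId = in.readLong();                // an empty instance populates
        timestamp = in.readLong();             // itself from the stream
    }

    @Override
    public int compareTo(PageVisitKey other) { // enables sorting of keys
        int cmp = Long.compare(pageId, other.pageId);
        return cmp != 0 ? cmp : Long.compare(timestamp, other.timestamp);
    }
}
```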
Since no metadata (class name, fields, types, superclass) needs to be stored, this results in more compact binary files, random access, and high performance.
As Hadoop's core idea is to process large data sets with high performance, it follows its own serialization mechanism rather than Java serialization.
Note that in Hadoop, inter-process communication between nodes in the system is implemented using remote procedure calls (RPCs). The RPC protocol uses serialization to render each message into a binary stream to be sent to the remote node, which then deserializes the binary stream back into the original message.
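The serialize/deserialize round trip an RPC performs can be sketched with Hadoop's real LongWritable (the class name RpcRoundTripSketch is illustrative; actual Hadoop RPC adds framing and connection handling on top of this):

```java
import java.io.*;
import org.apache.hadoop.io.LongWritable;

public class RpcRoundTripSketch {
    public static void main(String[] args) throws IOException {
        // Sender side: render the message into a binary stream.
        LongWritable request = new LongWritable(42L);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        request.write(new DataOutputStream(bos));

        // Receiver side: deserialize the stream back into the message.
        LongWritable received = new LongWritable();
        received.readFields(new DataInputStream(
                new ByteArrayInputStream(bos.toByteArray())));
        System.out.println(received.get()); // prints 42
    }
}
```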