Sunday, August 1, 2021

Understanding Java Serialization from the Serialization Stream

Java serialization is one of the most misused features of the language. If you have the choice of not using it, do not use it. Otherwise, soldier on.

This article examines Java serialization from the perspective of the serialized stream, and particularly how various serialization approaches affect the content of the serialization stream. I find it helpful for my serialization design.

Java Serialization Stream Format

A Java serialization stream is structured as follows:

Stream Header

A stream starts with the header of two short integers:

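| Structure | Section | Type | Sample Value | Explanation |
|---|---|---|---|---|
| Stream Header | STREAM_MAGIC | short | -21267 | 0xaced |
| | STREAM_VERSION | short | 5 | Stream version. The protocol in use is PROTOCOL_VERSION_2, introduced in JDK 1.2: Externalizable data is written in block data mode and is terminated with TC_ENDBLOCKDATA. |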


See: ObjectOutputStream#writeStreamHeader
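To see the header in isolation, here is a minimal sketch that opens an ObjectOutputStream, writes nothing, and dumps the resulting bytes. The class and method names (StreamHeaderDemo, emptyStream, toHex) are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

final class StreamHeaderDemo {
    /** Returns the bytes of a stream into which no object was ever written. */
    static byte[] emptyStream() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // The ObjectOutputStream constructor writes the stream header itself.
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Formats bytes as space-separated two-digit hex values. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(emptyStream())); // ac ed 00 05
    }
}
```

The four bytes are STREAM_MAGIC (0xaced) followed by STREAM_VERSION (5), each written as a big-endian short.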


Primitives

Primitives are stored in their byte[] representation without any type meta-data. As a result, the reads must match the writes exactly, in both type and order.


See: ObjectOutputStream#writeLong, #writeShort, #writeByte
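A small sketch of what "must match exactly" means in practice: the writer emits two ints, and the reader consumes the same eight bytes as a single long without any complaint from the stream. The class name PrimitiveMismatchDemo is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

final class PrimitiveMismatchDemo {
    /** Writes two ints and then reads the same eight bytes back as one long. */
    static long writeTwoIntsReadOneLong(int first, int second) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeInt(first);
                out.writeInt(second);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                // No exception here: the stream records no type for primitives.
                return in.readLong();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The bytes 00 00 00 01 00 00 00 02 are read as one big-endian long.
        System.out.println(writeTwoIntsReadOneLong(1, 2)); // 4294967298
    }
}
```

The mismatch goes unnoticed at the stream level and surfaces only as a wrong value, which is why hand-written read and write code should be kept side by side and tested with a round trip.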

Strings

Strings sit between primitives and full-blown objects. A string starts with a marker constant but, unlike an object, does not need a type descriptor.

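| Structure | Section | Type | Sample Value | Explanation |
|---|---|---|---|---|
| String | TC_STRING | byte | | TC_LONGSTRING if the encoded string length > 0xFFFF |
| | String length | short | | long for TC_LONGSTRING |
| | String value | UTF | | Modified UTF-8 bytes |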


See: ObjectOutputStream#writeString
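As a quick check of the layout above, this sketch serializes a short string with writeObject and dumps the bytes. The names StringRecordDemo, serialize and toHex are illustrative only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

final class StringRecordDemo {
    /** Serializes a single String as an object and returns the whole stream. */
    static byte[] serialize(String value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Formats bytes as space-separated two-digit hex values. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // header | TC_STRING | length | 'a' 'b' 'c'
        System.out.println(toHex(serialize("abc"))); // ac ed 00 05 74 00 03 61 62 63
    }
}
```

After the four header bytes come the TC_STRING marker (0x74), a two-byte length, and the three encoded characters; no class descriptor is involved.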

Objects

There are several variations of the serialization format of objects. In the most common and simple case, a serializable class implements the Serializable marker interface. For example,


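import java.io.Serializable;

public class Foo implements Serializable {
   private static final long serialVersionUID = 8999613412220288700L;

   private long size;
   private String name;

   public Foo(long size, String name) {
       // the parameter must be named size, otherwise this line assigns the field to itself
       this.size = size;
       this.name = name;
   }

   // Optional: Serializable itself does not require a no-arg constructor
   public Foo() { }
}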

 


The serialization stream of the above object consists of two main sections:

  • Class descriptor, including class full name, serialization UID, serializable fields

  • Serial data


| Structure | Section | Type | Sample Value | Explanation |
|---|---|---|---|---|
| Object | TC_OBJECT | byte | 115 | |
| Class Descriptor | TC_CLASSDESC | byte | 114 | |
| | Class Full Name | UTF | "example.serialization.Foo" | |
| | Serialization UID | long | | |
| | Flags | byte | 2 (SC_SERIALIZABLE) | Indicates whether the class defines a writeObject method, and whether it is serializable, externalizable, an enum type, etc. |
| Serial Field Descriptor | Serial Field Count | short | 2 | Two serial fields |
| looping | Field Type Code | byte | 74 | 'J': long. See: ObjectStreamField.getTypeCode() |
| | Field Name | UTF | "size" | |
| | Field Type Code | byte | 76 | 'L': class or interface |
| | Field Name | UTF | "name" | |
| end looping | Field Type String | String | "Ljava/lang/String;" | For non-primitive fields only |
| | TC_ENDBLOCKDATA | byte | 120 | |
| Super Class Desc | | | 112 (TC_NULL) | No super class in this example |
| Serial Data | | | | out.defaultWriteObject() |
| Primitive Field Values | | byte[] | 10L represented as byte[8] | Primitive values are written compactly as byte[]. |
| Object Field Values | | | | |
| Looping obj fields | Serial Value | | "abc" (String) | |
| | TC_ENDBLOCKDATA | byte | | Written only when the class defines writeObject (SC_WRITE_METHOD); absent for this class |

* Serialization constants defined in ObjectStreamConstants


See: ObjectOutputStream#writeOrdinaryObject, ObjectStreamClass#writeNonProxy, ObjectOutputStream#writeSerialData
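The class descriptor can be walked by hand with a plain DataInputStream. The sketch below does that for a hypothetical stand-in class (SimpleFoo, with serialVersionUID 1 chosen only to keep the output short). It assumes a simple class with no superclass descriptor and no class annotations, so it is a reading aid rather than a general parser.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class DescriptorDumpDemo {
    /** Hypothetical stand-in for Foo: one long field and one String field. */
    static final class SimpleFoo implements Serializable {
        private static final long serialVersionUID = 1L;
        private final long size;
        private final String name;

        SimpleFoo(long size, String name) {
            this.size = size;
            this.name = name;
        }
    }

    /** Serializes obj and prints the header and class descriptor fields in stream order. */
    static String describe(Serializable obj) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(obj);
            }
            DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()));
            StringBuilder sb = new StringBuilder();
            sb.append(String.format("magic=%04x", in.readShort() & 0xffff));
            sb.append(" version=").append(in.readShort());
            sb.append(" object=").append(in.readByte());      // TC_OBJECT
            sb.append(" classdesc=").append(in.readByte());   // TC_CLASSDESC
            sb.append(" class=").append(in.readUTF());
            sb.append(" uid=").append(in.readLong());
            sb.append(" flags=").append(in.readByte());
            int fieldCount = in.readShort();                   // a short, not an int
            for (int i = 0; i < fieldCount; i++) {
                char typeCode = (char) in.readByte();          // a single byte
                sb.append(" field=").append(typeCode).append(':').append(in.readUTF());
                if (typeCode == 'L' || typeCode == '[') {
                    in.readByte();                             // TC_STRING marker of the type string
                    sb.append(':').append(in.readUTF());
                }
            }
            sb.append(" end=").append(in.readByte());          // TC_ENDBLOCKDATA
            sb.append(" super=").append(in.readByte());        // TC_NULL
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(new SimpleFoo(10L, "abc")));
    }
}
```

Primitive fields are listed before object fields, which is why size appears before name regardless of declaration order.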


We can see that Java serialization mixes meta-data and data together. The original motivation of Java serialization was ambitious: to serialize and transfer any Java object across the wire, with the recipient honoring whatever object is sent over. Moreover, by preserving the meta-data as part of the payload, it tries to offer cross-version compatibility: you can serialize an old version of a Java object and deserialize it in a new version, and vice versa. Unfortunately, this leads to many thorny and sometimes fatal problems:

  • The stream is not compact, as it includes the schema as part of the payload. If performance is important to you, Java serialization should be handcrafted, or even better, avoided.

  • The scheme is a security minefield if the stream cannot be trusted. 

Custom writeObject and readObject

The serialization specification supports custom read and write. We can change the class Foo as follows:


private transient long size;
private transient String name;

private void writeObject(ObjectOutputStream out) throws IOException {
   out.defaultWriteObject();
   out.writeLong(size);
   out.writeUTF(name);
}

private void readObject(ObjectInputStream in)
       throws IOException, ClassNotFoundException {
   in.defaultReadObject();
   size = in.readLong();
   name = in.readUTF();
}



The methods defaultWriteObject() and defaultReadObject() are responsible for writing and reading the serializable (non-static, non-transient) fields. In this example, because all candidate fields are transient, the default calls write and read no field data.


The serialization stream of the object is as follows:


| Structure | Section | Type | Sample Value | Explanation |
|---|---|---|---|---|
| Object | TC_OBJECT | byte | 115 | |
| Class Descriptor | TC_CLASSDESC | byte | 114 | |
| | Class Full Name | UTF | "example.serialization.Foo" | |
| | Serialization UID | long | | |
| | Flags | byte | 3 (bitwise OR of SC_SERIALIZABLE and SC_WRITE_METHOD) | Indicates whether the class defines a writeObject method, and whether it is serializable, externalizable, an enum type, etc. |
| Serial Field Descriptor | Serial Field Count | short | 0 | No serializable fields in this example |
| | TC_ENDBLOCKDATA | byte | 120 | |
| Super Class Desc | | | 112 (TC_NULL) | No super class in this example |
| Serial Data | | | | writeObject(); the values below are framed as block data |
| | | long | 10 | size field |
| | | UTF | "abc" | name field |
| | TC_ENDBLOCKDATA | byte | 120 | |




Our custom writeObject writes the serial data more efficiently for two reasons:

  • making all serializable fields transient reduces the size of the class descriptor (though such a use case is better served by Externalizable)

  • writeUTF is slightly more compact than writing a String object, because it omits the TC_STRING marker.

 In the end, it reduces the stream size from 93 to 62 bytes (with size = 10, name = “abc”).
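The comparison can be reproduced with a sketch like the following. The two nested classes are hypothetical stand-ins for the two versions of Foo; the absolute byte counts depend on the class names, so the sketch only prints them rather than claiming specific figures.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class WriteObjectSizeDemo {
    /** Default serialization: both fields are described in the class descriptor. */
    static final class DefaultFoo implements Serializable {
        private static final long serialVersionUID = 1L;
        private long size;
        private String name;

        DefaultFoo(long size, String name) {
            this.size = size;
            this.name = name;
        }
    }

    /** Custom serialization: all fields transient, values written as primitives. */
    static final class CustomFoo implements Serializable {
        private static final long serialVersionUID = 1L;
        private transient long size;
        private transient String name;

        CustomFoo(long size, String name) {
            this.size = size;
            this.name = name;
        }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            out.writeLong(size);
            out.writeUTF(name);
        }

        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            size = in.readLong();
            name = in.readUTF();
        }
    }

    /** Returns the total number of bytes produced by writing obj once. */
    static int streamSize(Object obj) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        System.out.println("default: " + streamSize(new DefaultFoo(10L, "abc")));
        System.out.println("custom:  " + streamSize(new CustomFoo(10L, "abc")));
    }
}
```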


Custom writeObject and readObject are typically used as an add-on to the Java default serialization.  The default write and read take care of serializable fields, and the custom implementation is used to support non-serializable fields. 


Of course, you can make all fields transient as in this example. Serialization is then the sole responsibility of your custom write and read. However, such a use case is better served by Externalizable.


Externalizable

Interface Externalizable explicitly shifts the serialization responsibility from the Java platform to the class author via the methods writeExternal and readExternal.


For example

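import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

public class Bar implements Externalizable {
   private long size;
   private String name;
   …

   // Externalizable uses ObjectOutput/ObjectInput, not the stream classes
   @Override
   public void writeExternal(ObjectOutput out) throws IOException {
       out.writeLong(size);
       out.writeUTF(name);
   }

   @Override
   public void readExternal(ObjectInput in)
           throws IOException, ClassNotFoundException {
       // no defaultReadObject() here: Externalizable has no default fields
       size = in.readLong();
       name = in.readUTF();
   }
}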



A sample serialization stream is as follows:


| Structure | Section | Type | Sample Value | Explanation |
|---|---|---|---|---|
| Object | TC_OBJECT | byte | 115 | |
| Class Descriptor | TC_CLASSDESC | byte | 114 | |
| | Class Full Name | UTF | "example.serialization.Bar" | |
| | Serialization UID | long | | |
| | Flags | byte | 12 (bitwise OR of SC_EXTERNALIZABLE and SC_BLOCK_DATA) | Indicates whether the class defines a writeObject method, and whether it is serializable, externalizable, an enum type, etc. |
| Serial Field Descriptor | Serial Field Count | short | 0 | Always 0 for Externalizable |
| | TC_ENDBLOCKDATA | byte | 120 | |
| Super Class Desc | | | 112 (TC_NULL) | No super class in this example |
| Serial Data | | | | writeExternal() |
| | | long | 10 | size field |
| | | UTF | "abc" | name field |
| | TC_ENDBLOCKDATA | byte | 120 | |



The stream is almost the same as the previous custom Serializable one where all fields are marked as transient. If your class is designed to entirely take care of its serialization, you should use Externalizable.
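Here is a runnable sketch of an Externalizable round trip that also reads the flags byte straight out of the stream. ExtBar is a hypothetical, complete stand-in for Bar; the flags offset is computed from the layout described above and assumes an ASCII class name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

final class ExternalizableFlagsDemo {
    /** Hypothetical stand-in for Bar. */
    static final class ExtBar implements Externalizable {
        private static final long serialVersionUID = 1L;
        long size;
        String name;

        /** Externalizable needs a public no-arg constructor for deserialization. */
        public ExtBar() { }

        ExtBar(long size, String name) {
            this.size = size;
            this.name = name;
        }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeLong(size);
            out.writeUTF(name);
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            size = in.readLong();
            name = in.readUTF();
        }
    }

    static byte[] serialize(Object obj) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return buffer.toByteArray();
    }

    static Object deserialize(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reads the descriptor flags: header(4) + TC_OBJECT + TC_CLASSDESC + UTF name + UID(8). */
    static int flagsByte(byte[] stream, Class<?> type) {
        int offset = 4 + 1 + 1 + 2 + type.getName().length() + 8;
        return stream[offset];
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(new ExtBar(10L, "abc"));
        ExtBar copy = (ExtBar) deserialize(bytes);
        System.out.println(copy.size + " " + copy.name + " flags=" + flagsByte(bytes, ExtBar.class));
    }
}
```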

Serialization Proxy

The serialization proxy pattern provides better control over serialization by substituting a proxy object, which the class supplies through the method writeReplace. For example,


import java.io.Externalizable;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.Serializable;

public class ProxFoo implements Serializable {
   // Not needed, as the serialization is done by the proxy
   // private static final long serialVersionUID = 7484655704105000312L;

   private final long size;
   private final String name;

   public ProxFoo(long size, String name) {
       this.size = size;
       this.name = name;
   }

   // Default constructor no longer needed with proxy
   // public ProxFoo() { }

   // rejects streams that try to bypass the proxy
   private void readObject(ObjectInputStream stream) throws InvalidObjectException {
       throw new InvalidObjectException("Proxy required");
   }

   // provides the proxy object to be serialized
   private Object writeReplace() {
       return new SerializationProxy(this);
   }

   private static class SerializationProxy implements Externalizable {
       private static final long serialVersionUID = 5726340402515774393L;

       private long size;
       private String name;

       // required by Externalizable
       public SerializationProxy() { }

       SerializationProxy(ProxFoo p) {
           size = p.size;
           name = p.name;
       }

       @Override
       public void writeExternal(ObjectOutput out) throws IOException {
           out.writeLong(size);
           out.writeUTF(name);
       }

       @Override
       public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
           size = in.readLong();
           name = in.readUTF();
       }

       // Constructs the expected type during deserialization
       private Object readResolve() {
           return new ProxFoo(size, name);
       }
   }
}



A sample stream of the class is as follows:

| Structure | Section | Type | Sample Value | Explanation |
|---|---|---|---|---|
| Object | TC_OBJECT | byte | 115 | |
| Class Descriptor | TC_CLASSDESC | byte | 114 | |
| | Class Full Name | UTF | "example.serialization.ProxFoo$SerializationProxy" | |
| | Serialization UID | long | 5726340402515774393 | Proxy's serial UID |
| | Flags | byte | 12 (bitwise OR of SC_EXTERNALIZABLE and SC_BLOCK_DATA) | Indicates whether the class defines a writeObject method, and whether it is serializable, externalizable, an enum type, etc. |
| Serial Field Descriptor | Serial Field Count | short | 0 | Always 0 for Externalizable |
| | TC_ENDBLOCKDATA | byte | 120 | |
| Super Class Desc | | | 112 (TC_NULL) | No super class in this example |
| Serial Data | | | | writeExternal() |
| | | long | 10L represented as byte[8] | size field |
| | | UTF | "abc" | name field |
| | TC_ENDBLOCKDATA | byte | 120 | |



The benefits of the serialization proxy pattern include:

  • to avoid extralinguistic construction of an object, as readResolve can construct the deserialized type by a normal constructor

  • to better enforce invariants

  • to make it possible to deserialize final fields.
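To illustrate the invariant and final-field points, here is a compact sketch with a hypothetical Range class whose constructor checks lo <= hi. The proxy here is a plain Serializable class rather than Externalizable, purely to keep the example short; deserialization still ends in the normal constructor via readResolve.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class ProxyInvariantDemo {
    /** Hypothetical immutable class with an invariant enforced by its constructor. */
    static final class Range implements Serializable {
        private final long lo;
        private final long hi;

        Range(long lo, long hi) {
            if (lo > hi) {
                throw new IllegalArgumentException("lo > hi");
            }
            this.lo = lo;
            this.hi = hi;
        }

        long lo() { return lo; }
        long hi() { return hi; }

        /** The proxy, not the Range itself, is what ends up in the stream. */
        private Object writeReplace() {
            return new SerializationProxy(lo, hi);
        }

        /** Rejects any stream that tries to deserialize a Range directly. */
        private void readObject(ObjectInputStream in) throws InvalidObjectException {
            throw new InvalidObjectException("Proxy required");
        }

        private static final class SerializationProxy implements Serializable {
            private static final long serialVersionUID = 1L;
            private final long lo;
            private final long hi;

            SerializationProxy(long lo, long hi) {
                this.lo = lo;
                this.hi = hi;
            }

            /** Rebuilds the real object through its constructor, re-checking the invariant. */
            private Object readResolve() {
                return new Range(lo, hi);
            }
        }
    }

    static Range roundTrip(Range range) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(range);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                return (Range) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Range copy = roundTrip(new Range(1L, 5L));
        System.out.println(copy.lo() + ".." + copy.hi());
    }
}
```

A tampered stream carrying lo > hi would fail inside the Range constructor instead of producing a corrupt object.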


Object and Type References

While Java serialization is not a compact format, it does try to minimize the serialization overhead:

  • If an object is already serialized, it is then referred to by its handle

  • If a class is already encountered, the class descriptor is referred to by its handle.


A sample serialization structure of an object with a referenced class descriptor is as follows:


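| Structure | Section | Type | Sample Value | Explanation |
|---|---|---|---|---|
| Object | TC_OBJECT | byte | 115 | |
| Class Descriptor | TC_REFERENCE | byte | | |
| | Class Descriptor Handle | int | | Reference to an existing class descriptor |
| Serial Data | Serial data | | | Depending on the serialization implementation of the class |
| | TC_ENDBLOCKDATA | byte | | Only when the class defines writeObject or is Externalizable |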



In other words, each object with a referenced descriptor spends an extra 1+1+4=6 bytes on markers and the class descriptor handle, plus one more byte for TC_ENDBLOCKDATA when the class defines writeObject or is Externalizable. Take List<Long> as an example. Each long value takes 8 bytes, but every distinct Long element also carries 6 bytes of markers and meta-data reference. The scheme is very inefficient.
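Both kinds of handle can be observed by measuring how much the stream grows. The sketch below (class and method names are hypothetical) compares lists that differ by one Long element, and a stream in which the same object is written once versus twice.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class HandleOverheadDemo {
    /** Returns the stream size after writing obj the given number of times. */
    static int size(Object obj, int times) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            for (int i = 0; i < times; i++) {
                out.writeObject(obj);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    /** Returns the stream size of an ArrayList holding n distinct Long objects. */
    static int sizeOfLongList(int n) {
        List<Long> values = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            // Values outside the small-Long cache, so each element is a distinct object.
            values.add(1_000_000L + i);
        }
        return size(values, 1);
    }

    public static void main(String[] args) {
        // Growth caused by one more Long whose class descriptor is already in the stream.
        System.out.println("per extra Long: " + (sizeOfLongList(3) - sizeOfLongList(2)));
        // Growth caused by writing an already-written object again (object handle).
        System.out.println("per repeated object: " + (size("abc", 2) - size("abc", 1)));
    }
}
```

The first figure is the 8-byte value plus the per-object markers and descriptor handle; the second is just a reference marker plus a 4-byte handle.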


Serialization by Primitives

Based on what we have learned, when you handcraft serialization, you should aim to serialize the primitive values of the contained fields directly, instead of relying on the default object serialization. This has many advantages:

  • It helps you serialize the logical data model instead of your implementation details, making it possible to change the class implementation later.

  • It is much more compact.

  • It is more secure.

  • It minimizes the scope of serialization, as it does not force the contained objects to implement Serializable.


Here is an example,


import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Container implements Externalizable {
   private static final long serialVersionUID = 5174256763653270387L;

   private List<Bar> bars;

   public Container(List<Bar> bars) {
       this.bars = (bars == null) ? Collections.emptyList() : bars;
   }

   // Externalizable requires a public no-arg constructor
   public Container() { }

   public List<Bar> getBars() {
       return bars;
   }

   @Override
   public void writeExternal(ObjectOutput out) throws IOException {
       int count = bars == null ? 0 : bars.size();
       out.writeInt(count);
       if (count > 0) {
           for (Bar bar : bars) {
               // serialize Bar using primitive data
               out.writeLong(bar.getSize());
               out.writeUTF(bar.getName());
           }
       }
   }

   @Override
   public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
       // must mirror writeInt(count); read() would consume only a single byte
       int count = in.readInt();
       bars = (count == 0) ? Collections.emptyList() : new ArrayList<>(count);
       for (int index = 0; index < count; index++) {
           long size = in.readLong();
           String name = in.readUTF();
           bars.add(new Bar(size, name));
       }
   }
}


Compared with the default scheme to serialize Bar objects directly, this approach reduces the stream size by 50%. 
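A measurement along these lines can be sketched as follows. Item and CompactItems are hypothetical stand-ins for Bar and Container, and the sample data is invented, so the exact saving will differ from the figure quoted above; the sketch only prints both sizes.

```java
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class PrimitiveContainerDemo {
    /** Hypothetical element type, serializable by the default mechanism. */
    static final class Item implements Serializable {
        private static final long serialVersionUID = 1L;
        final long size;
        final String name;

        Item(long size, String name) {
            this.size = size;
            this.name = name;
        }
    }

    /** Hypothetical container that writes its elements as primitives only. */
    static final class CompactItems implements Externalizable {
        private static final long serialVersionUID = 1L;
        List<Item> items = new ArrayList<>();

        public CompactItems() { }

        CompactItems(List<Item> items) {
            this.items = items;
        }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeInt(items.size());
            for (Item item : items) {
                out.writeLong(item.size);
                out.writeUTF(item.name);
            }
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            int count = in.readInt();
            items = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                items.add(new Item(in.readLong(), in.readUTF()));
            }
        }
    }

    /** Builds n items with distinct names so that no string is shared by handle. */
    static List<Item> sampleItems(int n) {
        List<Item> items = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            items.add(new Item(i, "item-" + i));
        }
        return items;
    }

    static int streamSize(Object obj) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        List<Item> items = sampleItems(100);
        System.out.println("default list: " + streamSize(new ArrayList<>(items)));
        System.out.println("primitives:   " + streamSize(new CompactItems(items)));
    }
}
```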

Summary

This article examines how serialization approaches impact the serialization stream, including

  • Default Serializable

  • Custom writeObject and readObject

  • Externalizable

  • Serialization proxy

  • Referenced objects and types


When you handcraft serialization, you should serialize via primitives. 


