Java serialization is one of the most misused features of the language. If you have the choice not to use it, do not use it. Otherwise, soldier on.

This article examines Java serialization from the perspective of the serialized stream, and particularly how various serialization approaches affect the content of the serialization stream. I find it helpful for my serialization design.

Java Serialization Stream Format

A Java serialization stream is structured as follows:

Stream Header

A stream starts with the header of two short integers:

Structure	Section	Type	Sample Value	Explanation
Stream Header	STREAM_MAGIC	short	-21267	0xaced
	STREAM_VERSION	short	5	V2 introduced in JDK 1.2: primitive data is written in block data mode and is terminated with TC_ENDBLOCKDATA.

See: ObjectOutputStream#writeStreamHeader
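To make the header concrete, here is a minimal sketch that creates an ObjectOutputStream, writes nothing, and reads the two shorts back with a plain DataInputStream. The class and method names (StreamHeaderDemo, emptyStream, readHeader) are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

class StreamHeaderDemo {

    /** Returns the bytes of a stream that contains nothing but the header. */
    static byte[] emptyStream() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            // The constructor writes the header; closing the stream flushes it.
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the two header shorts: STREAM_MAGIC, then STREAM_VERSION. */
    static short[] readHeader(byte[] stream) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(stream))) {
            return new short[] { in.readShort(), in.readShort() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        short[] header = readHeader(emptyStream());
        // Mask to print the magic as an unsigned 16-bit value.
        System.out.printf("magic=0x%04x version=%d%n", header[0] & 0xFFFF, header[1]);
    }
}
```

Read as a signed short, 0xaced is -21267, which is the sample value in the table above.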

Primitives

Primitives are stored in their byte[] representation without any type meta-data. As a result, reads must match writes exactly, in both type and order.

See: ObjectOutputStream#writeLong, #writeShort, #writeByte
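The following sketch shows what "must match exactly" means in practice: two ints are written, then read back as a single long. Nothing in the stream records the primitive types, so the read succeeds and silently yields a different value. The names PrimitiveMismatchDemo, writeTwoInts and readAsLong are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

class PrimitiveMismatchDemo {

    /** Writes two ints through an ObjectOutputStream and returns the raw stream. */
    static byte[] writeTwoInts(int first, int second) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeInt(first);
            out.writeInt(second);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the same stream back as one long: no error, just a different value. */
    static long readAsLong(byte[] stream) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(stream))) {
            return in.readLong();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] stream = writeTwoInts(1, 2);
        System.out.println("stream length = " + stream.length);
        // The 8 bytes 00 00 00 01 00 00 00 02 are interpreted as a big-endian long.
        System.out.println("read as long  = " + readAsLong(stream));
    }
}
```

The only framing around the 8 payload bytes is a short block-data marker and length; there is still no per-value type information.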

Strings

Strings get special treatment, sitting between primitives and full-blown objects. A string starts with a marker constant, but unlike an object it does not need a type descriptor.

Structure	Section	Type	Sample Value	Explanation
String	TC_STRING	byte		TC_LONGSTRING if the string length > 0xFFFF
	String length	short		long for TC_LONGSTRING
	String value	UTF

See: ObjectOutputStream#writeString
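As a quick check of this layout, the sketch below serializes the string "abc" and prints the stream as hex. The helper names (StringStreamDemo, serialize, hex) are assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

class StringStreamDemo {

    /** Serializes one value with writeObject and returns the raw stream. */
    static byte[] serialize(Object value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Formats bytes as lowercase hex without separators. */
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // aced 0005 | 74 (TC_STRING) | 0003 (length) | 61 62 63 ("abc")
        System.out.println(hex(serialize("abc")));
    }
}
```

After the 4-byte header, the string costs 1 marker byte, 2 length bytes, and 3 bytes of content.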

Objects

There are several variations of the serialization format of objects. In the most common and simple case, a serializable class implements the Serializable marker interface. For example,

import java.io.Serializable;

public class Foo implements Serializable {
    private static final long serialVersionUID = 8999613412220288700L;

    private long size;
    private String name;

    public Foo(long size, String name) {
        this.size = size;
        this.name = name;
    }

    // no-arg constructor; default deserialization of a Serializable class does not call it
    public Foo() { }
}

The serialization stream of the above object consists of two main sections:

Class descriptor, including class full name, serialization UID, serializable fields
Serial data

Structure	Section	Type	Sample Value	Explanation
Object	TC_OBJECT	byte	115
Class Descriptor	TC_CLASSDESC	byte	114
	Class Full Name	UTF	“example.serialization.Foo”
	Serialization UID	long
	Flags	byte	2 (SC_SERIALIZABLE)	whether the class defines a writeObject method, and whether the class is serializable, externalizable, or an enum type, etc
Serial Field Descriptor	Serial Field Count	short	2	Two serial fields
looping	Field Type Code	byte	74	‘J’: long. See: ObjectStreamField#getTypeCode
	Field Name	UTF	“size”
	Field Type Code	byte	76	‘L’: class or interface
	Field Name	UTF	“name”
end looping	Field type String	String	“Ljava/lang/String;”	For non-primitive fields only
	TC_ENDBLOCKDATA	byte	120
Super Class Desc			112 (TC_NULL)	No super class in this example
Serial Data				out.defaultWriteObject(); no trailing TC_ENDBLOCKDATA, as the class defines no writeObject method
Primitive Field values		byte[]	10L represented as byte[8]	Write primitive values compactly as byte[].
Object Field Values
Looping obj fields	Serial Value		“abc” (String)

* Serialization constants defined in ObjectStreamConstants

See: ObjectOutputStream#writeOrdinaryObject, #writeSerialData; ObjectStreamClass#writeNonProxy
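The start of the class descriptor can be read back by hand with a DataInputStream, which is a handy way to check the table. This is only a sketch: ObjectLayoutDemo, its nested Foo, and the serialVersionUID of 1 are stand-ins for the example, and only the first field descriptor is parsed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class ObjectLayoutDemo {

    /** Stand-in for the article's Foo: one long field and one String field. */
    static class Foo implements Serializable {
        private static final long serialVersionUID = 1L;
        private final long size;
        private final String name;

        Foo(long size, String name) {
            this.size = size;
            this.name = name;
        }
    }

    static byte[] serialize(Object value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Walks the stream up to the first field descriptor and summarizes it. */
    static String describe(byte[] stream) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(stream))) {
            in.readShort();                       // STREAM_MAGIC
            in.readShort();                       // STREAM_VERSION
            int tcObject = in.readByte();         // TC_OBJECT
            int tcClassDesc = in.readByte();      // TC_CLASSDESC
            String className = in.readUTF();      // class full name
            long suid = in.readLong();            // serialVersionUID
            int flags = in.readByte();            // SC_* flags
            int fieldCount = in.readShort();      // number of serial fields
            char firstType = (char) in.readByte(); // type code of the first field
            String firstName = in.readUTF();      // name of the first field
            return "tc=" + tcObject + "," + tcClassDesc + " class=" + className
                    + " suid=" + suid + " flags=" + flags + " fields=" + fieldCount
                    + " first=" + firstType + ":" + firstName;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(serialize(new Foo(10L, "abc"))));
    }
}
```

Primitive fields are listed before object fields, so the first descriptor read here is the long field, with type code 'J'.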

We can see that Java serialization mixes the meta-data and data together. The original motivation of Java serialization was ambitious: to serialize and transfer any Java object across the wire, with the recipient honoring any object sent over. Moreover, by preserving the meta-data as part of the payload, it tries to offer cross-version compatibility: you can serialize an old version of a Java object and deserialize it in a new version, and vice versa. Unfortunately, this leads to many thorny and even fatal problems:

The stream is not compact, as it includes the schema as part of the payload. If performance is important to you, Java serialization should be handcrafted, or even better, avoided.
The scheme is a security minefield if the stream cannot be trusted.

Custom writeObject and readObject

The serialization specification supports custom read and write. We can change the class Foo as follows:

// transient fields are skipped by default serialization
private transient long size;
private transient String name;

private void writeObject(ObjectOutputStream out) throws IOException {
    out.defaultWriteObject();
    out.writeLong(size);
    out.writeUTF(name);
}

private void readObject(ObjectInputStream in)
        throws IOException, ClassNotFoundException {
    in.defaultReadObject();
    // read in exactly the order written
    size = in.readLong();
    name = in.readUTF();
}

The methods defaultWriteObject() and defaultReadObject() are responsible for writing and reading the serializable fields. In this example, because all candidate fields are now transient, no serializable fields are left, and the default calls write and read no field data.

The serialization stream of the object is as follows:

Structure	Section	Type	Sample Value	Explanation
Object	TC_OBJECT	byte	115
Class Descriptor	TC_CLASSDESC	byte	114
	Class Full Name	UTF	“example.serialization.Foo”
	Serialization UID	long
	Flags	byte	3 (SC_SERIALIZABLE | SC_WRITE_METHOD)	whether the class defines a writeObject method, and whether the class is serializable, externalizable, or an enum type, etc
Serial Field Descriptor	Serial Field Count	short	0	No serializable fields in this example
	TC_ENDBLOCKDATA	byte	120
Super Class Desc			112 (TC_NULL)	No super class in this example
Serial Data				writeObject()
		long	10	size field
		UTF	“abc”	name field
	TC_ENDBLOCKDATA	byte	120

Our custom writeObject writes the serial data more efficiently for two reasons:

making all serializable fields transient reduces the size of the class descriptor (though such a use case is better served by Externalizable)
writeUTF is slightly more compact than writing a String object.

In the end, it reduces the stream size from 93 to 62 bytes (with size = 10, name = “abc”).
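The comparison is easy to reproduce. The sketch below measures both variants side by side; DefaultFoo and CustomFoo are hypothetical stand-ins for the two versions of Foo, and because the class name is part of the stream, the byte counts will not match the figures quoted above exactly.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class WriteObjectSizeDemo {

    /** Default serialization: both fields appear in the class descriptor. */
    static class DefaultFoo implements Serializable {
        private static final long serialVersionUID = 1L;
        private final long size;
        private final String name;

        DefaultFoo(long size, String name) {
            this.size = size;
            this.name = name;
        }
    }

    /** Custom serialization: transient fields written by hand. */
    static class CustomFoo implements Serializable {
        private static final long serialVersionUID = 1L;
        private transient long size;
        private transient String name;

        CustomFoo(long size, String name) {
            this.size = size;
            this.name = name;
        }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            out.writeLong(size);
            out.writeUTF(name);
        }

        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            size = in.readLong();
            name = in.readUTF();
        }
    }

    /** Serializes one object and returns the stream length in bytes. */
    static int streamSize(Object value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        System.out.println("default: " + streamSize(new DefaultFoo(10L, "abc")) + " bytes");
        System.out.println("custom:  " + streamSize(new CustomFoo(10L, "abc")) + " bytes");
    }
}
```

Most of the saving comes from the two field descriptors that no longer appear in the class descriptor.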

Custom writeObject and readObject are typically used as an add-on to the Java default serialization. The default write and read take care of serializable fields, and the custom implementation is used to support non-serializable fields.

Of course, you can make all fields transient as in this example. Serialization is then the sole responsibility of your custom write and read. However, such a use case is better served by Externalizable.

Externalizable

Interface Externalizable explicitly shifts the serialization responsibility from the Java platform to the class author via the methods writeExternal and readExternal.

For example

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

public class Bar implements Externalizable {
    private long size;
    private String name;

    …

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeLong(size);
        out.writeUTF(name);
    }

    @Override
    public void readExternal(ObjectInput in)
            throws IOException, ClassNotFoundException {
        // no defaultReadObject() here: Externalizable has no default fields
        size = in.readLong();
        name = in.readUTF();
    }
}

A sample serialization stream is as follows:

Structure	Section	Type	Sample Value	Explanation
Object	TC_OBJECT	byte	115
Class Descriptor	TC_CLASSDESC	byte	114
	Class Full Name	UTF	“example.serialization.Bar”
	Serialization UID	long
	Flags	byte	12 (SC_EXTERNALIZABLE | SC_BLOCK_DATA)	whether the class defines a writeObject method, and whether the class is serializable, externalizable, or an enum type, etc
Serial Field Descriptor	Serial Field Count	short	0	Always 0 for Externalizable
	TC_ENDBLOCKDATA	byte	120
Super Class Desc			112 (TC_NULL)	No super class in this example
Serial Data				writeExternal()
		long	10	size field
		UTF	“abc”	name field
	TC_ENDBLOCKDATA	byte	120

The stream is almost the same as the previous custom Serializable one where all fields are marked as transient. If your class is designed to entirely take care of its serialization, you should use Externalizable.
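For completeness, here is a small round-trip sketch for an Externalizable class. It assumes a Bar with a public no-arg constructor and getters, nested inside a hypothetical ExternalizableDemo holder so that the example fits in one file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

class ExternalizableDemo {

    public static class Bar implements Externalizable {
        private long size;
        private String name;

        // Externalizable requires a public no-arg constructor.
        public Bar() { }

        public Bar(long size, String name) {
            this.size = size;
            this.name = name;
        }

        public long getSize() { return size; }
        public String getName() { return name; }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeLong(size);
            out.writeUTF(name);
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            // Must mirror writeExternal: same types, same order.
            size = in.readLong();
            name = in.readUTF();
        }
    }

    static byte[] serialize(Object value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    static Object deserialize(byte[] stream) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(stream))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Bar copy = (Bar) deserialize(serialize(new Bar(10L, "abc")));
        System.out.println(copy.getSize() + " " + copy.getName());
    }
}
```

During deserialization the no-arg constructor runs first, and readExternal then fills in the fields.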

Serialization Proxy

The serialization proxy pattern gives better control over serialization by writing a proxy object, supplied by the method writeReplace, in place of the real one. For example,

import java.io.Externalizable;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.Serializable;

public class ProxFoo implements Serializable {
    // Not needed, as the serialization is done by the proxy
    // private static final long serialVersionUID = 7484655704105000312L;

    private final long size;
    private final String name;

    public ProxFoo(long size, String name) {
        this.size = size;
        this.name = name;
    }

    // Default constructor no longer needed with proxy
    // public ProxFoo() { }

    // rejects any stream that tries to bypass the proxy
    private void readObject(ObjectInputStream stream) throws InvalidObjectException {
        throw new InvalidObjectException("Proxy required");
    }

    // provides the proxy object to be serialized
    private Object writeReplace() {
        return new SerializationProxy(this);
    }

    private static class SerializationProxy implements Externalizable {
        private static final long serialVersionUID = 5726340402515774393L;

        private long size;
        private String name;

        // Externalizable requires a public no-arg constructor
        public SerializationProxy() { }

        SerializationProxy(ProxFoo p) {
            size = p.size;
            name = p.name;
        }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeLong(size);
            out.writeUTF(name);
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
            size = in.readLong();
            name = in.readUTF();
        }

        // Constructs the expected type during deserialization
        private Object readResolve() {
            return new ProxFoo(size, name);
        }
    }
}

A sample stream of the class is as follows:

Structure	Section	Type	Sample Value	Explanation
Object	TC_OBJECT	byte	115
Class Descriptor	TC_CLASSDESC	byte	114
	Class Full Name	UTF	“example.serialization.ProxFoo$SerializationProxy”
	Serialization UID	long	5726340402515774393	Proxy’s serial UID
	Flags	byte	12 (SC_EXTERNALIZABLE | SC_BLOCK_DATA)	whether the class defines a writeObject method, and whether the class is serializable, externalizable, or an enum type, etc
Serial Field Descriptor	Serial Field Count	short	0	Always 0 for Externalizable
	TC_ENDBLOCKDATA	byte	120
Super Class Desc			112 (TC_NULL)	No super class in this example
Serial Data				writeExternal()
		long	10L represented as byte[8]	size field
		UTF	“abc”	name field
	TC_ENDBLOCKDATA	byte	120

The benefits of the serialization proxy pattern include:

to avoid extralinguistic construction of an object, as readResolve can construct the deserialized type by a normal constructor
to better enforce invariants
to make it possible to deserialize final fields.
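The first two benefits can be observed directly. In the sketch below, a hypothetical Item class validates its argument in the constructor and counts constructor calls; after a round trip the counter shows that the deserialized copy went through the same constructor. For brevity the proxy here is a plain Serializable class rather than an Externalizable one.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class ProxyRoundTripDemo {

    /** Counts how often the Item constructor runs (for demonstration only). */
    static int constructorCalls = 0;

    static final class Item implements Serializable {
        private final long size;
        private final String name;

        Item(long size, String name) {
            if (size < 0) {
                throw new IllegalArgumentException("size must be >= 0");
            }
            this.size = size;
            this.name = name;
            constructorCalls++;
        }

        long getSize() { return size; }
        String getName() { return name; }

        // Serialize the proxy instead of this object.
        private Object writeReplace() {
            return new Proxy(this);
        }

        // Reject streams that contain an Item directly.
        private void readObject(ObjectInputStream in) throws InvalidObjectException {
            throw new InvalidObjectException("Proxy required");
        }

        private static final class Proxy implements Serializable {
            private static final long serialVersionUID = 1L;
            private final long size;
            private final String name;

            Proxy(Item item) {
                this.size = item.size;
                this.name = item.name;
            }

            // Rebuild the real object through its validating constructor.
            private Object readResolve() {
                return new Item(size, name);
            }
        }
    }

    static Item roundTrip(Item item) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(item);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Item) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Item copy = roundTrip(new Item(10L, "abc"));
        System.out.println(copy.getSize() + " " + copy.getName()
                + ", constructor calls: " + constructorCalls);
    }
}
```

Because the copy is built by the ordinary constructor, the size check cannot be bypassed by a crafted proxy payload.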

Object and Type References

While Java serialization is not a compact format, it does try to minimize the serialization overhead:

If an object is already serialized, it is then referred to by its handle
If a class is already encountered, the class descriptor is referred to by its handle.

A sample serialization structure of an object with a referenced class descriptor is as follows:

Structure	Section	Type	Sample Value	Explanation
Object	TC_OBJECT	byte	115
Class Descriptor	TC_REFERENCE	byte
	Class Descriptor Handle	int		Reference to an existing class descriptor
Serial Data	Serial data			Depending on the serialization implementation of the class
	TC_ENDBLOCKDATA	byte

In other words, each object with a referenced descriptor spends up to 1+1+4+1=7 extra bytes on markers and the meta-data reference (the trailing TC_ENDBLOCKDATA is written only when the class defines writeObject or is Externalizable). Take List<Long> as an example: serializing each long value takes 8 bytes, but we also need to serialize up to 7 additional bytes for the markers and meta-data. For small objects, the scheme is very inefficient.
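The cost of handles can be measured by comparing stream sizes. The sketch below (HandleDemo and streamSize are hypothetical names) writes one Long, then the same Long twice, then two distinct Longs. Long has no writeObject method, so no TC_ENDBLOCKDATA follows its data, and the measured overhead per extra Long is 6 bytes rather than 7.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

class HandleDemo {

    /** Writes each value with writeObject and returns the total stream size. */
    static int streamSize(Object... values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            for (Object value : values) {
                out.writeObject(value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        // Values outside the small-integer cache, so a and b are distinct objects.
        Long a = 1000L;
        Long b = 2000L;

        int one = streamSize(a);
        // Same object again: TC_REFERENCE (1 byte) + handle (4 bytes).
        System.out.println("repeat of same object: +" + (streamSize(a, a) - one) + " bytes");
        // New object of a known class:
        // TC_OBJECT (1) + TC_REFERENCE (1) + class descriptor handle (4) + long value (8).
        System.out.println("second Long:           +" + (streamSize(a, b) - one) + " bytes");
    }
}
```

So even in the best case, a boxed long in the stream costs 14 bytes for 8 bytes of payload.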

Serialization by Primitives

Based on what we have learned, when you handcraft serialization, you should aim to serialize the primitives of the contained fields directly, instead of relying on the default object serialization. This has many advantages:

It helps you to serialize the logical data model instead of your implementation details, making it possible for you to change the class implementation later.
It is much more compact.
It is more secure.
It minimizes the scope of serialization, as it does not force the contained objects to implement Serializable.

Here is an example,

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Container implements Externalizable {
    private static final long serialVersionUID = 5174256763653270387L;

    private List<Bar> bars;

    public Container(List<Bar> bars) {
        this.bars = (bars == null) ? Collections.emptyList() : bars;
    }

    // To make serialization happy
    public Container() { }

    public List<Bar> getBars() {
        return bars;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        int count = (bars == null) ? 0 : bars.size();
        out.writeInt(count);
        if (count > 0) {
            for (Bar bar : bars) {
                // serialize Bar using primitive data
                out.writeLong(bar.getSize());
                out.writeUTF(bar.getName());
            }
        }
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        // must mirror writeInt(count); in.read() would consume a single byte only
        int count = in.readInt();
        bars = (count == 0) ? Collections.emptyList() : new ArrayList<>(count);
        for (int index = 0; index < count; index++) {
            long size = in.readLong();
            String name = in.readUTF();
            bars.add(new Bar(size, name));
        }
    }
}

Compared with the default scheme to serialize Bar objects directly, this approach reduces the stream size by 50%.
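The saving depends on the data, so it is worth measuring on your own classes. Here is a minimal sketch of such a measurement, assuming a hypothetical Serializable Bar so that the default scheme can be compared with the primitive one; it makes no claim about the exact ratio.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ListSizeDemo {

    static class Bar implements Serializable {
        private static final long serialVersionUID = 1L;
        final long size;
        final String name;

        Bar(long size, String name) {
            this.size = size;
            this.name = name;
        }
    }

    /** Default scheme: let Java serialize the list and every Bar as an object. */
    static int defaultSize(List<Bar> bars) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new ArrayList<>(bars));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    /** Primitive scheme: a count followed by the raw fields of each Bar. */
    static int primitiveSize(List<Bar> bars) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeInt(bars.size());
            for (Bar bar : bars) {
                out.writeLong(bar.size);
                out.writeUTF(bar.name);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        List<Bar> bars = Arrays.asList(new Bar(1L, "a"), new Bar(2L, "b"), new Bar(3L, "c"));
        System.out.println("default:   " + defaultSize(bars) + " bytes");
        System.out.println("primitive: " + primitiveSize(bars) + " bytes");
    }
}
```

For short lists the fixed cost of the class descriptors dominates the default stream; for long lists the per-element markers and handles make up the difference.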

Summary

This article examines how serialization approaches impact the serialization stream, including

Default Serializable
Custom writeObject and readObject
Externalizable
Serialization proxy
Referenced objects and types

When you handcraft serialization, you should serialize via primitives.

Reference

Effective Java, 3rd Ed, Joshua Bloch
Java Object Serialization Specs

Re's Blog

Sunday, August 1, 2021

Understand Java Serialization from Serialization Stream
