3.1.3 Serialize data structure

Random access and binary access are useful for data that is private to a program. However, unless great care has been taken to design the file format, binary files (that permits random access) are difficult to export. For example, a PowerPC processor is big endian, while an Intel processor is little endian. Unless care has been taken to make sure numbers are represented using one standard, the same file can be interpreted differently.

Another problem with binary files is that the content is difficult to be read by a person.

This is why it is sometimes important to serialize a data structure so that it can be exported to a text file, or any stream. When a data structure is serialized into a text stream, it is no longer practical or even possible to use a fixed size.

Although the serialized format can be arbitrary, it is usually a good idea to use well accepted structured formats. In the past, a list (as in Lisp list) format is fairly common. The serialization code for our struct X may appear as follows:

void X_serialize(struct X *pX, FILE *pFile)
{
  fprintf(pFile, "(X (name \"%s\") (gender '%c'))", pX->name, pX->gender);
}

Because of the wide acceptance of XML (extensible markup language), we can also implement the serialization function as follows:

void X_serialize(struct X *pX, FILE *pFile)
{
  fprintf(pFile, "<X> <name> \"%s\"</name> <gender> '%c'</gender></X>", pX->name, pX->gender);
}

Note that in both cases, nextItem is not represented. This is because we can represent a list in a stream just be their sequential ordering in the stream. Any attempt to random read serialized structures in a file require either a companion index file or a pre-scan and an in-memory index. It is no longer practical to support random write access with a serialized stream.

The serialization of any tree-like (each node is only pointed to by one other node) structure can be serialized without explicit references. A list is simply a special tree with one long branch.

However, a data structure that contains multiple references to one node (such as a general digragph) requires special attention. The serialization needs to include a unique text ID for each node, and references utilize the text ID. It is also necessary to use special traversal algorithms through the data structure so that a node is not represented more than once in the stream.

When a program reads a serialized stream, it needs to reconstruct the data structure either in memory or in a binary random read/write file. This reconstruction requires the parsing of the Lisp-like or XML input stream. Fortunately, the syntax of Lisp and XML are very simple to parse

The parsing of a Lisp or XML file is beyond the scope of this module. However, that makes a good topic for the discussion of recursions.

Copyright © 2006-10-16 by Tak Auyeung