2005-08-19 15:46

XML Serialization
-----------------
Daniel S. Wilkerson

Elsa provides Xml serialization and de-serialization for the AST
classes and the Type system classes (which include Scope, Variable,
templates etc.)

**** The meaning of serialization

Elsa provides strong guarantees as to the meaning of serialization;
this is not a best-effort feature.  More specifically, the following
diagrams commute.  These diagrams are actually tested in checkxml.mk
which contains optional tests that are not run by default but will run
if you type "make checkxml".  In particular, diagram C is currently
tested on 714 test and thus is considered to be very reliable.  Note
that the later the serialization comes, the more information it must
preserve and thus the more difficult it is to make the diagram
commute.

In diagram B below, "print" means pretty-print.  Note that "parse"
really means "parse and then typecheck to disambiguate the AST and
then discard the derived types".  That is, we do not test serializing
an ambiguous "ASForest".

    diagram B:
    parse ------------------------> typecheck -> print
         \                      /
          -> toXml -> fromXml -

In diagram C below, "print" means debug-print, which Scott says
contains strictly more information than pretty-printing.

    diagram C:
    parse -> typecheck ------------------------> print
                      \                      /
                       -> toXml -> fromXml -

There are plans to make diagram D, below, commute, but for that to
happen we have to finish deciding exactly what it means 1) to
de-serialize a lowered AST with some of the template constructs
missing, and 2) to print a lowered AST (should it print the template
instantiations without the primary in such a way that it would compile
and produce a program with the same semantics?).

    diagram D:
    parse -> typecheck -> lower ------------------------> print
                               \                      /
                                -> toXml -> fromXml -

**** How serialization works

The Elsa AST classes are maintained in Scott's AST language in .ast
files and are generated by the astgen tool.  Generic traverse()
methods are generated on each AST class for use with the Visitor
design pattern; the Visitor class is also generated.  AST Xml
serialization is done as a Visitor using this pattern.  All of this
auto-generation is necessitated by the fact that there are over 140 AST
classes; it also ensures uniformity.

The Elsa Type-system classes, by which we also mean Scope, Variable,
and template related classes, are serialized with a special-purpose
serialization walk rather than a Visitor design pattern.  There is a
generic type-system traverse() but it is not used; a special-purpose
walk is more straightforward for this purpose.

The AST and Type graphs are serialized by separate modules.  However,
the AST and Type graphs also point into one another.  Thus the two
modules call into one another at those points, and the tags for the
two sets of objects interleave with one another in the output.

* Mapping objects to tags

All classes map to a tag of the same name that is used to serialize
the state of the class.  Each member of a class maps to an attribute
of its tag.

        The state of "simple" members, such as a bool, int, or string,
are fully captured in the attribute value.

        The state of "complex" members, such as an object, cannot fit
into an attribute string, so we employ a layer of indirection which
Scott calls the "Rat's Nest Scheme" as follows.  Given a parent
pointing to a child: 1) in the parent tag for the attribute pointing
to the child a unique id value for the child is written out; 2) the
child is serialized as a child-tag nested within the parent tag; 3)
the child tag gets a unique "_id" field so that upon de-serialization
it can be matched up with the parent field pointing at it.  Since in
an OO language potentially any object can be pointed to, in practice
all tags corresponding objects get an "_id".  See below for more on
the non-trivial task of generating unique ids.

        The state of "container" members, such as lists or maps, is
dealt with by simply promoting this particular "static instance" of a
container to its own tag "class": that is, the tag name is made up of
1) the parent class and 2) the member name of the container in that
parent.  See below for more on tag naming convention.

        Containers being classes gives rise to the need for "item"
members: anonymous members of the new container "classes".  That is,
if we just serialize each member of a container, we might hit an
object that has been serialized before and therefore will be skipped
(see below); there would then be no record at all that it is in this
container.  Therefore we wrap members of list and map classes with
"Item" tags; they just hold a pointer to the contained object in case
it doesn't actually get serialized there.  These "item" tags are the
only ones that do not get their own "_id" tag as during parsing their
single attribute is just stored directly into their container parent;
they don't have a separate existence in the OO world so no one ever
points to them.

Circularities in the object graph to not pose a problem: All
*pointers* to objects are *always* serialized as an attribute, however
an *object* is only serialized *once*.

Note that to simplify things, all tags are container tags whether they
point to any complex objects or not.

**** How de-serialization works

Separate modules are also used for de-serializing the AST and Type
graphs.  However recall that their tags are mixed in a fine-grained
way in the Xml.  This situation is handled by the use of a design
pattern I call: "non-deterministic virtual-dispatch".  We take the
de-serializers for the AST and for the Types and register them both
with the general-purpose class XmlReader.  During parsing when a tag
or attribute is encountered XmlReader asks both of the registered
readers to handle it in turn.  Each reader 1) either handles it or not
and then 2) returns a boolean saying whether it was able to: that is,
we "non-deterministically" do virtual dispatch to both registered
reader "subclasses" at once.  It is an error if neither can handle it,
and the dispatch is really done serially and short-circuits at the
first call that says it handled the tag.  The consequence is that
parsers for multiple Xml sub-languages can be written separately and
their tags interleaved.

The de-serializer is a stack machine.  When we de-serialize a tag, an
object is constructed to represent it and pushed onto the stack.
Every tag has a special "_id" attribute and when an object is
de-serialized it is filed in the id2obj hashtable under its id.  When
we encounter a close tag, we check that it matches the object on the
top of the stack and pop.

An attribute is always parsed in the context of the object on the top
of the stack.  If it is a simple type, we just write it into the
appropriate field of the top object.  If it is an object id, we create
an UnsatLink object to record 1) the id, 2) a pointer to the member of
the top object that should be updated to point at the referent object
having that id when it is found later during link satisfaction, see
below, 3) the type of the pointer so we can deal with up-casting
problems that arise with multiple inheritance (see below), and 4)
whether this parent/child relationship is a pointer or is one of
embedding.

When the tags have been de-serialized we satisfy all of the dangling
links as follows.  For each UnsatLink we 1) look up its id in the
id2obj hashtable.  If the parent/child relationship is a pointer (as
recorded earlier) we 2a1) do an up-cast if needed from the
de-serialized object type to the link type, 2a2) do a pointer
assignment satisfying the link.  If the parent/child relationship is
one of embedding we 2b) call operator=() to initialize the object from
the temporary one created on the stack (which is no longer needed).

Note that there is an alternative possible design for the temporary
representation of unsatisfied links during de-serialization.  If you
create an UnsatLink object for each unsatisfied link it is going to fragment
your heap even if you delete them, it is slow to make them all etc.
When the parent points to the child rather than embedding it, the
unsatisfied link can be simply written as an int into the still-unused
pointer to the child.  To distinguish it from a real pointer, recall
that real points have at least two low-order zero bits; just set them
to something else.  There is an additional restriction: there are only
3 other possibilities on a 32-bit machine, so only up to three
identity prefixes (see below) can be stored this way.  When an object
is de-serialized we can now just look at the top of the stack and
search through every pointer member looking for an id that matches
that of the object we just de-serialized and replace it with the real
pointer.  This scheme reduces the load on the list of UnsatLink-s as
state above, and also on the id2obj hashtable.  Given the memory
locality, it is probably more cache friendly as well.  It is currently
unimplemented.

**** Identity

An object needs a unique id both so 1) we can make sure we don't
serialize it twice by "checking it off" in some hashtable when we
serialize it, and 2) so we can serialize separately a) the reference
to an object and b) the object itself.

This second feature is needed so that we can deal with circularities
in the object graph, but it also makes any graph, even a tree, easier
to serialize because we can separate 1) serializing the header node
with its references to its children from 2) serializing the children.
The consequence is that we can now use a Visitor design pattern to
serialize rather than a walk.  This design is essential to using a
generic lowered traversal to serialize the lowered AST rather than a
special-purpose lowered-serializing walk.  Such a special-purpose
design would be error prone and not a good factoring since both
serialization and generating a lowered view of the AST/Types are
complex.

It turns out that a unique id is harder to come up with than you
might thing.  The heap address of an object is not a unique id!  1)
There are circumstances where more than one object can have the same
heap address.  2) There are circumstances where one object can have
more than one heap address.  As an exercise you might want to come up
with an example of each situation before reading further.

Two objects with the same heap address: The situation where more than
one object can share the same heap address is with inheritance and
embedding.  If a class A has a member B that 1) comes first in the
list of A's members, 2) is embedded, not pointed to, then A and B will
share the same heap address.  The solution to this is to break
symmetry with a system of id-prefixes.  The prefix of an object is
just a function of its class and we choose prefixes such that no two
classes with the same prefix ever embed within each other, even
transitively(!); now the ids are unique.  Prefixes are documented
below.

The situation of two objects sharing the same heap address also arises
with inheritance, as usually in C++ the superclass is implemented as
an anonymous embedded member of the subclass.  With single inheritance
though, there is an easy trick: never serialize the superclass, just
serialize its members as if they were the members of the subclass
(which is the illusion that the OO name lookup rules give you anyway);
do the reverse during de-serialization.  Now, when a pointer points to
the subclass as if it were the superclass (the whole point of
inheritance) the fact that they share the same heap address is a
feature, not a bug.

Two heap addresses for the same object: With multiple inheritance, an
object can have more than one heap address: if an object A has
superclasses B and C, then a B-pointer to A and a C-pointer to A do
*not* have the same value as raw ints.  The solution has two parts.

        1) During serialization, we are sure to always attempt to
downcast any B or C to an A and render it as an A if we can so that 1)
information is not lost and that 2) the ids in the pointer attribute
of the parent and the _id attribute of the child will match.

        2) During de-serialization, when we de-serialize the pointer
to a B or C, we record that it was a B or C pointer.  When we match up
the pointer with the de-serialized object later, we up-cast the C
object to an A or B before setting the pointer.  Recall that since
serialization is not type-safe and we are not using the real
type-system to manipulate these objects so we have to record what casts
are genuinely correct as a separate hand-rolled run-time-type-system as
we do this.

**** Container non-uniformity

Containers within Elsa come in several templatized classes which are
in turn instantiated with several container classes.  We don't want to
make this externally visible in the Xml schema as other analysis tools
may have different external representations.  We therefore only
maintain two kinds of containers externally: a "List", sequence of
objects, and a "NameMap", a map from strings to objects.

Note that the de-serialization handles this situation by using one
kind of list for de-serializing lists, an ASTList, and one kind of
name-map for de-serializing NameMaps, a StringRefMap.  This
one-class-fits-all approach during de-serialization is particularly
important for the problematic FakeLists which exist only in the
type-system and have no distinct existence separate from their
contents.  It is also helpful for lists that do not have a constant
time append and would have to be prepended to and then reversed.  All
of this could be handled without the use of ASTList intermediaries at
the expense of more complexity in the de-serialization code.

However, the situation is more complex than that: when a list tag is
initially de-serialized, we construct an ASTList template instantiation
appropriate to the kind of objects that will eventually be contained,
such as an ASTList<Type>.  We do not want to special-purpose the code
for appending to a list as new list items are encountered, though it
could be done.  We therefore exploit our internal knowledge of the
implementation of ASTList that its contents are pointed to and not
embedded.  Since all pointers are the same size (on my machine anyway)
this makes a diagram commute that should not according to the C++ type
system: we simply cast all ASTLists to ASTList<char> and treat them as
such for purposes of appending a new element.  Recall that there is a
final conversion of the list into the actual type that it will be
after serialization is complete, which may not even be an ASTList.
Something similar is done for name maps.

Note that there is a separate problem related to containers in that
the ephemeral "item" tags mentioned above which wrap their contents do
not map to an object that has permanent existence.  When an item close
tag is encountered, we simply use the id2obj hashtable right then to
look up the pointed to object and insert it into the container; the
item object on the stack can then be deleted.  (Note that at the
*close* item tag, the contained object must have been de-serialized
already, either immediately before or before that at some earlier
time.)  This process prevents item objects from participating in the
final linker satisfaction process, which, given their ephemeral
existence, would be unnatural.

However this scheme has one unfortunate consequence: this is the only
place where the id2obj hashtable has lookups called on it before
linker satisfaction time.  Since there are *no* insertions into the
hashtable once linker satisfaction starts if there were also *no*
lookups at all before linker satisfaction starts we could employ a
faster algorithm for hashtable construction: 1) while inserting into
the table, just maintain a list, not a proper hashtable, 2) just at
the point where the insertions stop and the lookups start, sort.

**** Conventions

Tag names for classes are just the class name.  For nested classes, we
use the fully qualified name and replace the '::' with an underscore.

Tag names for containers the following three things concatenated with
underscores 1) the kind (List, NameMap, Map), 2) the name of the class
that has the container as a member, and 3) the name of the member of
that class that is the container.  For example the tag of this member
    class CompoundType {
      const ObjList<BaseClass> bases;
    }
is
    <List_CompoundType_bases>  // Type system list 'class' convention

You might wonder why it is not something bottom up instead, which might
more naturally make for a re-usable tag, like this:
    <List_BaseClass>           // AST system list 'class' convention
We do not do this because two different classes can have a list of
BaseClasses that needs to be handled differently, so the
de-serialization depends on the containing context, which also
contains implicitly the information that it is a list of BaseClasses.

Each object has an _id attribute which is unique.  Note that the id is
basically the heap address prefixed by a letter code; the letter code
is to break symmetry for object with the same heap address, such as a
class with a first member that is embedded or a non-class such as a
FakeList node.  Neither the address nor the prefix has no semantics so
another program rendering out XML in this schema need not use the same
method, but any method for generating unique ids will work.  The
prefixes we use are "AST" for AST nodes and the rest are given at the
top of cc_type_xml.cc at the block of calls to the "identity" and
"identityTempl" macros.

Each container has a tag that wraps contained items in case the item
has been previously serialized and therefore may not appear.  They are
as follows.  Note that they do not correspond to objects constructed
internally and therefore do not have an '_id' field.  Their unique
metadata status is indicated by the leading underscore, similarly to
the '_id' field itself for objects.
    <_List_Item item=id>
    <_NameMap_Item name=<name> item=id>
