Efficient Navigation on large XML Structures

XML is great for platform-independent structured data exchange. JAXB is an incredibly handy piece of Java technology, which releases the developer from the cumbersome task of manually writing (de)serialization code and providing him regular Java POJOs as a programming interface. However, when it comes to navigation in large, highly cross-linked object graphs obtained from pure JAXB, you may face serious performance challenges.

This article will give some insights on how to solve these challenges in an elegant and yet efficient way. The Open Source project can be found on GitHub.

The Challenge

The model I use as a reference example in this article is the Vehicle Electric Container, which can be used to describe the complete electrical network of a modern car. To give you a feeling of what I am talking about when saying “large highly cross-linked object graphs”, here are some key numbers:

Structure:
- 300 different classes
- 200 individual cross references (containments not included)
Instances:
- > 300Mb of XML
- > 1.5 million objects
- > 2.3 million cross-links

If you want to navigate freely on the structure, then using JAXB out of the box is not enough. I will explain how this can be achieved without losing the advantages of JAXB, without writing clumsy code and without violating SOLID principles. But, first things first.

The Model

Example Section of the Model

The above diagram shows a small portion of the model which I will use to explain the challenges and, later, the solution. It represents the schematic of an electrical system - the left side (highlighted in blue) represents the electrical components, the right side (highlighted in green) the connectivity among them.

A required function on this model could be: Give me all ComponentNodes that are connected to ComponentPort X.

There are many possible ways to calculate the result, but the most straight-forward and intuitive way is:

Navigate to the ConnectionEnds referencing the ComponentPort X.
Navigate to the Connection of the found ConnectionEnds and find the other sides (ConnectionEnds).
Navigate from the other ConnectionEnds to the referenced ComponentPorts and from there up the hierarchy, over the ComponentConnector to the ComponentNode.
You have reached your result!

So far so good. The problem with this is, that in pure XML / JAXB most of the above navigations are simply not available. Why is this? Let’s have a look at the XML and Java JAXB representation.

The Schema

XML is not designed especially for object-oriented serialization / deserialization, it is a far more general concept. To use it for OO serialization, the OO concepts must be mapped on defined XML concepts. In our case, the transformation pattern from the UML Model to XML Schema is the following:

A class will be translated as xs:complexType.
An inheritance will be translated as xs:extension
All xs:complexType will have an xs:attribute with the name id and the type xs:ID as a technical primary key within the XML file. This is not included in the listing, as this attribute is already defined in the xs:extension base
An attribute will be translated as xs:element of the corresponding xs:complexType.
A composition will be translated as xs:element (nested classes / elements) with the aggregated class as type.
An association will be translated as xs:element with xs:IDREF(S) as type.

The resulting XML Schema definition for some classes in the above diagram can be found in the listing below (I omitted all details that are not relevant for the example):

Partial listing of the XML Schema

   <xs:complexType name="ComponentPort">
      <xs:complexContent>
         <xs:extension base="vec:ConfigurableElement">
            <xs:sequence>
               <xs:element name="Identification" type="xs:string"/>
               ...
            </xs:sequence>
         </xs:extension>
      </xs:complexContent>
   </xs:complexType>

   <xs:complexType name="ComponentConnector">
      <xs:complexContent>
         <xs:extension base="vec:ConfigurableElement">
            <xs:sequence>
               <xs:element name="Identification" type="xs:string"/>
               ...
               <xs:element name="ComponentPort"
                           type="vec:ComponentPort"
                           minOccurs="0"
                           maxOccurs="unbounded"/>
            </xs:sequence>
         </xs:extension>
      </xs:complexContent>
   </xs:complexType>

   <xs:complexType name="ConnectionEnd">
      <xs:complexContent>
         <xs:extension base="vec:ConfigurableElement">
            <xs:sequence>
               <xs:element name="Identification" type="xs:string"/>
               ...
               <xs:element name="ConnectedComponentPort" type="xs:IDREF"/>
            </xs:sequence>
         </xs:extension>
      </xs:complexContent>
   </xs:complexType>

The Java Classes

If you feed this schema into the “JAXB XML Schema to Java Compiler” (XJC), you will receive Java classes similar to the ones in the listing below.

There are some notable details in the generated classes you need to be aware of:

All classes (e.g. VecComponentPort) know nothing about incoming references. There is also no reference to the containing class (e.g. VecComponentConnector) nor is there a possibility for reverse navigation to referencing classes (e.g. VecConnectionEnd). Therefore, the desired navigation described above is blocked. There are two reasons for this:
1. JAXB classes can also be used for serialization. Bidirectional associations would redundantly store information, allow inconsistent data structures and would add more complexity to the serializer.
2. Due to the general purpose of XML, it is impossible for the XJC to determine incoming references for all XML use cases.
The Java representation of IDREF associations (e.g. VecConnectionEnd.connectedComponentPort) are defined with the base type Object, although the UML model defines it as VecComponentPort. The reason for this is, that the syntax defined by the XML Schema is less strict than the definition of the UML model. In XML Schema IDREF(S) references are not typed and can point to any element with a valid xs:ID. Therefore, it is impossible for XJC determine the correct type.

Listing of JAXB generated classes

First java class: VecComponentPort

   @XmlAccessorType(XmlAccessType.FIELD)
   @XmlType(name = "ComponentPort", propOrder = {
       "identification"
   })
   public class VecComponentPort
       extends VecConfigurableElement
       implements Serializable
   {
       @XmlElement(name = "Identification", required = true)
       protected String identification;

    public String getIdentification() {
        return identification;
    }

       public void setIdentification(String value) {
        this.identification = value;
       }
   }

Second java class: VecComponentConnector

    @XmlAccessorType(XmlAccessType.FIELD)
    @XmlType(name = "ComponentConnector", propOrder = {
        "identification",
        "componentPorts"
    })
    public class VecComponentConnector
        extends VecConfigurableElement
        implements Serializable
    {

        @XmlElement(name = "Identification", required = true)
        protected String identification;

        @XmlElement(name = "ComponentPort")
        protected List<VecComponentPort> componentPorts;

        public String getIdentification() {
            return identification;
        }

        public void setIdentification(String value) {
            this.identification = value;
        }

        public List<VecComponentPort> getComponentPorts() {
            if (componentPorts == null) {
            componentPorts = new ArrayList<VecComponentPort>();
            }
            return this.componentPorts;
        }
    }

Third java class: VecConnectionEnd

    @XmlAccessorType(XmlAccessType.FIELD)
    @XmlType(name = "ConnectionEnd", propOrder = {
        "identification",
        "connectedComponentPort"
    })
    public class VecConnectionEnd
        extends VecConfigurableElement
        implements Serializable
    {

        @XmlElement(name = "Identification", required = true)
        protected String identification;

        @XmlElement(name = "ConnectedComponentPort", required = true, type = Object.class)
        @XmlIDREF
        @XmlSchemaType(name = "IDREF")
        protected Object connectedComponentPort;

        public String getIdentification() {
            return identification;
        }

        public void setIdentification(String value) {
            this.identification = value;
        }

        public Object getConnectedComponentPort() {
            return connectedComponentPort;
        }

        public void setConnectedComponentPort(Object value) {
            this.connectedComponentPort = value;
        }
    }

Dead End Approaches

For an implementation with pure JAXB classes, I have seen two approaches which both have their own drawbacks. In order to create awareness for the problems arising from these, I will discuss the two approaches briefly.

The Pragmatic Approach

The pragmatic approach is: “If the direct way is blocked, I will go around!”. However, as in the real world, detours are normally longer.

In this approach, just the navigation from ComponentPort to the connected ComponentPorts would require:

Start at the root (ConnectionSpecification), walk through the hierarchy and iterate over all ConnectionEnds.
Check if the referenced ComponentPort is the same as the one where we started.
If that is the case, select the other ConnectionEnds. For this operation you will have to have remembered you current Connection as context, because backwards navigation in the hierarchy is not possible.
Navigate to the connected ComponentPorts.
If you now want to know the corresponding ComponentNodes (the transitive parents of the ComponentPorts), you must restart a full tree search, this time on the left side.

If this example does not speak for itself, here are some reasons why you should never do it as just described:

Performance: The sketched algorithm has a runtime complexity of O(n + m) for an operation that has an inherent complexity of O(1) or near to it (with n=#all ComponentPort, m=#all ConnectionEnd).
If you apply this strategy more than once and if you use your functions in larger use cases (e.g. nested navigations), you easily end up with polynominal or even exponential complexity. The result on large data structures will then be, that you will literally have to wait ages for results that could be calculated in the blink of an eye.
Stability: In contrast to a direct navigation, you will always need knowledge of the context of navigation targets. To navigate correctly, all your deep searches / model traversals require that you are aware of all possible locations of your navigation targets. Suppose the underlying model changes in a way that e.g. Connections can be contained in a second place, other than the ConnectionSpecification. Your navigation algorithm will break, and you will not even know it.
Maintainability: Your implemented functionality is far more complicated, than the actual problem would require (Simple navigation from A to B vs. nested loops with stored states etc.). As a result, your code is harder to understand, more error prone, more difficult to change and test, and all this for no good reason.

The Caching Approach

In this approach, you calculate all reverse navigations in advance and store the results somewhere for future use - let’s call this “somewhere” Association Cache. After deserialization of the XML file, you basically carry out a complete model traversal and place all relevant associations in a cache for reverse lookup.

This solves the performance issue, by replacing the continuous overhead for each operation by a singular overhead at startup time. However, the stability and maintainability issues remain. There still is a complex model traversal necessary, your code still requires knowledge about possible navigations and you are violating the Single Responsibility Principle.

The Single Responsibility Principle states, that a code unit should have a responsibility for a single functionality, and that responsibility should be exclusive. In conflict to this, the “navigation responsibility” in this approach is shared between the JAXB generated class (for forward directions) and the association cache for backwards directions.

Another drawback of this approach is, that the cache has to be some kind of singleton object with a scope restricted to the current model. This object must be made accessible for all locations where a navigation in backwards direction is necessary. This can lead to very clumsy code.

Solution Outline

Describing the solution in all details with all the tiny knits and knots would break the scope of this article. Therefore I will just cover the main concepts and you will have to fill out the blanks in between by yourself or contact me directly for a discussion.

A clean solution of the challenge can be defined by some basic requirements:

Navigations in the model should be executed with a runtime complexity that is equal to the inherent complexity of navigation. Simple navigations to the context element or backwards navigations of an association should be O(1).
Developers implementing functions on the model should not have to worry about differences between forward or backward navigations (same access pattern).
The implementation should be easy to maintain and robust to model changes.

The way to achieve all of this, is the most obvious and simplest solution you can think of: We just make all associations navigable bidirectionally, with almost no runtime impact!

You may know bidirectional associations in Hibernate. The approach to this solution is nearly the same. To get it up and running we have two building blocks:

We need to modify the JAXB classes in such a way, that they support bidirectional navigation.
We need to initialize the associations correctly.

JAXB Class Customization

To support bidirectional navigations, we want JAXB classes that look like the modified VecComponentPort in the listing below. As you can see, it defines a parentComponentConnector property for navigation to the containing context element, as well as a refConnectionEnd property for all ConnectionEnds that define references to this ComponentPort. Both are marked as @XmlTransient, as bidirectional navigations are not supported by JAXB itself and they will be ignored by the JAXB (Un)Marshaller.

Modified VecComponentPort.java

    @XmlAccessorType(XmlAccessType.FIELD)
    @XmlType(name = "ComponentPort", propOrder = {
        "identification",
    })
    public class VecComponentPort
        extends VecConfigurableElement
        implements Serializable
    {

        @XmlElement(name = "Identification", required = true)
        protected String identification;

        @XmlTransient
        private VecComponentConnector parentComponentConnector;

        @XmlTransient
        private Set<VecConnectionEnd> refConnectionEnd = new HashSet<VecConnectionEnd>();

        public String getIdentification() {
            return identification;
        }

        public void setIdentification(String value) {
            this.identification = value;
        }

        public VecComponentConnector getParentComponentConnector() {
            return parentComponentConnector;
        }

        public Set<VecConnectionEnd> getRefConnectionEnd() {
            return refConnectionEnd;
        }
    }

But how do we achieve these modifications? As always, there are many ways. You could modify all classes manually, but this is a cumbersome task and modifying generated classes is an absolute NO GO!

Luckily, XJC provides a plugin mechanism that allows the execution of custom plugins during the generation process. An XJC plugin has access to the code model of the generated classes and can modify them. An example for a simple plugin can be found here.

The following code snippet can create a private field, with an optional initialization value and a corresponding public getter method in an existing class:

JCodeModel Example

    public JFieldVar build() {
        final JFieldVar field = targetClass.field(JMod.PRIVATE, baseType, name);
        if (init != null) {
            field.init(init);
        }

        field.annotate(codeModel.ref(XmlTransient.class));

        final JMethod getter = targetClass.method(JMod.PUBLIC, baseType, getterName);
        getter.body()
                ._return(JExpr.ref(field.name()));

        if (getterJavadoc != null) {
            getter.javadoc()
                    .addAll(getterJavadoc);
        }

        return field;
    }

Additionally, XJC plugins can be parameterized through JAXB bindings. With a JAXB Binding Customization, you can provide specific parameters for certain schema elements. This works for both JAXB standard functionality and XJC plugins. So, we implemented a JAXB plugin, that takes the information on references from the binding and modifies the generated code accordingly.

A part of the resulting binding is listed below. It contains two customizations:

<nav:parent/> defines, that a parent association should be created for this class, with the given type and name.
<nav:backref/> defines, that a reverse navigation should be created in the target class, with the given name.

As you may notice, this binding also fixes the type-safety issue of IDREF associations.

Custom JAXB Binding

   <jxb:bindings node="//xs:complexType[@name='ComponentPort']">
      <nav:parent name="parentComponentConnector" type="VecComponentConnector"/>
   </jxb:bindings>
   <jxb:bindings node="//xs:complexType[@name='ConnectionEnd']">
      <nav:parent name="parentConnection" type="VecConnection"/>
      <jxb:bindings node=".//xs:element[@name='ConnectedComponentPort']">
         <nav:backref name="refConnectionEnd"/>
         <jxb:property>
            <jxb:baseType name="de.foursoft.harness.somepackage.VecComponentPort"/>
         </jxb:property>
      </jxb:bindings>
   </jxb:bindings>

To improve the maintainability of this approach, the custom JAXB binding is not created manually. As mentioned in the beginning, the XML schema representation is not as concise as the UML model. Luckily, in our case, the XML schema is generated from an UML model that contains all the information we require. Therefore, we use the UML model to generate the JAXB binding via some powerful XSLT-scripts as well. Through this approach, it is guaranteed that the JAXB binding is always consistent to the XML schema and no elements are missed. As a side effect, we save the time required for writing and adapting it manually.

Model Initialization

With the first step we now have JAXB classes that fit our needs. However, we still need to initialize them correctly, as our extensions are custom and JAXB will ignore them. For the initialization of our extensions, we use the JAXB functionality of the Unmarshaller.Listener interface. An implementation of this interface will be notified about every unmarshalled object and its corresponding parent in the XML tree.

The advantage of using this mechanism is, that we do not have to implement our own model traversal and it is ensured that we are notified about every single instance, regardless of the enclosing context. Nevertheless, there are two topics to deal with:

The JAXB Unmarshaller is only able to handle one single Listener. For design reasons we decided to have a specific listener instance per Class. Therefore, the Listener interface is implemented by a ModelPostProcessorManager which delegates the notifications to the different ModelPostProcessor instances.
The initialization requires two phases:
1. The first phase is controlled by JAXB and we use the Listener.afterUnmarshall() events to collect all relevant objects for later initialization.
  The actual initialization cannot occur at this moment, as JAXB has not yet initialized the IDREF associations at this stage.
2. Once the unmarshalling process by JAXB is completed, we start the second phase for actual initialization. This is achieved by wrapping the Unmarshaller with the ExtendedUnmarshaller. The ExtendedUnmarshaller is responsible for the correct configuration of the Unmarshaller and it triggers the second phase by calling the ModelPostProcessorManager.doPostProcessing().

Model Initialization Draft

The initialization of the associations is carried out by a ReflectiveAssociationPostProcessor. This implementation uses reflection to gather the necessary information and to initialize the associations accordingly. To provide it with all required information, two annotations have been introduced that are added by the custom XJC plugin to store this information: @XmlParent and @XmlBackreference.

@XmlParent is used to mark a field in a class for initialization with the corresponding parent object. @XmlBackreference marks an IDREF association as bidirectional and defines the name of the reverse navigation property in target class.

Annotated VecConnectionEnd.java

    @XmlAccessorType(XmlAccessType.FIELD)
    @XmlType(name = "ConnectionEnd", propOrder = {
        "identification",
        "connectedComponentPort"
    })
    public class VecConnectionEnd
        extends VecConfigurableElement
        implements Serializable
    {
        @XmlElement(name = "Identification", required = true)
        protected String identification;

        @XmlElement(name = "ConnectedComponentPort", required = true, type = Object.class)
        @XmlIDREF
        @XmlSchemaType(name = "IDREF")
        @XmlBackReference(destinationField = "refConnectionEnd")
        protected VecComponentPort connectedComponentPort;
        @XmlTransient
        @XmlParent
        private VecConnection parentConnection;

        // Rest of the class is omitted.
        ....
    }

More Thoughts on This

I think the outlined solution speaks for itself, but keep in mind that this is just a short summary. Especially the concrete implementation of the model initialization should be carried out with extra caution, as otherwise you will have numerous opportunities to create new performance black holes.

When implemented correctly (without excessive performance optimizations), this solution adds about 10% of additional overhead to the pure JAXB unmarshalling time. For this, you receive a comfortable, intuitive navigation between JAXB objects with a constant runtime complexity.

However, that is not the end of the road! What happens when you want to navigate into the model, e.g. look-up objects by ID or search for specific criteria such as “find all objects of a specific class”? For now, I will leave this to your own creativity.

Johannes Becker

Managing Consultant

+49 89 5307 44-523

becker@4soft.de

Wie sind Ihre Erfahrungen mit großen XML Strukturen? Haben Sie andere Lösungsansätze? Ich freue mich über einen regen Austausch - kontaktieren Sie mich gern.