XML viewing and diffing

Another simple XML tool, and a neat recursive diffing algorithm. (Part of my growing toolkit of XML tools.)

A tool that’s helpful when dealing with large numbers of XML data files is a simple, user-friendly XML viewer.  Most browsers can view raw XML in their window.  They format the XML according to its structure by indenting nested elements and colourising element names, attributes, comments, and content.  They also provide buttons to dynamically open and close longer elements to hide their contents. Unfortunately they are generally pretty ugly.

It turns out that a little bit of XSLT can be used to transform any XML document into a convenient HTML representation, as used in the browsers.  (It’s possible that’s what the browsers use internally, though I would expect them to use a more direct translation from the XML input to browser elements.)  The initial incentive to writing one myself was that I could pick the format of the resulting file.  But other benefits materialised along the way.

Rendering XML in HTML

The basic program has templates to match each node kind in XML. Text nodes are handled by the default stylesheet rules, which simply copy them to the output. The template for comments renders the comment in a specific font, delimited by the XML syntax for comment beginnings and ends.

<xsl:template match="comment()">
        <span class="text">&lt;!--</span> <span class="comment"><xsl:value-of select="." /></span> <span class="text">--&gt;</span>

The element template renders a syntax-highlighted start tag for the element and its attributes, calls apply-templates on the node’s children, and renders an end-tag. Two forms are rendered: a short form, hidden by default, and a long form. A JavaScript action attached to the name in the start tag is used to hide one and unhide the other. There is some special casing for elements: an empty element is rendered as a single closed tag, and an element with a single piece of text content is rendered on a single line in long form only. The combined code is a bit complicated, and isn’t helped by XSLT’s own verbosity:

<xsl:template match="*">
    <!-- First render the content into a variable. -->
    <xsl:variable name="content">
        <xsl:apply-templates select="node()|comment()" />
    <!-- If there are any child elements, then it is closable, so render a short form. -->
    <xsl:if test="./*">
        <div class="short">
            <!-- Start tag. -->
            <xsl:call-template name="start-tag">
                <xsl:with-param name="class" select="'short-name'" />
            <!-- Indicator of hidden content. -->
            <!-- End tag. -->
            <span class="tag">&lt;/<xsl:value-of select="name()" />&gt;</span>
    <!-- Render a long form. -->
    <div class="long">
        <!-- Start tag; will be complete closed tag if there is no content. -->
        <xsl:call-template name="start-tag">
        <!-- But if there is content... -->
        <xsl:if test="./* or text() or comment()">
            <!-- If it's only a text element then render it inside the long form. -->
            <xsl:if test="normalize-space(text()) != ''">
                <span class="value"><xsl:value-of select="text()" /></span>
            <!-- But if there is element or comment data, then put the full rendered content, indented. -->
            <xsl:if test="./* or comment()">
                <div class="indent">
                    <xsl:copy-of select="$content" />
            <!-- Close tag. -->
            <span class="tag">&lt;/<xsl:value-of select="name()" />&gt;</span>

<xsl:template name="start-tag">
    <xsl:param name="class" select="'long-name'" />
    <span class="tag">&lt;<span>
        <!-- Set the CSS class of the node name's span. -->
        <xsl:if test="./*">
            <xsl:attribute name="class"><xsl:value-of select="$class" /></xsl:attribute>
        <xsl:value-of select="name()" />
        <!-- Render the attributes. -->
        <xsl:for-each select="@*">
            <xsl:text> </xsl:text>
            <span class="attr">
                <xsl:value-of select="name()" />
            </span>=&quot;<span class="text">
                <xsl:value-of select="." />
        <!-- If no content, close the tag. -->
        <xsl:if test="not(./*) and not(text()) and not(comment())"> /</xsl:if>&gt;</span>

The XML files we use tend to contain repeated, complex elements.  For example, a single file might contain a trade, including a block of elements describing the flows on it, each with specific dates, numerical amounts, and other parameters.  The problem is that in XML the data is extremely verbose: a whole screenful might suffice to describe only a single flow.  The half-dozen fields of interest are buried in sea of angle brackets of ridiculously camel-case long element names.

Luckily, it turns out to be simple to add more specific template rules to render a summary table instead of the usual XML verbosity for these repeated elements.  Each row initially shows only the most salient fields from the element, but can be dynamically expanded to show the long form of the original XML source for it. The content within the long form is itself rendered using the normal templates, and can be opened and closed dynamically.

Categories in the EJRH WordPress XML dump

This shows part of the WordPress XML dump of this blog. Categories are presented as a summarised table, and one of them has been expanded to show its XML source.

Comparing two XML files

There is a constant swarm of XML files, and there will be many versions of the XML for a given object. When something unexpected has happened, it’s often useful to compare two of these versions to pinpoint the precise elements that have changed. There are two levels of comparison work to be done. Automatated comparison involves taking two sets of XML files and summarising the differences between them. Doing this usefully requires some knowledge of which elements are expected to have changed, and how the remaining element differences should be summarised.

Manual comparison is more ad-hoc, but more specific: take two XML files, and browse over the differences between them. This can be done in a normal text file merge tool. But merge tools aren’t optimised for the XML case: they will throw up spurious differences with differing whitespace, and will completely ignore the structure of the two files. Such structure is extremely useful both to the tool that is attempting to match parts of the documents, and to a human who’s trying to understand the parts that don’t match.

In a further use of XSLT, it’s possible to write a stylesheet that will take two XML files as input, and output a merged form of them. Nodes that are common to both are copied normally, but nodes that appear only in one side or the other are output with marks that indicate which side they come from. The viewing stylesheet can use these marks to render the differing nodes specially. In our case, nodes only in the first document (the left side) are marked red to indicate they were deleted, and nodes only in the second document are marked green to indicate they were added.

The recursive structure of XML makes this a bit different from normal text diffs. In some ways it is easier, because if there is a difference deep within an element that appears common to both sides, the job of deciding whether it is an addition or deletion is contained only to that element. In diffing, it’s often easy to detect the start of a difference, but harder to find the end of it. The documents are no longer in sync when the difference is encountered, and heuristics are needed to locate a later point where they appear to be back in sync. With XML, that search can be safely confined to the single parent element.

It’s also a bit more complicated, because there are several possible cases when comparing nodes of both side.

  1. Nodes from L and R could be exactly the same.
  2. The node from L could have no corresponding node in R, because R is at the end of the parent element. (Or vice-versa.)
  3. Nodes from L and R could have different element names, and hence are different.
  4. Nodes from L and R could be the same superficially (e.g. have the same name), but different contents at some deeper level.

Cases one and two correspond to similar cases for text diffing. A line is obviously the same, or obviously an addition or removal because there are no more lines on the other side.

But in XML, elements can be the same at the top level, but different internally. The stylesheet computes a signature for each element, by default using only its kind (element, text, comment) and name (or text/comment contents). If the signatures are different, then it’s case 3 and one of the sides is an addition or deletion. But if the signatures are the same but the contents are not, then it recursively merges the contents of each element. A single element is output, but marked as changed (but not deleted or inserted). This indicates that somewhere inside it, one of the child elements must be a deletion or insertion.

Deletions and additions are themselves a bit tricky in case 3. The node from L differs from the one from R. But was L’s node deleted, or was R’s added, or both? If we choose incorrectly, it’s possible that the algorithm won’t be able to detect the end of the change and synchronise the two sides again. What should be a simple single-node change will be output as a large, unnecessarily complicated difference.

There is a heuristic to pick which side to process. Each side has a match point: this is the distance to the nearest similar element (with the same signature in the other side). The side with the highest match point is assumed to be the extra, unmatchable node: it is output as an addition or deletion (depending on its side). The merge is then continued with the successor. The theory is that its opponent has a lower match point, meaning that it will soon be merged against one of the following nodes and is probably not a difference after all.

There was one interesting bug with this algorithm. In two documents of the form:

<!-- L -->     |    <!-- R -->
<element1/>    |    <element3/>
<element2/>    |    <element4/>
<element3/>    |    <element5/>
<element4/>    |
<element5/>    |

The match point of L’s element1 was 999999 (corresponding to no possible match in R), so it was output as a deletion. But the match point of its successor, WHITESPACE, was 2, while the match point of R’s element3 was 4. Since element3 had a higher match point, it was output as an addition, even though it would eventually have merged perfectly with a later element in L. Subsequently, the two WHITESPACEs merged, then it began again with element2. The result was that all of L was output as an addition, even though it appears whole inside R.

Two ways of diffing two XML files; both are correct, but one is much easlier to read

This was fixed by outputing not just the single node from the side with the higher match point, but a sequence of nodes with length equal to one less than the opposing side’s match point. In the above example, L’s WHITESPACE has a match point of 999999, but the match point of R’s element3 is 5, so four nodes of L are output as additions. This brings both sides into sync and the remaining elements merge properly.

Making it useable

The viewer stylesheet creates self-contained HTML files with embedded JS and CSS. I’m not particularly talented with either, so it’s not perfect, but it’s useable.

A simple Python program uses libxml2 to parse the input files and apply the stylesheets. In Windows, this can be set up as a Send To option in the explorer context menu.

If the option -m is specified, the two inputs will be merged first. This can also be Send To option; if two items are selected when this option is chosen, Windows is smart enough to use the same program invocation for both.

It’s not lightning fast. Processing a large XML file can take a few seconds, including the cost of starting up Python and loading libxml2. It’s hard to estimate the cost of applying an XSLT stylesheet, but in general it seems to scale reasonably well over a variety of moderately large files. The merging algorithm is theoretically of O(n2) complexity, because each node may need to be compared against each node on the other side. But it’s still useable.

(I’ll put the stylesheets and Python scripts into my public repository in the near future.)

This entry was posted in Programming and tagged , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s