
The XSLT C library for Gnome explained

How does it work?

Introduction

This document describes the processing of libxslt, the XSLT C library developed for the Gnome project.

Basics

XSLT is a transformation language. Given an input document and a stylesheet document, it generates an output document:

[Figure: the XSLT processing model]

Libxslt is written in C. It relies on libxml for the following operations:

Keep it simple stupid

Libxslt is not very specialized. It is built under the assumption that all nodes from the source and output documents can fit in the virtual memory of the system. This is a big trade-off: it is fine for reasonably sized documents but may not be suitable for large sets of data. The gain is that it can be used in a relatively versatile way, since the input or output may never need to be serialized; the cost is that the size of the documents it can handle is limited by the amount of memory available.

More specialized memory handling approaches are possible, like building the input tree progressively from a serialization as it is consumed, factoring repetitive patterns, or even generating the output on the fly as the input is parsed, but these are possible only for a limited subset of stylesheets. In general the implementation of libxslt follows this pattern:

The result is not that bad. Clearly one can do a better job, but only by being more specialized too. Most optimizations, like building the tree on demand, would need serious changes to the libxml XPath framework. An easy step would be to serialize the output directly (or to call a set of SAX-like output handlers to keep the interface flexible) and hence avoid the memory consumed by the result tree.

The libxml nodes

DOM-like trees, as used and generated by libxml and libxslt, are relatively complex. Most node types follow the structure below, with a few variations depending on the node type:

[Figure: description of a libxml node]

Nodes carry a name, and the node type indicates the kind of node it represents. The most common ones are:

For XSLT processing, entity nodes should not be generated (i.e. they should be replaced by their content). Most nodes also contain the following "navigation" information:

Element nodes carry the list of attributes in properties. An attribute itself holds the navigation pointers and the children list (the attribute value is not represented as a simple string, to allow the use of entity references).

The ns field points to the namespace declaration for the namespace associated with the node; nsDef is the linked list of namespace declarations present on element nodes.

Most nodes also carry a _private pointer which can be used by the application to hold specific data on this node.
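To make the navigation pointers a bit more concrete, here is a minimal sketch in C of a simplified node. The Sk* names are invented for this illustration and this is not the real libxml structure, which carries more fields (properties, ns and nsDef among them). The link names children, last, parent, next and prev are meant to mirror the libxml ones, but treat the exact layout as an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified node type; not the libxml definition. */
typedef enum { SK_ELEMENT, SK_TEXT, SK_ATTRIBUTE } SkType;

typedef struct SkNode {
    void          *_private;  /* free for application specific data */
    SkType         type;      /* kind of node */
    const char    *name;      /* node name */
    struct SkNode *children;  /* first child */
    struct SkNode *last;      /* last child */
    struct SkNode *parent;    /* parent node */
    struct SkNode *next;      /* next sibling */
    struct SkNode *prev;      /* previous sibling */
} SkNode;

/* Append child at the end of parent's children list, fixing all links. */
void sk_add_child(SkNode *parent, SkNode *child)
{
    assert(parent != NULL && child != NULL);
    child->parent = parent;
    child->next = NULL;
    child->prev = parent->last;
    if (parent->last != NULL)
        parent->last->next = child;
    else
        parent->children = child;
    parent->last = child;
}

/* Count the nodes of the subtree rooted at node, node included. */
size_t sk_count(const SkNode *node)
{
    size_t n = 1;
    const SkNode *cur;

    assert(node != NULL);
    for (cur = node->children; cur != NULL; cur = cur->next)
        n += sk_count(cur);
    return n;
}
```

Keeping both children and last makes appending a child a constant time operation, and the prev/next pair allows walking siblings in both directions, which is the kind of navigation the XPath axes need.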

The XSLT processing steps

Basically there are a few steps which are clearly decoupled at the interface level:

  1. parse the stylesheet and generate a DOM tree
  2. take the stylesheet tree and build a compiled version of it (this is the compilation phase)
  3. parse the input and generate a DOM tree
  4. process the stylesheet against the input tree and generate an output tree
  5. serialize the output tree

A few things should be noted here:

The XSLT stylesheet compilation

This is the second step described above. It takes a stylesheet tree and "compiles" it: basically, it associates with each node a structure, stored in the _private field, containing information precomputed from the stylesheet:

[Figure: a compiled XSLT stylesheet]

One xsltStylesheet structure is generated per document parsed for the stylesheet. XSLT documents allow includes and imports of other documents. Imports are stored in the imports list (hence keeping the tree hierarchy of includes, which is very important for a proper XSLT processing model) and includes are stored in the doclist list. An imported stylesheet has a parent link to allow browsing the tree.

The DOM tree associated with the document is stored in doc. It is preprocessed to remove ignorable empty nodes, and all the nodes in the XSLT namespace are subject to precomputation. This usually consists of extracting all the context information from the context tree (attributes, namespaces, XPath expressions) and storing it in an xsltStylePreComp structure attached to the _private field of the node.

A couple of notable exceptions to this are XSLT template nodes (more on this later) and attribute value templates: if they are actually templates, the value cannot be computed at compilation time (some preprocessing could be done, like isolating and preparsing the XPath subexpressions, but it is not done yet).

The xsltStylePreComp structure can also store the precompiled form of an XPath expression associated with an XSLT element (more on this later).

The XSLT template compilation

Proper handling of template lookup is one of the keys to fast XSLT processing (given a node in the source document, this is the process of finding which templates should be applied to this node). Libxslt follows the hint suggested in section 5.2 Patterns of the XSLT Recommendation, i.e. it doesn't evaluate a pattern as an XPath expression but tokenizes it and compiles it into a set of rules to be evaluated on a candidate node. There is usually an indication of the node name in the last step of this evaluation, and this is used as a key check for the match. As a result libxslt builds a relatively complex set of structures for the templates:

[Figure: the templates related structure]

Let's describe a bit more closely what is built. First the xsltStylesheet structure holds a pointer to the template hash table. All the XSLT patterns compiled in this stylesheet are indexed by the value of the target element (or attribute, pi ...) name, so when an element or an attribute "foo" needs to be processed the lookup is done using the name as a key.

Each pattern is compiled into an xsltCompMatch structure. It holds the set of rules based on the tokenization of the pattern, stored in reverse order (matching is easier this way). It also holds some information about the previous matches, used to speed up the process when one iterates over a set of siblings (this optimization may be defeated by thrashing when running threaded computation; it's unclear that this is a big deal in practice). Predicate expressions are not compiled at this stage. They may be compiled at run-time if needed, but in this case they are compiled as full XPath expressions (the use of some fixed predicates can probably be optimized, but they are not yet).
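Why the reverse order makes matching easier can be seen on a toy version. The sketch below is an assumption-laden simplification, not libxslt code: the Sk* names are invented, only relative patterns made of child steps such as "chapter/title" are handled, and there are no predicates, no axes and no caching of previous matches. The steps are stored last step first, so matching starts at the candidate node and simply climbs the parent links.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_MAX_STEPS 8
#define SK_MAX_NAME  32

/* Hypothetical minimal node: a name and a parent link are enough here. */
typedef struct SkNode {
    const char    *name;
    struct SkNode *parent;
} SkNode;

/* Hypothetical compiled pattern: steps[0] is the LAST step of the pattern. */
typedef struct {
    int  nsteps;
    char steps[SK_MAX_STEPS][SK_MAX_NAME];
} SkCompMatch;

/* Compile "a/b/c" into the steps "c", "b", "a".
 * Returns 0 on success, -1 on an unsupported or malformed pattern. */
int sk_compile(const char *pattern, SkCompMatch *comp)
{
    const char *end;

    assert(pattern != NULL && comp != NULL);
    comp->nsteps = 0;
    if (pattern[0] == '/')          /* absolute patterns are not handled */
        return -1;
    end = pattern + strlen(pattern);
    while (end > pattern) {
        const char *start = end;
        size_t len;

        /* scan backward to the beginning of the current step */
        while (start > pattern && start[-1] != '/')
            start--;
        len = (size_t)(end - start);
        if (len == 0 || len >= SK_MAX_NAME || comp->nsteps == SK_MAX_STEPS)
            return -1;
        memcpy(comp->steps[comp->nsteps], start, len);
        comp->steps[comp->nsteps][len] = '\0';
        comp->nsteps++;
        /* skip the '/' separator, if any */
        end = (start > pattern) ? start - 1 : start;
    }
    return comp->nsteps > 0 ? 0 : -1;
}

/* Return 1 if node satisfies every step, climbing from node to its ancestors. */
int sk_match(const SkCompMatch *comp, const SkNode *node)
{
    int i;

    assert(comp != NULL);
    for (i = 0; i < comp->nsteps; i++) {
        if (node == NULL || strcmp(node->name, comp->steps[i]) != 0)
            return 0;
        node = node->parent;
    }
    return 1;
}
```

Since the first stored step is the one carrying the name of the candidate node, a mismatch is usually detected on the very first comparison, and that same name is what serves as the hash key.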

The xsltCompMatch structures are then stored in the hash table. The clash list is itself sorted by template priority to implement the XSLT priority rules "naturally".

Associated with the compiled pattern is the xsltTemplate itself, containing the information actually required for processing the pattern, including of course a pointer to the list of elements used for building the pattern result.

Last but not least, a number of patterns do not fit in the hash table because they are not associated with a name. This is the case for patterns applying to the root, any element, any attribute, text nodes, pi nodes, keys, etc. Those are stored independently in the stylesheet structure as separate linked lists of xsltCompMatch.
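The lookup scheme described in this section can be sketched roughly as follows. Everything here is a simplified assumption: the Sk* names, the fixed number of buckets and the hash function are invented for the illustration, a single nameless list stands in for the several separate lists, import precedence is ignored, and the full pattern match on the candidate node is reduced to a name comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_BUCKETS 16

/* Hypothetical template entry; label stands in for the template content. */
typedef struct SkTemplate {
    const char        *name;      /* name tested by the last step, or NULL */
    double             priority;  /* XSLT priority of the template */
    const char        *label;
    struct SkTemplate *next;      /* next entry in the same list */
} SkTemplate;

typedef struct {
    SkTemplate *buckets[SK_BUCKETS]; /* named patterns, hashed by name */
    SkTemplate *any_element;         /* patterns without a name key */
} SkStylesheet;

static unsigned sk_hash(const char *s)
{
    unsigned h = 5381;

    while (*s != '\0')
        h = h * 33u + (unsigned char)*s++;
    return h % SK_BUCKETS;
}

/* Keep the list sorted by decreasing priority; on equal priority the
 * entry added last comes first. */
static void sk_insert_sorted(SkTemplate **list, SkTemplate *t)
{
    while (*list != NULL && (*list)->priority > t->priority)
        list = &(*list)->next;
    t->next = *list;
    *list = t;
}

void sk_add_template(SkStylesheet *st, SkTemplate *t)
{
    assert(st != NULL && t != NULL);
    if (t->name == NULL)
        sk_insert_sorted(&st->any_element, t);
    else
        sk_insert_sorted(&st->buckets[sk_hash(t->name)], t);
}

/* Find the best template for an element called name, or NULL if none. */
const SkTemplate *sk_lookup(const SkStylesheet *st, const char *name)
{
    const SkTemplate *best = NULL;
    const SkTemplate *cur;

    assert(st != NULL && name != NULL);
    /* the list is sorted, so the first name match has the best priority */
    for (cur = st->buckets[sk_hash(name)]; cur != NULL; cur = cur->next) {
        if (strcmp(cur->name, name) == 0) {
            best = cur;
            break;
        }
    }
    /* then check whether a nameless pattern has a higher priority */
    if (st->any_element != NULL &&
        (best == NULL || st->any_element->priority > best->priority))
        best = st->any_element;
    return best;
}
```

Because each list is kept sorted at insertion time, the lookup never has to compare priorities within a list: the first entry that matches is the right one, and only the heads of the nameless lists need an extra check.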

The processing itself

The processing is actually defined by the XSLT specification (the basics of the algorithm are explained in its Introduction section). Basically it works by taking the root of the input document and applying the following algorithm:

  1. find the template applying to it; basically this is a lookup in the template hash table, walking the hash list until the node satisfies all the steps of the pattern, then checking the appropriate global template lists to see if there isn't a higher priority rule to apply
  2. if there is no template, apply the default rule (recurse on the children)
  3. else walk the content list of the selected template, for each of them:

    the closure is usually done through the XSLT apply-templates construct, recursing by applying the adequate template to the input node's children or to the result of an associated XPath selection lookup

Note that large parts of the input tree may not be processed by a given stylesheet and that, conversely, some may be processed multiple times (often the case when a table of contents is built).

The module transform.c implements most of this logic. xsltApplyStylesheet() is the entry point; it allocates an xsltTransformContext containing the following:

Then a new document gets allocated (HTML or XML depending on the type of output), and the user parameters and global variables and parameters are evaluated. Then xsltProcessOneNode(), which implements the 1-2-3 algorithm, is called on the root element of the input. Step 1 is implemented by calling xsltGetTemplate(), step 2 by xsltDefaultProcessOneNode() and step 3 by xsltApplyOneTemplate().
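The shape of the 1-2-3 algorithm can be shown on a toy model. This is only a sketch under heavy assumptions and none of it is the libxslt code: the Sk* names are invented, templates are plain C callbacks selected by element name with a linear search, the output is a fixed size text buffer instead of a tree, and the default rule for text nodes (copying the text, as the XSLT built-in rules do) is hard-coded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical input node: name is NULL for text nodes. */
typedef struct SkNode {
    const char    *name;
    const char    *text;      /* content, for text nodes only */
    struct SkNode *children;
    struct SkNode *next;
} SkNode;

typedef struct SkContext SkContext;
typedef void (*SkBody)(SkContext *ctxt, const SkNode *node);

typedef struct {
    const char *match;  /* element name the template applies to */
    SkBody      body;   /* stands in for the template content list */
} SkTemplate;

struct SkContext {
    const SkTemplate *templates;
    size_t            ntemplates;
    char              out[256];   /* toy output: a flat string */
    size_t            len;
};

void sk_emit(SkContext *ctxt, const char *s)
{
    size_t n = strlen(s);

    assert(ctxt->len + n < sizeof ctxt->out);  /* sketch: no growth */
    memcpy(ctxt->out + ctxt->len, s, n + 1);
    ctxt->len += n;
}

void sk_process_one_node(SkContext *ctxt, const SkNode *node);

/* The closure: process each child of node in turn. */
void sk_apply_templates(SkContext *ctxt, const SkNode *node)
{
    const SkNode *cur;

    for (cur = node->children; cur != NULL; cur = cur->next)
        sk_process_one_node(ctxt, cur);
}

void sk_process_one_node(SkContext *ctxt, const SkNode *node)
{
    size_t i;

    assert(ctxt != NULL && node != NULL);
    if (node->name == NULL) {           /* default rule for text: copy it */
        sk_emit(ctxt, node->text);
        return;
    }
    for (i = 0; i < ctxt->ntemplates; i++) {            /* step 1 */
        if (strcmp(ctxt->templates[i].match, node->name) == 0) {
            ctxt->templates[i].body(ctxt, node);        /* step 3 */
            return;
        }
    }
    sk_apply_templates(ctxt, node);                     /* step 2 */
}

/* Example template content: bracket whatever the children produce. */
void sk_title_body(SkContext *ctxt, const SkNode *node)
{
    sk_emit(ctxt, "[");
    sk_apply_templates(ctxt, node);
    sk_emit(ctxt, "]");
}
```

With a single template registered for "title" and sk_title_body as its content, processing a doc element holding a title and a para wraps the title text in brackets and copies the para text through the default rule. Any subtree that no template recurses into is simply never visited.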

XPath expression compilation

The XPath support is actually implemented in the libxml module (where it is reused by the XPointer implementation). XPath is a relatively classic expression language; the only uncommon feature is that it works on XML trees and hence has specific syntax and types to handle them.

XPath expressions are compiled using xmlXPathCompile(). It takes an expression string as input and generates a structure containing the parsed expression tree. For example the expression:

/doc/chapter[title='Introduction']

will be compiled as

Compiled Expression : 10 elements
  SORT
    COLLECT  'child' 'name' 'node' chapter
      COLLECT  'child' 'name' 'node' doc
        ROOT
      PREDICATE
        SORT
          EQUAL =
            COLLECT  'child' 'name' 'node' title
              NODE
            ELEM Object is a string : Introduction
              COLLECT  'child' 'name' 'node' title
                NODE

This can be tested using the testXPath command (in the libxml codebase) with the --tree option.

Again, the KISS approach is used: no optimization is done. This could be an interesting thing to add (Michael Kay describes a lot of possible and interesting optimizations done in Saxon which would be possible at this level), but I'm unsure they would provide much gain, since expressions tend to be relatively simple in general and stylesheets are still written by hand. Optimizations at the interpretation level sound likely to be more effective.

XPath interpretation

@@

The stack frame

@@

Extension support

@@

Further reading

@@

TODOs

@@

Daniel Veillard
