libxslt/doc/internals.html

<html>
<head>
  <title>The XSLT C library for Gnome</title>
  <meta name="GENERATOR" content="amaya V4.1">
  <meta http-equiv="Content-Type" content="text/html">
</head>

<body bgcolor="#ffffff">
<p><a href="http://www.gnome.org/"><img src="smallfootonly.gif"
alt="Gnome Logo"></a><a href="http://www.redhat.com"><img src="redhat.gif"
alt="Red Hat Logo"></a></p>

<h1 align="center">The XSLT C library for Gnome explained</h1>

<h1 style="text-align: center">How does it work?</h1>

<p></p>
<ul>
  <li><a href="#Introducti">Introduction</a></li>
  <li><a href="#Basics">Basics</a></li>
  <li><a href="#Keep">Keep it simple stupid</a></li>
  <li><a href="#libxml">The libxml nodes</a></li>
  <li><a href="#XSLT">The XSLT processing steps</a></li>
  <li><a href="#XSLT1">The XSLT stylesheet compilation</a></li>
  <li><a href="#XSLT2">The XSLT template compilation</a></li>
  <li><a href="#processing">The processing itself</a></li>
  <li><a href="#XPath">XPath expression compilation</a></li>
  <li><a href="#XPath1">XPath expression interpretation</a></li>
  <li><a href="#stack">The stack frame</a></li>
  <li><a href="#Extension">Extension support</a></li>
  <li><a href="#Futher">Further reading</a></li>
  <li><a href="#TODOs">TODOs</a></li>
</ul>

<h2><a name="Introducti">Introduction</a></h2>

<p>This document describes the processing of <a
href="http://xmlsoft.org/XSLT/">libxslt</a>, the <a
href="http://www.w3.org/TR/xslt">XSLT</a> C library developed for the <a
href="http://www.gnome.org/">Gnome</a> project.</p>

<h2><a name="Basics">Basics</a></h2>

<p>XSLT is a transformation language. Given an input document and a
stylesheet document, it generates an output document:</p>

<p align="center"><img src="processing.gif"
alt="the XSLT processing model"></p>

<p>Libxslt is written in C. It relies on libxml for the following
operations:</p>
<ul>
  <li>parsing files</li>
  <li>building the in-memory DOM structure associated with the documents
  handled</li>
  <li>the XPath implementation</li>
  <li>serializing the result document back to XML or HTML (text output is
    handled directly)</li>
</ul>

<h2><a name="Keep">Keep it simple stupid</a></h2>

<p>Libxslt is not very specialized. It is built under the assumption that all
nodes from the source and output documents can fit in the virtual memory of
the system. This is a significant trade-off: it is fine for reasonably sized
documents but may not be suitable for large sets of data. The gain is that the
library can be used in a relatively versatile way, since the input or output
may never be serialized; the cost is that the size of the documents it can
handle is limited by the amount of memory available.</p>

<p>More specialized memory handling approaches are possible, such as building
the input tree progressively from a serialization as it is consumed, factoring
repetitive patterns, or even generating the output on the fly as the input is
parsed, but this is possible only for a limited subset of stylesheets. In
general the implementation of libxslt follows this pattern:</p>
<ul>
  <li>KISS (keep it simple stupid)</li>
  <li>when there is a clear bottleneck optimize on top of this simple
    framework and refine only as much as is needed to reach the expected
    result</li>
</ul>

<p>The result is not that bad. One can clearly do a better job, but only with
a more specialized design. Most optimizations, like building the tree on
demand, would need serious changes to the libxml XPath framework. An easier
step would be to serialize the output directly (or to call a set of SAX-like
output handlers, to keep the interface flexible) and hence avoid the memory
consumption of the result.</p>

<h2><a name="libxml">The libxml nodes</a></h2>

<p>DOM-like trees, as used and generated by libxml and libxslt, are relatively
complex. Most node types follow the structure below, with a few variations
depending on the node type:</p>

<p align="center"><img src="node.gif" alt="description of a libxml node"></p>

<p>Nodes carry a <strong>name</strong>, and the node <strong>type</strong>
indicates the kind of node it represents. The most common ones are:</p>
<ul>
  <li>document nodes</li>
  <li>element nodes</li>
  <li>text nodes</li>
</ul>

<p>For the XSLT processing, entity nodes should not be generated (i.e. they
should be replaced by their content). Most nodes also contain the following
"navigation" information:</p>
<ul>
  <li>the containing <strong>doc</strong>ument</li>
  <li>the <strong>parent</strong> node</li>
  <li>the first child node (<strong>children</strong>)</li>
  <li>the <strong>last</strong> child node</li>
  <li>the <strong>prev</strong>ious sibling</li>
  <li>the following sibling (<strong>next</strong>)</li>
</ul>

<p>Element nodes carry the list of attributes in the
<strong>properties</strong> field. An attribute itself holds the navigation
pointers and the children list (the attribute value is not represented as a
simple string, to allow the use of entity references).</p>

<p>The <strong>ns</strong> field points to the namespace declaration for the
namespace associated with the node; <strong>nsDef</strong> is the linked list
of namespace declarations present on element nodes.</p>

<p>Most nodes also carry an <strong>_private</strong> pointer which can be
used by the application to hold specific data on this node.</p>
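
<p>As a rough illustration of the navigation pointers listed above, here is a
minimal, self-contained C sketch. <code>MiniNode</code> and the two helper
functions are hypothetical names made up for this example; this is not the
libxml node structure, which has more fields (type, doc, ns, properties and so
on) and should be looked up in the libxml headers.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of a tree node (assumption: NOT the real libxml layout). */
typedef struct MiniNode MiniNode;
struct MiniNode {
    void       *_private;  /* free for application use */
    const char *name;      /* node name */
    MiniNode   *children;  /* first child */
    MiniNode   *last;      /* last child */
    MiniNode   *parent;    /* parent node */
    MiniNode   *next;      /* following sibling */
    MiniNode   *prev;      /* previous sibling */
};

/* Append child as the last child of parent, updating every link. */
void mini_add_child(MiniNode *parent, MiniNode *child)
{
    assert(parent != NULL && child != NULL);
    child->parent = parent;
    child->next = NULL;
    child->prev = parent->last;
    if (parent->last != NULL)
        parent->last->next = child;   /* chain after the current last child */
    else
        parent->children = child;     /* first child of an empty parent */
    parent->last = child;
}

/* Count the children by walking the sibling chain forward. */
int mini_count_children(const MiniNode *parent)
{
    int n = 0;
    const MiniNode *cur;

    assert(parent != NULL);
    for (cur = parent->children; cur != NULL; cur = cur->next)
        n++;
    return n;
}
```

<p>In this sketch, keeping both a first-child and a last-child pointer makes
appending a node a constant-time operation, which is convenient when a tree is
built node by node.</p>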

<h2><a name="XSLT">The XSLT processing steps</a></h2>

<p>There are a few steps, which are clearly decoupled at the interface
level:</p>
<ol>
  <li>parse the stylesheet and generate a DOM tree</li>
  <li>take the stylesheet tree and build a compiled version of it (this is
    the compilation phase)</li>
  <li>parse the input and generate a DOM tree</li>
  <li>process the stylesheet against the input tree and generate an output
    tree</li>
  <li>serialize the output tree</li>
</ol>

<p>A few things should be noted here:</p>
<ul>
  <li>steps 1/, 3/ and 5/ are optional</li>
  <li>the stylesheet obtained at 2/ can be reused by multiple runs of step 4/
    (and this should also work in threaded programs)</li>
  <li>the tree provided in 2/ should never be freed using xmlFreeDoc, but by
    freeing the stylesheet</li>
  <li>the input tree of 4/ is not modified, except for the _private field,
    which may be used for labelling keys if the stylesheet uses keys</li>
</ul>

<h2><a name="XSLT1">The XSLT stylesheet compilation</a></h2>

<p>This is the second step described. It takes a stylesheet tree and
"compiles" it: basically it associates with each node a structure, stored in
the _private field, containing information computed from the stylesheet:</p>

<p align="center"><img src="stylesheet.gif"
alt="a compiled XSLT stylesheet"></p>

<p>One xsltStylesheet structure is generated per document parsed for the
stylesheet. XSLT documents allow includes and imports of other documents.
Imports are stored in the <strong>imports</strong> list (hence keeping the
tree hierarchy of imports, which is very important for a proper XSLT
processing model) and includes are stored in the <strong>doclist</strong>
list. An imported stylesheet has a parent link that allows the tree to be
browsed.</p>

<p>The DOM tree associated with the document is stored in
<strong>doc</strong>. It is preprocessed to remove ignorable empty nodes, and
all the nodes in the XSLT namespace are subject to precomputation. This
usually consists of extracting all the context information from the context
tree (attributes, namespaces, XPath expressions) and storing it in an
xsltStylePreComp structure associated with the <strong>_private</strong>
field of the node.</p>

<p>A couple of notable exceptions to this are XSLT template nodes (more on
this later) and attribute value templates. If the latter are actually
templates, the value cannot be computed at compilation time (some
preprocessing could be done, like isolating and preparsing the XPath
subexpressions, but it is not done yet).</p>

<p>The xsltStylePreComp structure can also store the precompiled form of an
XPath expression associated with an XSLT element (more on this later).</p>
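
<p>The idea of a compilation pass that hangs precomputed data on the
<strong>_private</strong> field can be sketched as follows. This is a toy
model under stated assumptions: <code>StyleNode</code>, <code>PreComp</code>,
<code>precompute()</code> and <code>precomp_of()</code> are hypothetical names
for this example only, and the real xsltStylePreComp structure holds much more
than an instruction name and a select string.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define XSLT_NS "http://www.w3.org/1999/XSL/Transform"

/* Hypothetical, simplified stand-in for a stylesheet tree node. */
typedef struct StyleNode StyleNode;
struct StyleNode {
    void       *_private;   /* filled in by the compilation pass */
    const char *ns_uri;     /* namespace URI, or NULL */
    const char *name;       /* local name, e.g. "value-of" */
    const char *select;     /* raw "select" attribute, or NULL */
    StyleNode  *children;   /* first child */
    StyleNode  *next;       /* following sibling */
};

/* Hypothetical precomputed data attached to an XSLT instruction. */
typedef struct {
    const char *func;       /* which instruction this node is */
    const char *select;     /* expression text gathered once, up front */
} PreComp;

/* Walk the tree and attach a PreComp to every node in the XSLT namespace.
 * Returns the number of nodes precomputed, or -1 on allocation failure
 * (a fuller version would also release what was already attached). */
int precompute(StyleNode *node)
{
    int count = 0;

    for (; node != NULL; node = node->next) {
        if (node->ns_uri != NULL && strcmp(node->ns_uri, XSLT_NS) == 0) {
            PreComp *comp = malloc(sizeof(*comp));
            if (comp == NULL)
                return -1;
            comp->func = node->name;
            comp->select = node->select;
            node->_private = comp;
            count++;
        }
        int sub = precompute(node->children);
        if (sub < 0)
            return -1;
        count += sub;
    }
    return count;
}

/* At transformation time the precomputed data is one pointer away. */
const PreComp *precomp_of(const StyleNode *node)
{
    assert(node != NULL);
    return (const PreComp *)node->_private;
}
```

<p>The point of the sketch is only that the work of inspecting attributes is
done once per stylesheet node, not once per processed input node.</p>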

<h2><a name="XSLT2">The XSLT template compilation</a></h2>

<p>Proper handling of template lookup is one of the keys to fast XSLT
processing (given a node in the source document, this is the process of
finding which templates should be applied to this node). Libxslt follows the
hint suggested in the <a href="http://www.w3.org/TR/xslt#patterns">5.2
Patterns</a> section of the XSLT Recommendation, i.e. it doesn't evaluate a
pattern as an XPath expression but tokenizes it and compiles it as a set of
rules to be evaluated on a candidate node. There is usually an indication of
the node name in the last step of this evaluation, and this is used as a key
check for the match. As a result libxslt builds a relatively more complex set
of structures for the templates:</p>

<p align="center"><img src="templates.gif"
alt="The templates related structure"></p>

<p>Let's describe a bit more closely what is built. First the xsltStylesheet
structure holds a pointer to the template hash table. All the XSLT patterns
compiled in this stylesheet are indexed by the value of the target element
(or attribute, pi ...) name, so when an element or an attribute "foo" needs to
be processed the lookup is done using the name as a key.</p>

<p>Each pattern is compiled into an xsltCompMatch structure. It holds the set
of rules based on the tokenization of the pattern, basically stored in reverse
order (matching is easier this way). It also holds some information about the
previous matches, used to speed up the process when one iterates over a set of
siblings (this optimization may be defeated by thrashing when running threaded
computation; it's unclear that this is a big deal in practice). Predicate
expressions are not compiled at this stage. They may be compiled at run time
if needed, but in this case they are compiled as full XPath expressions (the
use of some fixed predicates can probably be optimized, but is not yet).</p>
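
<p>To see why storing the steps in reverse order makes matching easier, here
is a small self-contained sketch. It is an assumption-laden model, not the
xsltCompMatch code: it only handles patterns made of child steps with plain
names (such as <code>doc/chapter/title</code>), and <code>PNode</code>,
<code>CompPattern</code>, <code>compile_pattern()</code> and
<code>pattern_matches()</code> are hypothetical names.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal node: only what is needed to climb towards the root. */
typedef struct PNode {
    const char   *name;
    struct PNode *parent;
} PNode;

#define MAX_STEPS 8

/* Hypothetical compiled pattern: name tests stored last step first. */
typedef struct {
    const char *steps[MAX_STEPS];
    int         nsteps;
} CompPattern;

/* Compile a pattern made only of child steps, e.g. "doc/chapter/title".
 * buf is modified in place and must outlive the compiled pattern.
 * Returns 0 on success, -1 if there are too many steps. */
int compile_pattern(char *buf, CompPattern *comp)
{
    const char *fwd[MAX_STEPS];
    int n = 0;
    char *tok;

    assert(buf != NULL && comp != NULL);
    for (tok = strtok(buf, "/"); tok != NULL; tok = strtok(NULL, "/")) {
        if (n == MAX_STEPS)
            return -1;
        fwd[n++] = tok;
    }
    comp->nsteps = n;
    for (int i = 0; i < n; i++)
        comp->steps[i] = fwd[n - 1 - i];   /* reverse the order */
    return 0;
}

/* Test a candidate node: check its own name first, then climb the parents.
 * Returns 1 on match, 0 otherwise. */
int pattern_matches(const CompPattern *comp, const PNode *node)
{
    for (int i = 0; i < comp->nsteps; i++) {
        if (node == NULL || strcmp(node->name, comp->steps[i]) != 0)
            return 0;                      /* fail as early as possible */
        node = node->parent;
    }
    return 1;
}
```

<p>With the last step stored first, the test starts at the candidate node
itself and only follows parent links, so most non-matching nodes are rejected
after a single name comparison.</p>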

<p>The xsltCompMatch structures are then stored in the hash table. The
collision list is itself sorted by template priority, to implement the XSLT
priority rules "naturally".</p>
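
<p>A hash table whose collision lists are kept sorted by priority can be
sketched like this. Everything here is hypothetical and simplified for the
example (<code>Match</code>, <code>MatchTable</code>, the hash function, the
fixed bucket count); in particular, entries with equal priority simply keep
their insertion order, which is a choice of this sketch and not a statement
about libxslt.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NBUCKETS 16

/* Hypothetical stand-in for a compiled pattern stored in the table. */
typedef struct Match {
    const char   *name;      /* key: name tested by the last step */
    double        priority;  /* template priority */
    const char   *label;     /* identifies the template in this sketch */
    struct Match *next;      /* next entry in the same bucket */
} Match;

typedef struct {
    Match *buckets[NBUCKETS];
} MatchTable;

/* Simple multiplicative string hash, reduced to a bucket index. */
static unsigned hash_name(const char *s)
{
    unsigned h = 0;

    while (*s != '\0')
        h = h * 31u + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Insert m, keeping its bucket sorted by decreasing priority. */
void table_add(MatchTable *t, Match *m)
{
    Match **link;

    assert(t != NULL && m != NULL && m->name != NULL);
    link = &t->buckets[hash_name(m->name)];
    while (*link != NULL && (*link)->priority >= m->priority)
        link = &(*link)->next;
    m->next = *link;
    *link = m;
}

/* Return the highest-priority entry registered for name, or NULL. */
const Match *table_lookup(const MatchTable *t, const char *name)
{
    const Match *m;

    for (m = t->buckets[hash_name(name)]; m != NULL; m = m->next)
        if (strcmp(m->name, name) == 0)
            return m;   /* first hit wins: the list is priority-sorted */
    return NULL;
}
```

<p>Because the ordering is paid for at insertion time, the lookup can stop at
the first entry that matches, without comparing priorities.</p>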

<p>Associated with the compiled pattern is the xsltTemplate itself, containing
the information actually required for the processing of the pattern,
including of course a pointer to the list of elements used for building the
pattern result.</p>

<p>Last but not least, a number of patterns do not fit in the hash table
because they are not associated with a name. This is the case for patterns
applying to the root, any element, any attribute, text nodes, pi nodes, keys
etc. Those are stored independently in the stylesheet structure as separate
linked lists of xsltCompMatch.</p>

<h2><a name="processing">The processing itself</a></h2>

<p>The processing is actually defined by the XSLT specification (the basics of
the algorithm are explained in <a
href="http://www.w3.org/TR/xslt#section-Introduction">the Introduction</a>
section). Basically it works by taking the root of the input document and
applying the following algorithm:</p>
<ol>
  <li>find the template applying to it. Basically this is a lookup in the
    template hash table, walking the hash list until the node satisfies all
    the steps of the pattern, then checking the appropriate global templates
    to see if there isn't a higher priority rule to apply</li>
  <li>if there is no template, apply the default rule (recurse on the
    children)</li>
  <li>else walk the content list of the selected template; for each node in
    it:
    <ul>
      <li>if the node is in the XSLT namespace then the node has a _private
        field pointing to the preprocessed values, jump to the specific
      code</li>
      <li>if the node is in an extension namespace, look up the associated
        behaviour</li>
      <li>otherwise copy the node.</li>
    </ul>
    <p>The closure is usually done through the XSLT
    <strong>apply-templates</strong> construct, recursing by applying the
    adequate template on the input node children or on the result of an
    associated XPath selection lookup.</p>
  </li>
</ol>
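
<p>The three steps above can be modelled in a few lines of C. This is only a
sketch under strong assumptions: a "template" here is reduced to literal text
placed before and after an implicit apply-templates on the children, lookup is
a linear search by element name, and text nodes are simply copied to the
output. <code>INode</code>, <code>Rule</code>, <code>Ctxt</code> and
<code>process_node()</code> are hypothetical names and do not reflect the
actual code in <code>transform.c</code>.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef struct INode {
    const char   *name;      /* element name, or NULL for a text node */
    const char   *text;      /* content of a text node */
    struct INode *children;  /* first child */
    struct INode *next;      /* following sibling */
} INode;

/* Hypothetical template: literal output around an apply-templates call. */
typedef struct {
    const char *match;       /* element name this template applies to */
    const char *before;      /* literal output emitted before recursing */
    const char *after;       /* literal output emitted after recursing */
} Rule;

typedef struct {
    const Rule *rules;
    int         nrules;
    char        out[256];    /* output accumulated as a string */
} Ctxt;

/* Append s to the output buffer, truncating if the buffer is full. */
static void emit(Ctxt *c, const char *s)
{
    size_t used = strlen(c->out);

    snprintf(c->out + used, sizeof(c->out) - used, "%s", s);
}

void process_node(Ctxt *c, const INode *node);

/* Apply the algorithm to each child in document order. */
static void process_children(Ctxt *c, const INode *node)
{
    for (const INode *cur = node->children; cur != NULL; cur = cur->next)
        process_node(c, cur);
}

void process_node(Ctxt *c, const INode *node)
{
    const Rule *found = NULL;

    assert(c != NULL && node != NULL);
    if (node->name == NULL) {            /* text node: copy its content */
        if (node->text != NULL)
            emit(c, node->text);
        return;
    }
    /* Step 1: find a template applying to the node. */
    for (int i = 0; i < c->nrules && found == NULL; i++)
        if (strcmp(c->rules[i].match, node->name) == 0)
            found = &c->rules[i];
    if (found == NULL) {
        /* Step 2: no template, default rule: recurse on the children. */
        process_children(c, node);
        return;
    }
    /* Step 3: walk the template content. */
    emit(c, found->before);
    process_children(c, node);           /* stands in for apply-templates */
    emit(c, found->after);
}
```

<p>In this model an element without a matching rule contributes nothing of
its own to the output, but its descendants are still visited, which is how a
single rule for a deeply nested element can be reached from the root.</p>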

<p>Note that large parts of the input tree may not be processed by a given
stylesheet and that, conversely, some parts may be processed multiple times
(often the case when a table of contents is built).</p>

<p>The module <code>transform.c</code> is the one implementing most of this
logic. <strong>xsltApplyStylesheet()</strong> is the entry point; it allocates
an xsltTransformContext containing the following:</p>
<ul>
  <li>a pointer to the stylesheet being processed</li>
  <li>a stack of templates</li>
  <li>a stack of variables and parameters</li>
  <li>an XPath context</li>
  <li>the template mode</li>
  <li>current document</li>
  <li>current input node</li>
  <li>current selected node list</li>
  <li>the current insertion points in the output document</li>
  <li>a couple of hash tables for extension elements and functions</li>
</ul>

<p>Then a new document is allocated (HTML or XML depending on the type of
output), and the user parameters and global variables and parameters are
evaluated. Then <strong>xsltProcessOneNode()</strong>, which implements the
1-2-3 algorithm, is called on the root element of the input. Step 1/ is
implemented by calling <strong>xsltGetTemplate()</strong>, step 2/ is
implemented by <strong>xsltDefaultProcessOneNode()</strong> and step 3/ is
implemented by <strong>xsltApplyOneTemplate()</strong>.</p>

<h2><a name="XPath">XPath expression compilation</a></h2>

<p>The XPath support is actually implemented in the libxml module (where it is
reused by the XPointer implementation). XPath is a relatively classic
expression language; the only uncommon feature is that it works on XML trees
and hence has specific syntax and types to handle them.</p>

<p>XPath expressions are compiled using <strong>xmlXPathCompile()</strong>. It
takes an expression string as input and generates a structure containing the
parsed expression tree. For example the expression:</p>
<pre>/doc/chapter[title='Introduction']</pre>

<p>will be compiled as</p>
<pre>Compiled Expression : 10 elements
  SORT
    COLLECT  'child' 'name' 'node' chapter
      COLLECT  'child' 'name' 'node' doc
        ROOT
      PREDICATE
        SORT
          EQUAL =
            COLLECT  'child' 'name' 'node' title
              NODE
            ELEM Object is a string : Introduction
              COLLECT  'child' 'name' 'node' title
                NODE</pre>

<p>This can be tested using the <code>testXPath</code> command (in the
libxml codebase) with the <code>--tree</code> option.</p>

<p>Again, the KISS approach is used and no optimization is done. This could be
an interesting thing to add (<a
href="http://www-106.ibm.com/developerworks/library/x-xslt2/?dwzone=x?open&amp;l=132%2ct=gr%2c+p=saxon">Michael
Kay describes</a> a lot of possible and interesting optimizations done in
Saxon which would be possible at this level), but I'm unsure they would
provide much gain since the expressions tend to be relatively simple in
general and stylesheets are still hand generated. Optimizations at the
interpretation level sound likely to be more efficient.</p>

<h2><a name="XPath1">XPath expression interpretation</a></h2>

<p>@@</p>

<h2><a name="stack">The stack frame</a></h2>

<p>@@</p>

<h2><a name="Extension">Extension support</a></h2>

<p>@@</p>

<h2><a name="Futher">Further reading</a></h2>

<p>@@</p>

<h2><a name="TODOs">TODOs</a></h2>

<p>@@</p>

<p></p>

<p><a href="mailto:Daniel.Veillard@imag.fr">Daniel Veillard</a></p>

<p>$Id$</p>
</body>
</html>