<html>
<head>
<title>The XML C library for Gnome</title>
<meta name="GENERATOR" content="amaya V4.1">
<meta http-equiv="Content-Type" content="text/html">
</head>

<body bgcolor="#ffffff">
<p><a href="http://www.gnome.org/"><img src="smallfootonly.gif"
alt="Gnome Logo"></a><a href="http://www.redhat.com"><img src="redhat.gif"
alt="Red Hat Logo"></a></p>

<h1 align="center">The XSLT C library for Gnome explained</h1>

<h1 style="text-align: center">How does it work?</h1>

<p></p>
<ul>
  <li><a href="#Introducti">Introduction</a></li>
  <li><a href="#Basics">Basics</a></li>
  <li><a href="#Keep">Keep it simple stupid</a></li>
  <li><a href="#libxml">The libxml nodes</a></li>
  <li><a href="#XSLT">The XSLT processing steps</a></li>
  <li><a href="#XSLT1">The XSLT stylesheet compilation</a></li>
  <li><a href="#XSLT2">The XSLT template compilation</a></li>
  <li><a href="#processing">The processing itself</a></li>
  <li><a href="#XPath">XPath expression compilation</a></li>
  <li><a href="#XPath1">XPath interpretation</a></li>
  <li><a href="#stack">The stack frame</a></li>
  <li><a href="#Extension">Extension support</a></li>
  <li><a href="#Futher">Further reading</a></li>
  <li><a href="#TODOs">TODOs</a></li>
</ul>

<h2><a name="Introducti">Introduction</a></h2>

<p>This document describes the internal processing of <a
href="http://xmlsoft.org/XSLT/">libxslt</a>, the <a
href="http://www.w3.org/TR/xslt">XSLT</a> C library developed for the <a
href="http://www.gnome.org/">Gnome</a> project.</p>

<h2><a name="Basics">Basics</a></h2>

<p>XSLT is a transformation language: taking an input document and a
stylesheet document, it generates an output document:</p>

<p align="center"><img src="processing.gif"
alt="the XSLT processing model"></p>

<p>Libxslt is written in C. It relies on libxml for the following
operations:</p>
<ul>
  <li>parsing files</li>
  <li>building the in-memory DOM structure associated with the documents
    handled</li>
  <li>the XPath implementation</li>
  <li>serializing the result document back to XML or HTML (text output is
    handled directly)</li>
</ul>

<h2><a name="Keep">Keep it simple stupid</a></h2>

<p>Libxslt is not very specialized. It is built under the assumption that all
nodes from the source and output documents can fit in the virtual memory of
the system. This is a big trade-off: it is fine for reasonably sized
documents but may not be suitable for large sets of data. The gain is that the
library can be used in a relatively versatile way, since the input or output
may never be serialized at all. The cost is that the size of the documents it
can handle is limited by the amount of memory available.</p>

<p>More specialized memory handling approaches are possible, like building the
input tree from a serialization progressively as it is consumed, factoring
repetitive patterns, or even generating the output on the fly as the input
is parsed, but this is possible only for a limited subset of stylesheets. In
general the implementation of libxslt follows this pattern:</p>
<ul>
  <li>KISS (keep it simple stupid)</li>
  <li>when there is a clear bottleneck, optimize on top of this simple
    framework and refine only as much as is needed to reach the expected
    result</li>
</ul>

<p>The result is not that bad. Clearly one could do a better job, but only by
being more specialized too. Most optimizations, like building the tree on
demand, would need serious changes to the libxml XPath framework. An easy step
would be to serialize the output directly (or call a set of SAX-like output
handlers to keep the interface flexible) and hence avoid the memory
consumption of the result tree.</p>

<h2><a name="libxml">The libxml nodes</a></h2>

<p>DOM-like trees as used and generated by libxml and libxslt are relatively
complex. Most node types follow the structure below, with a few variations
depending on the node type:</p>

<p align="center"><img src="node.gif" alt="description of a libxml node"></p>

<p>Nodes carry a <strong>name</strong>, and the node <strong>type</strong>
indicates the kind of node it represents. The most common ones are:</p>
<ul>
  <li>document nodes</li>
  <li>element nodes</li>
  <li>text nodes</li>
</ul>

<p>For XSLT processing, entity nodes should not be generated (i.e. they
should be replaced by their content). Most nodes also contain the following
"navigation" information:</p>
<ul>
  <li>the containing <strong>doc</strong>ument</li>
  <li>the <strong>parent</strong> node</li>
  <li>the first child node (<strong>children</strong>)</li>
  <li>the <strong>last</strong> child node</li>
  <li>the <strong>prev</strong>ious sibling</li>
  <li>the following sibling (<strong>next</strong>)</li>
</ul>

<p>Element nodes carry the list of attributes in the properties field. An
attribute itself holds the navigation pointers and the children list (the
attribute value is not represented as a simple string, to allow the use of
entity references).</p>

<p>The <strong>ns</strong> field points to the namespace declaration for the
namespace associated with the node; <strong>nsDef</strong> is the linked list
of namespace declarations present on element nodes.</p>

<p>Most nodes also carry a <strong>_private</strong> pointer which can be
used by the application to hold specific data on this node.</p>

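<p>The navigation fields listed above can be illustrated with a minimal
sketch. The <code>Node</code> type and the two helper functions below are
hypothetical and much simpler than the real libxml node structure (no
attributes, namespaces or content); they only show how the
<strong>parent</strong>, <strong>children</strong>, <strong>last</strong>,
<strong>prev</strong>, <strong>next</strong> and <strong>doc</strong>
pointers fit together.</p>

```c
#include <stddef.h>

/* Simplified, hypothetical node: NOT the real libxml definition, only the
 * navigation fields named in the text. */
typedef enum { DOC_NODE, ELEMENT_NODE, TEXT_NODE } NodeType;

typedef struct Node {
    void        *_private;  /* free for application data */
    NodeType     type;      /* kind of node */
    const char  *name;
    struct Node *children;  /* first child */
    struct Node *last;      /* last child */
    struct Node *parent;
    struct Node *next;      /* following sibling */
    struct Node *prev;      /* previous sibling */
    struct Node *doc;       /* containing document node */
} Node;

/* Link child as the last child of parent, keeping every navigation
 * pointer consistent. */
void node_add_child(Node *parent, Node *child) {
    child->parent = parent;
    child->doc = (parent->type == DOC_NODE) ? parent : parent->doc;
    child->next = NULL;
    child->prev = parent->last;
    if (parent->last != NULL)
        parent->last->next = child;
    else
        parent->children = child;
    parent->last = child;
}

/* Depth-first walk counting the element nodes in the subtree rooted at
 * node, using only the children/next pointers. */
int node_count_elements(const Node *node) {
    int count = (node->type == ELEMENT_NODE) ? 1 : 0;
    for (const Node *cur = node->children; cur != NULL; cur = cur->next)
        count += node_count_elements(cur);
    return count;
}
```

<p>Keeping both a first-child and a last-child pointer makes appending a node
a constant-time operation, which matters when an output tree is built node by
node.</p>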
<h2><a name="XSLT">The XSLT processing steps</a></h2>

<p>Basically there are a few steps which are clearly decoupled at the
interface level:</p>
<ol>
  <li>parse the stylesheet and generate a DOM tree</li>
  <li>take the stylesheet tree and build a compiled version of it (this is the
    compilation phase)</li>
  <li>parse the input and generate a DOM tree</li>
  <li>process the stylesheet against the input tree and generate an output
    tree</li>
  <li>serialize the output tree</li>
</ol>

<p>A few things should be noted here:</p>
<ul>
  <li>steps 1, 3 and 5 are optional: the trees may be built or consumed in
    memory without parsing or serializing</li>
  <li>the stylesheet obtained at step 2 can be reused by multiple runs of
    step 4 (and this should also work in threaded programs)</li>
  <li>the tree provided in step 2 should never be freed using xmlFreeDoc, but
    by freeing the stylesheet</li>
  <li>the input tree of step 4 is not modified, except for the _private field
    which may be used for labelling keys if keys are used by the
    stylesheet</li>
</ul>

<h2><a name="XSLT1">The XSLT stylesheet compilation</a></h2>

<p>This is the second step described above. It takes a stylesheet tree and
"compiles" it: basically it associates with each node a structure stored in
the _private field and containing information computed from the
stylesheet:</p>

<p align="center"><img src="stylesheet.gif"
alt="a compiled XSLT stylesheet"></p>

<p>One xsltStylesheet structure is generated per document parsed for the
stylesheet. XSLT documents allow includes and imports of other documents.
Imports are stored in the <strong>imports</strong> list (hence keeping the
tree hierarchy of the imports, which is very important for a proper XSLT
processing model) and includes are stored in the <strong>doclist</strong>
list. An imported stylesheet has a parent link to allow browsing the
tree.</p>

<p>The DOM tree associated with the document is stored in
<strong>doc</strong>. It is preprocessed to remove ignorable empty nodes, and
all the nodes in the XSLT namespace are subject to precomputation. This
usually consists of extracting all the context information from the context
tree (attributes, namespaces, XPath expressions) and storing it in an
xsltStylePreComp structure attached to the <strong>_private</strong> field of
the node.</p>

<p>A couple of notable exceptions to this are XSLT template nodes (more on
this later) and attribute value templates: if they are actually templates, the
value cannot be computed at compilation time (some preprocessing could be
done, like isolating and preparsing the XPath subexpressions, but it is not
done yet).</p>

<p>The xsltStylePreComp structure also allows storing the precompiled form of
an XPath expression which can be associated with an XSLT element (more on this
later).</p>

<h2><a name="XSLT2">The XSLT template compilation</a></h2>

<p>Proper handling of template lookup is one of the keys to fast XSLT
processing (given a node in the source document, this is the process of
finding which templates should be applied to this node). Libxslt follows the
hint suggested in the <a href="http://www.w3.org/TR/xslt#patterns">5.2
Patterns</a> section of the XSLT Recommendation, i.e. it doesn't evaluate a
pattern as an XPath expression but tokenizes it and compiles it as a set of
rules to be evaluated on a candidate node. There is usually an indication of
the node name in the last step of this evaluation, and this is used as a key
check for the match. As a result libxslt builds a relatively complex set of
structures for the templates:</p>

<p align="center"><img src="templates.gif"
alt="The templates related structure"></p>

<p>Let's describe a bit more closely what is built. First the xsltStylesheet
structure holds a pointer to the template hash table. All the XSLT patterns
compiled in this stylesheet are indexed by the value of the target element
(or attribute, pi ...) name, so when an element or an attribute "foo" needs to
be processed the lookup is done using the name as a key.</p>

<p>Each pattern is compiled into an xsltCompMatch structure. It holds the set
of rules based on the tokenization of the pattern, basically stored in
reverse order (matching is easier this way). It also holds some information
about the previous matches, used to speed up the process when one iterates
over a set of siblings (this optimization may be defeated by thrashing when
running threaded computation; it's unclear that this is a big deal in
practice). Predicate expressions are not compiled at this stage. They may be
at run time if needed, but in this case they are compiled as full XPath
expressions (the use of some fixed predicates could probably be optimized, but
they are not yet).</p>

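<p>To see why storing the steps in reverse order makes matching easier, here
is a minimal sketch. It is not the xsltCompMatch code: the <code>Node</code>,
<code>Step</code> and <code>CompMatch</code> types and both functions are
hypothetical, and only relative patterns made of plain child steps such as
<code>chapter/title</code> are handled (no predicates, wildcards, attributes
or leading <code>/</code>). The last step is tested first against the
candidate node, then the matcher climbs the parent links.</p>

```c
#include <stddef.h>
#include <string.h>

#define MAX_STEPS 8

/* Hypothetical minimal node: only what the matcher needs. */
typedef struct Node {
    const char  *name;
    struct Node *parent;
} Node;

/* One tokenized step: a name that points into the pattern string. */
typedef struct {
    const char *name;
    size_t      len;
} Step;

/* Hypothetical compiled pattern: "doc/chapter/title" is stored as
 * { title, chapter, doc }, i.e. last step first. */
typedef struct {
    Step steps[MAX_STEPS];
    int  nsteps;
} CompMatch;

/* Tokenize the pattern from its end. Returns 0 on success and -1 for
 * anything this sketch does not support (empty pattern, empty step,
 * leading '/', too many steps). The pattern string must outlive comp. */
int comp_match_compile(CompMatch *comp, const char *pattern) {
    size_t end = strlen(pattern);
    comp->nsteps = 0;
    if (end == 0 || pattern[0] == '/')
        return -1;
    while (end > 0) {
        size_t start = end;
        while (start > 0 && pattern[start - 1] != '/')
            start--;
        if (start == end || comp->nsteps == MAX_STEPS)
            return -1;
        comp->steps[comp->nsteps].name = pattern + start;
        comp->steps[comp->nsteps].len = end - start;
        comp->nsteps++;
        end = (start > 0) ? start - 1 : 0;
    }
    return 0;
}

/* Test a candidate node: step 0 checks the node itself, each following
 * step checks the next ancestor. Returns 1 on match, 0 otherwise. */
int comp_match_test(const CompMatch *comp, const Node *node) {
    for (int i = 0; i < comp->nsteps; i++) {
        const Step *step = &comp->steps[i];
        if (node == NULL)
            return 0;
        if (strlen(node->name) != step->len ||
            strncmp(node->name, step->name, step->len) != 0)
            return 0;
        node = node->parent;
    }
    return 1;
}
```

<p>Because the first rule tested is the name of the candidate node itself,
most non-matching patterns are rejected after a single string comparison,
without ever walking up the tree.</p>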
<p>The xsltCompMatch structures are then stored in the hash table. The
collision list is itself sorted by priority of the template, to implement
"naturally" the XSLT priority rules.</p>

<p>Associated with the compiled pattern is the xsltTemplate itself, containing
the information actually required for processing the pattern, including
of course a pointer to the list of elements used for building the pattern
result.</p>

<p>Last but not least, a number of patterns do not fit in the hash table
because they are not associated with a name. This is the case for patterns
applying to the root, any element, any attribute, text nodes, pi nodes, keys,
etc. Those are stored independently in the stylesheet structure as separate
linked lists of xsltCompMatch.</p>

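<p>The idea of a name-indexed hash table whose collision lists are kept
sorted by priority can be sketched as follows. All names here
(<code>Entry</code>, <code>TemplateTable</code>, <code>table_add</code>,
<code>table_lookup</code>) are hypothetical and the hash function is an
arbitrary choice; in the real library each candidate found this way would
still have to pass the remaining steps of its compiled pattern.</p>

```c
#include <stddef.h>
#include <string.h>

#define NBUCKETS 16

/* Hypothetical compiled-template entry, not the real xsltCompMatch. */
typedef struct Entry {
    const char   *name;      /* name tested by the last pattern step */
    double        priority;  /* template priority */
    const char   *label;     /* stands in for the template body */
    struct Entry *next;      /* next entry in the same bucket */
} Entry;

typedef struct {
    Entry *buckets[NBUCKETS];
} TemplateTable;

/* Simple string hash (an assumption of this sketch). */
unsigned hash_name(const char *name) {
    unsigned h = 5381;
    while (*name != '\0')
        h = h * 33u + (unsigned char)*name++;
    return h % NBUCKETS;
}

/* Insert an entry, keeping its bucket sorted by decreasing priority.
 * An entry is placed before existing entries of equal priority, so the
 * one added last wins a tie. */
void table_add(TemplateTable *table, Entry *entry) {
    Entry **link = &table->buckets[hash_name(entry->name)];
    while (*link != NULL && (*link)->priority > entry->priority)
        link = &(*link)->next;
    entry->next = *link;
    *link = entry;
}

/* Return the first entry for name. Since the bucket is sorted, this is
 * the highest priority candidate; NULL means no named template applies. */
const Entry *table_lookup(const TemplateTable *table, const char *name) {
    const Entry *e = table->buckets[hash_name(name)];
    for (; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}
```

<p>Sorting at insertion time moves the cost to stylesheet compilation, which
happens once, so that each lookup during the transformation can stop at the
first candidate that matches.</p>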
<h2><a name="processing">The processing itself</a></h2>

<p>The processing is actually defined by the XSLT specification (the
basis of the algorithm is explained in <a
href="http://www.w3.org/TR/xslt#section-Introduction">the Introduction</a>
section). Basically it works by taking the root of the input document and
applying the following algorithm:</p>
<ol>
  <li>Find the template applying to it. Basically this is a lookup in the
    template hash table, walking the hash list until the node satisfies all
    the steps of the pattern, then checking the appropriate global template
    lists to see if there isn't a higher priority rule to apply.</li>
  <li>If there is no template, apply the default rule (recurse on the
    children).</li>
  <li>Otherwise walk the content list of the selected template; for each node
    in it:
    <ul>
      <li>if the node is in the XSLT namespace then the node has a _private
        field pointing to the preprocessed values, jump to the specific
        code</li>
      <li>if the node is in an extension namespace, look up the associated
        behaviour</li>
      <li>otherwise copy the node.</li>
    </ul>
    <p>The closure is usually done through the XSLT
    <strong>apply-templates</strong> construct, which recurses by applying the
    adequate template on the input node children or on the result of an
    associated XPath selection lookup.</p>
  </li>
</ol>

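<p>The three steps above can be sketched as a small recursive function. This
is a toy model, not the libxslt code: the <code>Node</code> and
<code>Template</code> types are hypothetical, templates match on the element
name only, and a template body is reduced to an opening string, an optional
recursion standing in for <strong>apply-templates</strong>, and a closing
string. The sketch also assumes the XSLT built-in rule that copies text nodes
when no template applies.</p>

```c
#include <stddef.h>
#include <string.h>

typedef struct Node {
    const char  *name;      /* element name, NULL for text nodes */
    const char  *text;      /* content of text nodes */
    struct Node *children;  /* first child */
    struct Node *next;      /* following sibling */
} Node;

/* Hypothetical stand-in for a compiled template. */
typedef struct {
    const char *match;            /* element name to match */
    const char *open;             /* emitted before the children */
    const char *close;            /* emitted after the children */
    int         apply_templates;  /* nonzero: recurse on the children */
} Template;

/* Step 1: find the template applying to the node (linear scan here). */
const Template *find_template(const Template *tpls, size_t n,
                              const Node *node) {
    if (node->name == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        if (strcmp(tpls[i].match, node->name) == 0)
            return &tpls[i];
    return NULL;
}

/* Append s to the NUL-terminated buffer out, truncating if needed. */
void append(char *out, size_t size, const char *s) {
    size_t used = strlen(out);
    if (used + 1 < size)
        strncat(out, s, size - 1 - used);
}

void process_node(const Template *tpls, size_t n, const Node *node,
                  char *out, size_t size);

void process_children(const Template *tpls, size_t n, const Node *node,
                      char *out, size_t size) {
    for (const Node *c = node->children; c != NULL; c = c->next)
        process_node(tpls, n, c, out, size);
}

void process_node(const Template *tpls, size_t n, const Node *node,
                  char *out, size_t size) {
    const Template *t = find_template(tpls, n, node);
    if (t == NULL) {
        /* Step 2: default rule, copy text or recurse on the children. */
        if (node->name == NULL)
            append(out, size, node->text);
        else
            process_children(tpls, n, node, out, size);
        return;
    }
    /* Step 3: run the template content. */
    append(out, size, t->open);
    if (t->apply_templates)
        process_children(tpls, n, node, out, size);
    append(out, size, t->close);
}
```

<p>A template that does not recurse, like an empty one, silently drops the
whole subtree of the node it matches: the recursion only continues where a
template, or the default rule, asks for it.</p>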
<p>Note that large parts of the input tree may not be processed by a given
stylesheet and that, conversely, some may be processed multiple times
(often the case when a Table of Contents is built).</p>

<p>The module <code>transform.c</code> is the one implementing most of this
logic. <strong>xsltApplyStylesheet()</strong> is the entry point; it allocates
an xsltTransformContext containing the following:</p>
<ul>
  <li>a pointer to the stylesheet being processed</li>
  <li>a stack of templates</li>
  <li>a stack of variables and parameters</li>
  <li>an XPath context</li>
  <li>the template mode</li>
  <li>the current document</li>
  <li>the current input node</li>
  <li>the current selected node list</li>
  <li>the current insertion points in the output document</li>
  <li>a couple of hash tables for extension elements and functions</li>
</ul>

<p>Then a new document gets allocated (HTML or XML depending on the type of
output), and the user parameters and global variables and parameters are
evaluated. Then <strong>xsltProcessOneNode()</strong>, which implements the
1-2-3 algorithm, is called on the root element of the input. Step 1 is
implemented by calling <strong>xsltGetTemplate()</strong>, step 2 is
implemented by <strong>xsltDefaultProcessOneNode()</strong> and step 3 is
implemented by <strong>xsltApplyOneTemplate()</strong>.</p>

<h2><a name="XPath">XPath expression compilation</a></h2>

<p>The XPath support is actually implemented in the libxml module (where it is
reused by the XPointer implementation). XPath is a relatively classic
expression language; the only uncommon feature is that it works on XML
trees and hence has specific syntax and types to handle them.</p>

<p>XPath expressions are compiled using <strong>xmlXPathCompile()</strong>. It
takes an expression string as input and generates a structure containing
the parsed expression tree. For example the expression:</p>
<pre>/doc/chapter[title='Introduction']</pre>

<p>will be compiled as</p>
<pre>Compiled Expression : 10 elements
  SORT
    COLLECT  'child' 'name' 'node' chapter
      COLLECT  'child' 'name' 'node' doc
        ROOT
      PREDICATE
        SORT
          EQUAL =
            COLLECT  'child' 'name' 'node' title
              NODE
            ELEM Object is a string : Introduction
              COLLECT  'child' 'name' 'node' title
                NODE</pre>

<p>This can be tested using the <code>testXPath</code> command (in the
libxml codebase) with the <code>--tree</code> option.</p>

<p>Again, the KISS approach is used and no optimization is done. This could be
an interesting thing to add (<a
href="http://www-106.ibm.com/developerworks/library/x-xslt2/?dwzone=x?open&l=132%2ct=gr%2c+p=saxon">Michael
Kay describes</a> a lot of possible and interesting optimizations done in
Saxon which would be possible at this level), but I'm unsure they would
provide much gain since the expressions tend to be relatively simple in
general and stylesheets are still written by hand. Optimizations at the
interpretation level sound likely to be more efficient.</p>

<h2><a name="XPath1">XPath interpretation</a></h2>

<p>@@</p>

<h2><a name="stack">The stack frame</a></h2>

<p>@@</p>

<h2><a name="Extension">Extension support</a></h2>

<p>@@</p>

<h2><a name="Futher">Further reading</a></h2>

<p>@@</p>

<h2><a name="TODOs">TODOs</a></h2>

<p>@@</p>

<p></p>

<p><a href="mailto:Daniel.Veillard@imag.fr">Daniel Veillard</a></p>

<p>$Id$</p>
</body>
</html>