mirror of https://github.com/postgres/postgres.git synced 2025-07-30 11:03:19 +03:00

1. I've now produced an updated version (and called it 0.2) of my XML
parser interface code. It now uses libxml2 instead of expat (though I've
left the old code in the tarball). This means *proper* XPath support, and
the provided function allows you to wrap your result set in XML tags to
produce a new XML document.

John Gray
Bruce Momjian
2001-08-21 00:39:20 +00:00
parent 44ae35cab9
commit 5950a984a7
4 changed files with 120 additions and 83 deletions

@ -1,67 +1,57 @@
PGXML TODO List
===============
Some of these items still require much more thought! The data model
for XML documents and the parsing model of expat don't really fit so
well with a standard SQL model.
Some of these items still require much more thought! Since the first
release, the XPath support has improved (because I'm no longer using a
homemade algorithm!).
1. Generalised XML parsing support
1. Performance considerations
Allow a user to specify handlers (in any PL) to be used by the parser.
This must permit distinct sets of parser settings -user may want some
documents in a database to be parsed with one set of handlers, others
with a different set.
At present each document is parsed to produce the DOM tree on every query.
i.e. the pgxml_parse function would take as parameters (document,
parsername) where parsername was the identifier for a collection of
handler etc. settings.
Pros:
Easy
No persistent memory or storage allocation for parsed trees
(libxml docs suggest representation of a document might
be 4 times the size of the text)
"Stub" handlers in the pgxml code would invoke the functions through
the standard fmgr interface. The parser interface would define the
prototype for these functions. How does the handler function know
which document/context has resulted in it being called?
Cons:
Slow/ CPU intensive to parse.
Makes it difficult for PLs to apply libxml manipulations to create
new documents or amend existing ones.
Mechanism for defining collection of parser settings (in a table? -but
maybe copied for efficiency into a structure when first required by a
query?)
2. Support for other parsers
2. XQuery
Expat may not be the best choice as a parser because a new parser
instance is needed for each document i.e. all the handlers must be set
again for each document. Another parser may have a more efficient way
of parsing a set of documents identically.
I'm not sure if the addition of XQuery would be best as a function or
as a new front-end parser. This is one to think about, but with a
decent implementation of XPath, one of the prerequisites is covered.
3. XPath support
3. DOM Interfaces
Proper XPath support. I really need to sit down and plough
through the specification...
Expose more aspects of the DOM to user functions/ PLs. This would
allow a procedure in a PL to run some queries and then use exposed
interfaces to libxml to create an XML document out of the query
results. I accept the argument that this might be more properly
performed on the client side.
The very simple text comparison system currently used is too
basic. Need to convert the path to an ordered list of nodes. Each node
is an element qualifier, and may have a list of attribute
qualifications attached. This probably requires lex/yacc combination.
(James Clark has written a yacc grammar for XPath). Not all the
features of XPath are necessarily relevant.
4. Returning sets of documents from XPath queries.
An option to return subdocuments (i.e. subelements AND cdata, not just
cdata). This should maybe be the default.
4. Multiple occurrences of elements.
This section is all very sketchy, and has various weaknesses.
Although the current implementation allows you to amalgamate the
returned results into a single document, it's quite possible that
you'd like to use the returned set of nodes as a source for FROM.
Is there a good way to optimise/index the results of certain XPath
operations to make them faster?:
select docid, pgxml_xpath(document,'/site/location',1) as location
where pgxml_xpath(document,'/site/name',1) = 'Church Farm';
select docid, pgxml_xpath(document,'//site/location/text()','','') as location
where pgxml_xpath(document,'//site/name/text()','','') = 'Church Farm';
and with multiple element occurrences in a document?
select d.docid, pgxml_xpath(d.document,'/site/location',1)
select d.docid, pgxml_xpath(d.document,'//site/location/text()','','')
from docstore d,
pgxml_xpaths('docstore','document','feature/type','docid') ft
pgxml_xpaths('docstore','document','//feature/type/text()','docid') ft
where ft.key = d.docid and ft.value ='Limekiln';
pgxml_xpaths params are relname, attrname, xpath, returnkey. It would
@ -71,10 +61,15 @@ defined by relname and attrname.
The pgxml_xpaths function could be the basis of a functional index,
which could speed up the above query very substantially, working
through the normal query planner mechanism. Syntax above is fragile
through using names rather than OID.
through the normal query planner mechanism.
5. Return type support.
Better support for returning e.g. numeric or boolean values. I need to
get to grips with the returned data from libxml first.
John Gray <jgray@azuli.co.uk>
John Gray <jgray@azuli.co.uk> 16 August 2001