looping 1000 time on an error stating that a nodeset has
grown out of control is useless, make sure we percolate
error up to the various loops and break when errors occurs
Example xmlXPathNormalizeFunction() would do CHECK_ARITY(1)
and the expect valuePop(ctxt); to return an object, except
now valuePop() looks at the XPath stack frames and fails returning
NULL, and we end up crashing dereferencing the object.
Real solution is to exten CHECK_ARITY() and recompile all
XPath functions using it.
As suggested by Andrew W. Nosenko:
Proposal: expose the new xmlBufShrink() to the "public" API for
compatibility with xmlBufUse().
Reason: the following scenario:
1. Read something into xmlParserInputBuffer (e.g. using
xmlParserInputBufferRead())
2. Extract content through xmlBufContent()
3. Extract content length through xmlBufUse(). Result have type
'size_t'.
4. Use this content
5. Now, you need to shrink the buffer. How to do it? Doing that
through legacy xmlBufferShrink() is unsafe because it uses 'unsigned
int' and the whole point of introducing the new API was handling the
cases, when 'unsigned int' is not enough. Therefore, need to use the
new xmlBufShrink(). But it is "private".
Therefore, I propose to expose the new xmlBufShrink() in the same way,
as xmlBufContent() and xmlBufUse() are exposed.
Things now work correctly at the xmllint level:
thinkpad:~/XML -> xmllint --sax --noout --schema test_schema.xsd
test_xml.xml
test_xml.xml:72721: Schemas validity error : Element 'level1': Missing
child element(s). Expected is ( level2 ).
test_xml.xml fails to validate
thinkpad:~/XML -> xmllint --stream --schema test_schema.xsd test_xml.xml
test_xml.xml:72721: Schemas validity error : Element 'level1': Missing
child element(s). Expected is ( level2 ).
test_xml.xml fails to validate
thinkpad:~/XML ->
* error.c: fix a corner case of not reporting lines when we should
* include/libxml/xmlschemas.h doc/symbols.xml: had to add new entry
points to set the filename on a validation context and a locator
callback used to fetch the line and file from the context
* xmlschemas.c: add the new entry points xmlSchemaValidateSetFilename()
and xmlSchemaValidateSetLocator(), plus make sure the error reporting
routine gets the information if available. Add a locator for SAX.
* xmlreader.c: add and plug a locator for readers.
Fix the lack of line number as reported by Johan Corveleyn <jcorvel@gmail.com>
* parser.c include/libxml/parser.h: add an XML_PARSE_BIG_LINES parser
option not switch on by default, it's an opt-in
* SAX2.c: if XML_PARSE_BIG_LINES is set store the long line numbers
in the psvi field of text nodes
* tree.c: expand xmlGetLineNo to extract those informations, also
make sure we can't fail on recursive behaviour
* error.c: in __xmlRaiseError, if a node is provided, call
xmlGetLineNo() if we can't get a valid line number.
* xmllint.c: switch on XML_PARSE_BIG_LINES in xmllint
Various cleanups
* configure.in: force regeneration of APIs in my environment
* buf.c buf.h enc.h encoding.c include/libxml/tree.h
include/libxml/xmlerror.h save.h tree.c: various comment cleanups
pointed by apibuild
* doc/apibuild.py: added the 3 new internal headers in the excludes
* doc/libxml2-api.xml doc/libxml2-refs.xml: regenerated the API
* doc/symbols.xml: listing new entry points for 2.9.0
* doc/devhelp/*: regenerated
Makefile.am:
* Don't use @VAR@, use $(VAR). Autoconf's AC_SUBST provides us the Make
variable, it allows overriding the value at the command line, and
(notably) it avoids a Make parse error in the libxml2_la_LDFLAGS
assignment when @MODULE_PLATFORM_LIBS@ is empty
* Changed how the THREADS_W32 mechanism switches the build between
testThreads.c and testThreadsWin32.c as appropriate; using AM_CONDITIONAL
allows this to work cleanly and plays well with dependencies
* testapi.c should be specified as BUILT_SOURCES
* Create symlinks to the test/ and result/ subdirs so that the runtests
target is usable in out-of-source-tree builds
* Don't do MAKEFLAGS+=--silent as this is not portable to non-GNU Makes
* Fixed incorrect find(1) syntax in the "cleanup" rule, and doing "rm -f"
instead of just "rm" is good form
* (DIST)CLEANFILES needed a bit more coverage to allow "make distcheck" to
pass
configure.in:
* Need AC_PROG_LN_S to create test/ and result/ symlinks in Makefile.am
* AC_LIBTOOL_WIN32_DLL and AM_PROG_LIBTOOL are obsolete; these have been
superceded by LT_INIT
* Don't rebuild docs by default, as this requires GNU Make (as
implemented)
* Check for uint32_t as some platforms don't provide it
* Check for some more functions, and undefine HAVE_MMAP if we don't also
HAVE_MUNMAP (one system I tested on actually needed this)
* Changed THREADS_W32 from a filename insert into an Automake conditional
* The "Copyright" file will not be in the current directory if builddir !=
srcdir
doc/Makefile.am:
* EXTRA_DIST cannot use wildcards when they refer to generated files; this
breaks dependencies. What I did was define EXTRA_DIST_wc, which uses GNU
Make $(wildcard) directives to build up a list of files, and EXTRA_DIST,
as a literal expansion of EXTRA_DIST_wc. I also added a new rule,
"check-extra-dist", to simplify checking that the two variables are
equivalent. (Note that this works only when builddir == srcdir)
(I can implement this differently if desired; this is just one way of
doing it)
* Don't define an "all" target; this steps on Automake's toes
* Fixed up the "libxml2-api.xml ..." rule by using $(wildcard) for
dependencies (as Make doesn't process the wildcards otherwise) and
qualifying appropriate files with $(srcdir)
(Note that $(srcdir) is not needed in the dependencies, thanks to VPATH,
which we can count on as this is GNU-Make-only code anyway)
doc/devhelp/Makefile.am:
* Qualified appropriate files with $(srcdir)
* Added an "uninstall-local" rule so that "make distcheck" passes
doc/examples/Makefile.am:
* Rather than use a wildcard that doesn't work, use a substitution that
most Make programs can handle
doc/examples/index.py:
* Do the same here
include/libxml/nanoftp.h:
* Some platforms (e.g. MSVC 6) already #define INVALID_SOCKET:
user@host:/cygdrive/c/Program Files/Microsoft Visual Studio/VC98/\
Include$ grep -R INVALID_SOCKET .
./WINSOCK.H:#define INVALID_SOCKET (SOCKET)(~0)
./WINSOCK2.H:#define INVALID_SOCKET (SOCKET)(~0)
include/libxml/xmlversion.h.in:
* Support ancient GCCs (I was actually able to build the library with 2.5
but for this bit)
python/Makefile.am:
* Expanded CLEANFILES to allow "make distcheck" to pass
python/tests/Makefile.am:
* Define CLEANFILES instead of a "clean" rule, and added tmp.xml to allow
"make distcheck" to pass
testRelax.c:
* Use HAVE_MMAP instead of the less explicit HAVE_SYS_MMAN_H (as some
systems have the header but not the function)
testSchemas.c:
* Use HAVE_MMAP instead of the less explicit HAVE_SYS_MMAN_H
testapi.c:
* Don't use putenv() if it's not available
threads.c:
* This fixes the following build error on Solaris 8:
libtool: compile: cc -DHAVE_CONFIG_H -I. -I./include -I./include \
-D_REENTRANT -D__EXTENSIONS__ -D_REENTRANT -Dsparc -Xa -mt -v \
-xarch=v9 -xcrossfile -xO5 -c threads.c -KPIC -DPIC -o threads.o
"threads.c", line 442: controlling expressions must have scalar type
"threads.c", line 512: controlling expressions must have scalar type
cc: acomp failed for threads.c
*** Error code 1
trio.c:
* Define isascii() if the system doesn't provide it
trio.h:
* The trio library's HAVE_CONFIG_H header is not the same as LibXML2's
HAVE_CONFIG_H header; this change is needed to avoid a double-inclusion
win32/configure.js:
* Added support for the LZMA compression option
win32/Makefile.{bcb,mingw,msvc}:
* Added appropriate bits to support WITH_LZMA=1
* Install the header files under $(INCPREFIX)\libxml2\libxml instead of
$(INCPREFIX)\libxml, to mirror the install location on Unix+Autotools
xml2-config.in:
* @MODULE_PLATFORM_LIBS@ (usually "-ldl") needs to be in there in order for
`xml2-config --libs` to provide a complete set of dependencies
xmllint.c:
* Use HAVE_MMAP instead of the less-explicit HAVE_SYS_MMAN_H
To avoid digging into buf->buffer insternal strcuture the two
new entry points xmlOutputBufferGetContent() and
xmlOutputBufferGetSize() should make the ode cleaner.
* include/libxml/xmlIO.h: add two new functions
* xmlIO.c: impement the 2 functions based on the new buffer entry points
Now tree.h exports LIBXML2_NEW_BUFFER macro indicating that the
API uses the new buffers, important to keep code working with
both versions.
* tree.h buf.h: also export xmlBufContent(), xmlBufEnd(), and xmlBufUse()
to help port the old code
* buf.c: make sure the compatibility counters are updated on
buffer usage, to keep proper working of application compiled
against the old structures, but take care of int overflow
Those can be overrided by the XML_PARSE_HUGE option, they
are just default limits for Name lenght, dictionary size limits
and maximum amount of parser lookup.
* include/libxml/parserInternals.h: define the limits
* include/libxml/xmlerror.h: add a new error
* parser.c parserInternals.c: implements the new limits
* include/libxml/dict.h dict.c: adding 2 new functions xmlDictGetUsage
and xmlDictSetLimit allowing to review the amount of memory allocated
for dictionary strings. Aslo cleanup of various signed int used as
size values in the code.
* uri.c: cleanup the code doing the allocations, set up a structured
error handler to report memory errors, and set up an abitrary
limit on URI saving size
* error.c include/libxml/xmlerror.h: add a new FROM_URI indication
for structured error reporting, also adding strings for schematron
and buffer which were missing
* include/libxml/tree.h: adds xmlBufGetNodeContent and xmlBufNodeDump
as xmlBuf based equivalents of xmlNodeGetContent and xmlNodeDump
* tree.c: implements one new routine and converts xmlNodeBufGetContent
to use the xmlBuf equivalent. It should behave better as a result
in case of data larger than 2GB.
Since the whole set of structures was public, the only way
to switch to size_t clean buffer is to introduce an incompatible
API change. Modifying the xmlParserInputBuffer and xmlOutputBuffer
structures is the best place to make this change as those
structures are deep into the parser feeding data, and no public
API suggest to build those manually.
This also add converter functions between xmlBuf and xmlBuffer
* buf.c buf.h: the old xmlBuffer routines but modified for size_t
and using xmlBuf instead of xmlBuffer
* Makefile.am: add the 2 new files
* include/libxml/xmlerror.h: add an entry for the new module
* include/libxml/tree.h: expose the xmlBufPtr type but not the
structure which stay private
tsan reported that rand() is not thread safe, so create
a thread safe wrapper, use rand_r() if available.
Consolidate the function, initialization and cleanup in
dict.c and make sure it is initialized in xmlInitParser()
For https://bugzilla.gnome.org/show_bug.cgi?id=643148
Reported by Bill Clarke <llib@computer.org>, it used a global variable
as a counter for the input id and this was not thread safe. To avoid the
race without adding unneeded locking in the parser path, move the id to
the parser context instead.
On Fri, May 11, 2012 at 9:10 AM, Daniel Veillard <veillard@redhat.com> wrote:
> Hi Conrad,
>
> that's interesting ! I was initially afraid of a sudden explosion of
> memory allocations for building a tree since by default buffers tend to
> "waste" memory by using doubling allocations, but that's not the case.
> xmllint --noout doc/libxml2-api.xml
> when compiled with memory debug produce
>
> paphio:~/XML -> cat .memdump
> MEMORY ALLOCATED : 0, MAX was 12756699
>
> and without your patch 12755657, i.e. the increase is minimal.
Heh, I thought that too. Actually you're looking at the result with XML_ALLOC_EXACT! This
is because EXACT adds 10bytes "spare" on each alloc, and that interestingly wastes about the
same amount of space as XML_ALLOC_DOUBLEIT on this example (see below).
So it turns out that the default realloc() on my system actually handles this case really
well — and I guess that all the time in xmlRealloc() was actually in xmlStrlen, not the
underlying realloc() after all (sorry for misleading you). If you replace the realloc()
with a bad one (like valgrind's), then the performance degrades severely.
This patch implements a HYBRID allocator which has the behaviour you describe (it's
like EXACT to start with, though without the spare 10 bytes; and switches to DOUBLEIT
after 4kb) — that gets the memory back down to 12755657, with no noticeable impact on the
performance of the synthetic pathological example under valgrind.
In summary:
max_memory on ./xmllint --noout doc/libxml2-api.xml,
valgrind time on https://gist.github.com/2656940
max_memory valgrind time
before | 12755657 | 29:18.2
EXACT | 12756699 | 2:58.6 <-- this is the state after the first patch.
DOUBLEIT | 12756727 | 0:02.7
HYBRID | 12755754 | 0:02.7 <-- this is the state with both patches.
>
> There is also the cost of creating the buffers all the time.
> I need to read the code and check but I may be interested in an hybrid
> approach where we switch to buffer only when the text node starts to
> become too big (4k would remove nearly all usuall types of "document"
> usage, i.e. not blocks of data)
I tried to avoid too much buffer creation by introducing the xmlBufferDetach function,
which allows re-using one buffer to construct many strings. It's maybe a bit of a "hack"
in API terms though I thought the gains would be worth it.
Conrad
------8<------
To keep memory usage tight in normal conditions it's desirable to only
allocate as much space as is needed. Unfortunately this can lead to
problems when constructing a long string out of small chunks, because
every chunk you add will need to resize the buffer.
To fix this XML_ALLOC_HYBRID will switch (when the buffer is 4kb big)
from using exact allocations to doubling buffer size every time it is
full. This limits the number of buffer resizes to O(log n) (down from
O(n)), and thus greatly increases the performance of constructing very
large strings in this manner.
Hi Veillard and all,
Firstly, thanks for libxml: it's awesome!
I noticed recently that libxml was taking a surprisingly long time to perform some
operations (many minutes instead of milliseconds), and so I did some digging. It turns out
that the problem was caused by the realloc()ing done in xmlNodeAddContentLen() which can
be called many (many) times when assigning some content into a node.
For background, I'm dealing with XML that contains emails, these can have large
attachments (~6MB) which are base-64 encoded, line-wrapped at 78 chars, and each line ends
with . This means that xmlNodeAddContentLen() is being called about 200,000 times,
and so there are 200,000 reallocs of a 6MB string, which takes a while... (I put a synthetic
example of this at https://gist.github.com/2656940)
The attached patch works around that problem by using the existing buffer API to merge the
strings together before even creating the text node, this keeps the number of realloc()s
at a managable level.
I'd love feedback on the patch, and am happy to fix problems with it, or explore other
solutions if you think that this is barking up the wrong tree :).
Thanks,
Conrad
P.S. Should I create a bug for this too?
------8<------
Before this change xmlStringGetNodeList would perform a realloc() of the
entire new content for every XML entity in the assigned text in order to
merge together adjacent text nodes. This had the effect of making
xmlSetNodeContent O(n^2), which led to unexpectedly bad performance on
inputs that contained a large number of XML entities.
After this change the memory management is done by the buffer API,
avoiding the need to continually re-measure and realloc() the string.
For my test data (6MB of 80 character lines, each ending with )
this takes the time to xmlSetNodeContent from about 500 seconds to
around 50ms. I have not profiled smaller cases, though I tried to
minimize the performance impact of my change by avoiding unnecessary
string copying.
Signed-off-by: Conrad Irwin <conrad.irwin@gmail.com>
Since there is xmlTextReaderSchemaValidateCtxt() it seems like there
should be an equivalent RelaxNG function. The attached patch adds it.
The code is essentially the same as Schema implementation, but I'm
uncertain as to how to add things to the documentation and test suite:
there seems to be a lot of auto-generation going on.
For both XML and HTML, the document can provide an encoding
either in XMLDecl in XML, or as a meta element in HTML head.
This adds options to ignore those encodings if the encoding
is known in advace for example if the content had been converted
before being passed to the parser.
* parser.c include/libxml/parser.h: add XML_PARSE_IGNORE_ENC option
for XML parsing
* include/libxml/HTMLparser.h HTMLparser.c: adds the
HTML_PARSE_IGNORE_ENC for HTML parsing
* HTMLtree.c: fix the handling of saving when an unknown encoding is
defined in meta document header
* xmllint.c: add a --noenc option to activate the new parser options
In Windows 64 a socket is no more represented by an int,
this breaks the nanoftp API and nanoftp/nanohttp, the patch
changes this and fix the API for Win64
Regenerated the XML and documentation as a result too.
non destructive indentation option using spaces within markup
constructs and hence not modifying content
* include/libxml/xmlsave.h: new option
* xmlsave.c: some refactoring and new code for the new option
* xmllint.c: adds --pretty option where option 2 uses the new formatting
- include/libxml/HTMLparser.h: defines the new HTML parser option
HTML_PARSE_NODEFDTD
- HTMLparser.c: if option is set don't add a default DTD
- xmllint.c: add the corresponding --nodefdtd option in xmllint
* HTMLparser.c: new htmlParseElementInternal non recursive, with
htmlParseContentInternal and new function to handle node info
and element end.
* include/libxml/parser.h: add new stack for element info in parser
context
* parserInternals.c: fee element info stack
- include/libxml/xmlexports.h: restore export decoration otherwise
xsltproc and xmlsec crash
- libxml.h: define LIBXML_STATIC for static build
- configure.in: enable modules support for mingw* builds
- Makefile.am: flags for testdso if modules support enabled
xmlParseInNodeContext notices that the enclosing document is
an HTML document, so invoke the HTML parser for that fragment, and
the HTML parser finding a "<p>hello world!</p>" document automatically
augment it with defaulted <html> and <body>. This defaulting should
be turned off in the HTML parser for this to work, but there is no
such HTML parser option. There is an htmlOmittedDefaultValue global
variable that you could use, but really we should not rely on global
variable for processing options anymore, best is to add an
HTML_PARSE_NOIMPLIED.
* include/libxml/HTMLparser.h: add the HTML_PARSE_NOIMPLIED parser flag
* HTMLparser.c: do add implied element if HTML_PARSE_NOIMPLIED is set
* parser.c: add HTML_PARSE_NOIMPLIED to options for xmlParseInNodeContext
on HTML documents
* include/libxml/globals.h globals.c global.data: define a new global
variable (per thread) for structured error reporting, to not conflict
with generic one
* error.c: when defined use the structured error report over any generic
one
* HTMLparser.c: check when we see an head or a body tag and avoid
autogenerating them
* include/libxml/parser.h: the values for ctxt->html change depending
on the head or body tags being seen
* include/libxml/xmlexports.h: when compiling with mingw/MSYS or linking
to an precompiled library this _imp__xmlFree missing at runtime is a
common problem. Igor and various people faced it and this seems the
minimal fix for it, should resolve 590302 and 561340
* c14n.c include/libxml/c14n.h: adds support for C14N 1.1,
new flags at the API level
* runtest.c Makefile.am testC14N.c xmllint.c: add support in CLI
tools and test binaries
* result/c14n/1-1-without-comments/* test/c14n/1-1-without-comments/*:
add a new batch of tests
* include/libxml/parser.h include/libxml/xmlwriter.h
include/libxml/relaxng.h include/libxml/xmlversion.h.in
include/libxml/xmlwin32version.h.in include/libxml/valid.h
include/libxml/xmlschemas.h include/libxml/xmlerror.h: change
ATTRIBUTE_PRINTF into LIBXML_ATTR_FORMAT to avoid macro name
collisions with other packages and headers as reported by
Belgabor and Mike Hommey
daniel
svn path=/trunk/; revision=3827