mirror of https://gitlab.gnome.org/GNOME/libxml2.git synced 2026-01-28 10:01:00 +03:00
4004 Commits

Author SHA1 Message Date
Daniel Veillard
2a1d2422a4 Convert catalog code to the new input buffers
Only one place where the buffer fields were accessed directly
2012-07-23 14:24:27 +08:00
Daniel Veillard
53aa293dd3 Convert C14N to the new Input buffer
one case of direct access cleaned up
2012-07-23 14:24:27 +08:00
Daniel Veillard
a6a6e70c47 Convert xmlIO.c to the new input and output buffers
Relatively mechanical changes, this also led to a couple of fixes
upon review of the I/O code on buffer usage.
2012-07-23 14:24:26 +08:00
Daniel Veillard
768eb3b82d Convert XML parser to the new input buffers
The main changes are where the internals of the buffer structure
were addressed directly; we now use routines coming from buf.h.
The routine xmlParserInputRead() which wasn't used anywhere is
deprecated too.
2012-07-23 14:24:26 +08:00
Daniel Veillard
65c7d3b2e6 Incompatible change to the Input and Output buffers
Since the whole set of structures was public, the only way
to switch to size_t clean buffer is to introduce an incompatible
API change. Modifying the xmlParserInputBuffer and xmlOutputBuffer
structures is the best place to make this change as those
structures are deep into the parser feeding data, and no public
API suggests building those manually.
2012-07-23 14:24:26 +08:00
Daniel Veillard
18d0db2503 Adding new encoding function to deal with the new structures
* encoding.c: adds xmlCharEncFirstLineInput, xmlCharEncInput and
  xmlCharEncOutput
* enc.h: the functions are not made public but added to this new header
2012-07-23 14:24:26 +08:00
Daniel Veillard
ade10f2c57 Convert XPath to xmlBuf
Easy as no buffer was exported in the APIs
2012-07-23 14:24:26 +08:00
Daniel Veillard
bca22f40c3 Adding a new buf module for buffers
This also adds converter functions between xmlBuf and xmlBuffer
* buf.c buf.h: the old xmlBuffer routines but modified for size_t
  and using xmlBuf instead of xmlBuffer
* Makefile.am: add the 2 new files
* include/libxml/xmlerror.h: add an entry for the new module
* include/libxml/tree.h: expose the xmlBufPtr type but not the
  structure, which stays private
2012-07-23 14:24:26 +08:00
Daniel Veillard
4629ee02ac Do not fetch external parsed entities
Unless explicitly asked for when validating or replacing entities
with their value. Problem pointed out by Tom Lane <tgl@redhat.com>

* parser.c: do not load external parsed entities unless needed
* test/errors/extparsedent.xml result/errors/extparsedent.xml*:
  add a regression test to avoid change of the behaviour in the future
2012-07-23 14:15:40 +08:00
Aron Xu
baaf03f80f Fix an error in previous commit 2012-07-20 15:41:34 +08:00
Daniel Veillard
4f9fdc709c Fix entities local buffers size problems 2012-07-18 17:54:05 +08:00
Daniel Veillard
459eeb9dc7 Fix parser local buffers size problems 2012-07-18 17:54:04 +08:00
Daniel Veillard
740cb1a450 Memory error within SAX2 reuse common framework
There is no reason for that class of errors to not use
the same handling allowing structured error processing.
2012-07-18 17:48:32 +08:00
Daniel Veillard
c508fa3f0b Fix a failure to report xmlreader parsing failures
Related to https://bugzilla.gnome.org/show_bug.cgi?id=654567
the problem is that the provided patch failed to raise an error
on xmlTextReaderRead() return when an actual parsing error occurred
2012-07-18 17:48:06 +08:00
Daniel Veillard
549f06a8bd Expand .gitignore with more files 2012-07-11 15:21:12 +08:00
Daniel Veillard
8fc913fcc9 Fix compilation on older Visual Studio
For https://bugzilla.gnome.org/show_bug.cgi?id=666491

Reported by Matt Budd <matt.budd@gmail.com>, the added support
for VS 2010 broke the older versions 2005 and 2008 because it assumed
some of the defines were present in all versions; fix that
to check the version of VS
2012-06-06 11:29:29 +08:00
Daniel Veillard
2e1eaca637 Fix xmllint --xpath node initialization
By default it's more sensible to initialize it to the document itself
than the root element
2012-05-25 16:44:20 +08:00
Daniel Veillard
c943f708f1 Release of libxml2-2.8.0
- Makefile.am: don't package .git
- configure.in : update to new release
- doc/xml.html: added the new release
- doc/* testapi.c: regenerated
v2.8.0
2012-05-23 17:10:59 +08:00
Daniel Veillard
22030ef888 Restore code for Windows compilation
Try to keep as close to rc1 as possible but still allow the change from Roumen for
mingw
2012-05-23 15:52:45 +08:00
Daniel Veillard
ee8f1d4cda Cleanups before 2.8.0-rc2
new symbols, a missing comment and a fix on symbol release
v2.8.0-rc2
2012-05-21 11:16:12 +08:00
Roumen Petrov
978ff224b2 use mingw C99 compatible functions {v}snprintf instead of those from MSVC runtime 2012-05-21 10:20:09 +08:00
Daniel Veillard
f27c6683e6 New symbols added for the next release 2012-05-21 10:20:09 +08:00
Daniel Veillard
59df1e4f92 Avoid an extra operation
In the catalog code, tsan also complained of testing
the variable without locking and that was done a few lines below
2012-05-21 10:19:21 +08:00
Daniel Veillard
d495e6a845 Part for rand_r checking missing
Forgot to push that change in previous commit
2012-05-20 20:48:34 +08:00
Daniel Veillard
379ebc1d77 Cleanup on randomization
tsan reported that rand() is not thread safe, so create
a thread safe wrapper, use rand_r() if available.
Consolidate the function, initialization and cleanup in
dict.c and make sure it is initialized in xmlInitParser()
2012-05-18 15:41:31 +08:00
Andy Lutomirski
9d9685ad88 xmlTextReader bails too quickly on error
For https://bugzilla.gnome.org/show_bug.cgi?id=654567
I use xmlTextReader to parse files that might be incomplete.  These files are
the beginning of a well-formed file, but the end is missing so the file as a
whole is not well-formed.

The problem is that xmlTextReader starts returning errors when it encounters
the early EOF, even though I haven't finished reading all of the valid data in
the file.  It would be helpful if xmlTextReader kept working until the very
end.
v2.8.0-rc1
2012-05-15 20:10:25 +08:00
Pacho Ramos
1ea6b14125 Fix undefined reference in python module
For https://bugzilla.gnome.org/show_bug.cgi?id=622023
when compiled with LDFLAGS="${LDFLAGS} -Wl,-z,-defs -Wl,--no-undefined"
the python module would fail due to the undefined reference. This adds an
explicit reference to python lib.
2012-05-15 19:36:02 +08:00
Daniel Veillard
0d51cfebc9 Fix a race in xmlNewInputStream
For https://bugzilla.gnome.org/show_bug.cgi?id=643148
Reported by Bill Clarke <llib@computer.org>, it used a global variable
as a counter for the input id and this was not thread safe. To avoid the
race without adding unneeded locking in the parser path, move the id to
the parser context instead.
2012-05-15 11:18:40 +08:00
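To illustrate the fix described in 0d51cfebc9, here is a minimal sketch of moving an id counter out of a global variable and into per-parser state. The names `parser_ctx`, `input_stream` and `new_input_stream` are hypothetical stand-ins chosen for this example; they are not libxml2's actual structures or functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-parser state: each parser owns its own counter,
 * so no global variable is shared between threads. */
typedef struct {
    int next_input_id;
} parser_ctx;

/* Hypothetical input stream carrying the id it was assigned. */
typedef struct {
    int id;
} input_stream;

/* Take the next id from the context rather than from a global counter. */
input_stream new_input_stream(parser_ctx *ctx)
{
    input_stream in;

    assert(ctx != NULL);
    in.id = ctx->next_input_id++;
    return in;
}
```

Under the assumption that each thread works on its own context, the increment needs no lock, which matches the commit's stated goal of avoiding extra locking in the parser path.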
Noam
9313ae8517 Fix weird streaming RelaxNG errors
For https://bugzilla.gnome.org/show_bug.cgi?id=512454
The bug was to use compiled deterministic automata when
the content model was found to be non-deterministic, leading
to random parsing errors.
2012-05-15 11:03:46 +08:00
Daniel Veillard
94431ecba6 Fix various bugs in new code raised by the API checking
* testapi.c: regenerated and covering new APIs
* tree.c: xmlBufferDetach can't work on immutable buffers
* xzlib.c: fix a deallocation error
2012-05-15 10:45:05 +08:00
Daniel Veillard
79ee284abb Fix various problems with "make dist"
* tree.c: missing documentation for xmlBufferDetach
* doc/symbols.xml: add two new symbols xmlTextReaderRelaxNGValidateCtxt
                   and xmlBufferDetach
* doc/apibuild.py: ignore internal header xzlib.h
2012-05-15 10:25:31 +08:00
Daniel Veillard
9f3cdef08a Fix a memory leak in the xzlib code
The freeing function wasn't called due to a bogus #ifdef surrounding
value. Also switch the code to use the normal libxml2 allocation and
freeing routines.
2012-05-15 09:38:13 +08:00
Conrad Irwin
7d0d2a50ac Use a hybrid allocation scheme in xmlNodeSetContent
On Fri, May 11, 2012 at 9:10 AM, Daniel Veillard <veillard@redhat.com> wrote:
>  Hi Conrad,
>
> that's interesting ! I was initially afraid of a sudden explosion of
> memory allocations for building a tree since by default buffers tend to
> "waste" memory by using doubling allocations, but that's not the case.
>  xmllint --noout doc/libxml2-api.xml
> when compiled with memory debug produce
>
> paphio:~/XML -> cat .memdump
>      MEMORY ALLOCATED : 0, MAX was 12756699
>
> and without your patch 12755657, i.e. the increase is minimal.

Heh, I thought that too. Actually you're looking at the result with XML_ALLOC_EXACT! This
is because EXACT adds 10bytes "spare" on each alloc, and that interestingly wastes about the
same amount of space as XML_ALLOC_DOUBLEIT on this example (see below).

So it turns out that the default realloc() on my system actually handles this case really
well — and I guess that all the time in xmlRealloc() was actually in xmlStrlen, not the
underlying realloc() after all (sorry for misleading you). If you replace the realloc()
with a bad one (like valgrind's), then the performance degrades severely.

This patch implements a HYBRID allocator which has the behaviour you describe (it's
like EXACT to start with, though without the spare 10 bytes; and switches to DOUBLEIT
after 4kb) — that gets the memory back down to 12755657, with no noticeable impact on the
performance of the synthetic pathological example under valgrind.

In summary:

     max_memory on ./xmllint --noout doc/libxml2-api.xml,
     valgrind time on https://gist.github.com/2656940

            max_memory    valgrind time
before   |  12755657    | 29:18.2
EXACT    |  12756699    |  2:58.6 <-- this is the state after the first patch.
DOUBLEIT |  12756727    |  0:02.7
HYBRID   |  12755754    |  0:02.7 <-- this is the state with both patches.

>
> There is also the cost of creating the buffers all the time.
> I need to read the code and check but I may be interested in a hybrid
> approach where we switch to buffer only when the text node starts to
> become too big (4k would remove nearly all usual types of "document"
> usage, i.e. not blocks of data)

I tried to avoid too much buffer creation by introducing the xmlBufferDetach function,
which allows re-using one buffer to construct many strings. It's maybe a bit of a "hack"
in API terms though I thought the gains would be worth it.

Conrad

------8<------

To keep memory usage tight in normal conditions it's desirable to only
allocate as much space as is needed. Unfortunately this can lead to
problems when constructing a long string out of small chunks, because
every chunk you add will need to resize the buffer.

To fix this XML_ALLOC_HYBRID will switch (when the buffer is 4kb big)
from using exact allocations to doubling buffer size every time it is
full. This limits the number of buffer resizes to O(log n) (down from
O(n)), and thus greatly increases the performance of constructing very
large strings in this manner.
2012-05-14 14:18:58 +08:00
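The growth policy described in 7d0d2a50ac can be sketched as a small simulation that counts buffer resizes. This is an illustration only: `grow_policy`, `next_capacity`, `count_resizes` and the `HYBRID_THRESHOLD` constant are names assumed for this example, and the code does not reproduce libxml2's implementation (for instance, it omits the 10 spare bytes that the message says EXACT adds).

```c
#include <assert.h>
#include <stddef.h>

/* Assumed switch point, mirroring the 4kb figure in the commit message. */
#define HYBRID_THRESHOLD 4096

/* Policies named after the schemes discussed in the commit message. */
typedef enum { GROW_EXACT, GROW_DOUBLEIT, GROW_HYBRID } grow_policy;

/* Return the new capacity for a buffer that must hold `needed` bytes. */
size_t next_capacity(grow_policy policy, size_t capacity, size_t needed)
{
    size_t cap;

    assert(needed > capacity);
    if (policy == GROW_EXACT ||
        (policy == GROW_HYBRID && capacity < HYBRID_THRESHOLD))
        return needed;          /* allocate exactly what is required */

    /* Doubling: keep doubling until the request fits. */
    cap = capacity ? capacity : 1;
    while (cap < needed)
        cap *= 2;
    return cap;
}

/* Count the resizes needed to append `chunks` pieces of `chunk_len` bytes. */
size_t count_resizes(grow_policy policy, size_t chunks, size_t chunk_len)
{
    size_t capacity = 0, used = 0, resizes = 0, i;

    for (i = 0; i < chunks; i++) {
        if (used + chunk_len > capacity) {
            capacity = next_capacity(policy, capacity, used + chunk_len);
            resizes++;
        }
        used += chunk_len;
    }
    return resizes;
}
```

Calling `count_resizes(GROW_EXACT, n, 80)` resizes once per chunk, while the hybrid policy resizes once per chunk only until the buffer reaches the threshold and then grows logarithmically, which is the O(n) versus O(log n) difference the message describes.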
Conrad Irwin
7d553f834e Use buffers when constructing string node lists.
Hi Veillard and all,

Firstly, thanks for libxml: it's awesome!

I noticed recently that libxml was taking a surprisingly long time to perform some
operations (many minutes instead of milliseconds), and so I did some digging. It turns out
that the problem was caused by the realloc()ing done in xmlNodeAddContentLen() which can
be called many (many) times when assigning some content into a node.

For background, I'm dealing with XML that contains emails, these can have large
attachments (~6MB) which are base-64 encoded, line-wrapped at 78 chars, and each line ends
with &#13;. This means that xmlNodeAddContentLen() is being called about 200,000 times,
and so there are 200,000 reallocs of a 6MB string, which takes a while... (I put a synthetic
example of this at https://gist.github.com/2656940)

The attached patch works around that problem by using the existing buffer API to merge the
strings together before even creating the text node, this keeps the number of realloc()s
at a manageable level.

I'd love feedback on the patch, and am happy to fix problems with it, or explore other
solutions if you think that this is barking up the wrong tree :).

Thanks,

Conrad

P.S. Should I create a bug for this too?

------8<------

Before this change xmlStringGetNodeList would perform a realloc() of the
entire new content for every XML entity in the assigned text in order to
merge together adjacent text nodes. This had the effect of making
xmlNodeSetContent O(n^2), which led to unexpectedly bad performance on
inputs that contained a large number of XML entities.

After this change the memory management is done by the buffer API,
avoiding the need to continually re-measure and realloc() the string.

For my test data (6MB of 80 character lines, each ending with &#13;)
this takes the time to xmlNodeSetContent from about 500 seconds to
around 50ms. I have not profiled smaller cases, though I tried to
minimize the performance impact of my change by avoiding unnecessary
string copying.

Signed-off-by: Conrad Irwin <conrad.irwin@gmail.com>
2012-05-14 13:51:30 +08:00
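The idea behind 7d553f834e, accumulating chunks in a buffer that tracks its own length instead of re-measuring and reallocating the whole string for every piece, can be sketched as follows. `text_buf`, `text_buf_append` and `join_chunks` are hypothetical names for this example; libxml2's own buffer API is not used or reproduced here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer: `used` is tracked, so appending never
 * needs to rescan the existing content to find its length. */
typedef struct {
    char  *data;
    size_t used;
    size_t cap;
} text_buf;

/* Append `len` bytes of `s`; returns 0 on success, -1 on allocation failure. */
int text_buf_append(text_buf *b, const char *s, size_t len)
{
    if (b->used + len + 1 > b->cap) {
        size_t cap = b->cap ? b->cap : 64;
        char *p;

        while (cap < b->used + len + 1)
            cap *= 2;           /* amortized growth: few reallocations */
        p = realloc(b->data, cap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = cap;
    }
    memcpy(b->data + b->used, s, len);
    b->used += len;
    b->data[b->used] = '\0';
    return 0;
}

/* Join `n` NUL-terminated chunks into one newly allocated string.
 * The caller frees the result; NULL is returned on allocation failure. */
char *join_chunks(const char *const *chunks, size_t n)
{
    text_buf b = { NULL, 0, 0 };
    size_t i;

    assert(chunks != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (text_buf_append(&b, chunks[i], strlen(chunks[i])) != 0) {
            free(b.data);
            return NULL;
        }
    }
    if (b.data == NULL)
        return calloc(1, 1);    /* no chunks: return an empty string */
    return b.data;
}
```

Each chunk is measured once and copied once, so the total work is linear in the output size rather than quadratic.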
Denis Pauk
a0cd075d94 HTML parser error with <noscript> in the <head>
For https://bugzilla.gnome.org/show_bug.cgi?id=615785
When the <noscript> is found, <head> is closed and a <body> element is created.
The real <body id="xxx"> gets skipped over, so I can't see any of the
body's attributes.
Just don't close <head> when encountering a <noscript>
Add a regression test too
2012-05-11 19:31:12 +08:00
Remi Gacogne
4609e6c980 XSD: optional element in complex type extension
For https://bugzilla.gnome.org/show_bug.cgi?id=609796
Libxml2 fails to validate an instance document against a schema if an element
whose type is a complex extension of some base type with an optional child
element and that child element is not specified in the instance document.  For
example, suppose I have some complex type BaseType that is defined to have one
child element in a sequence group that has minOccurs set to 0
2012-05-11 15:31:05 +08:00
Daniel Veillard
39d027cdb7 Fix html serialization error and htmlSetMetaEncoding()
For https://bugzilla.gnome.org/show_bug.cgi?id=630682
The python tests were reporting errors, some of it was due to
a small change in case encoding, but the main one was about
htmlSetMetaEncoding(doc, NULL) being broken by not removing
the associated meta tag anymore
2012-05-11 12:38:23 +08:00
Daniel Veillard
2c437da7f0 Fix a wrong return value in previous patch 2012-05-11 12:08:15 +08:00
Daniel Veillard
ed35d3d7c3 Fix an uninitialized variable use
When compiled without SAX1 support
2012-05-11 10:52:27 +08:00
Brandon Slack
0c7109c81f Fix a compilation problem with --minimum
For https://bugzilla.gnome.org/show_bug.cgi?id=636750
Moved a #endif /* LIBXML_OUTPUT_ENABLED */ a few lines down
to avoid referencing an undefined variable
2012-05-11 10:50:59 +08:00
Daniel Veillard
399aaba14b Remove redundant and unguarded include of resolv.h
For https://bugzilla.gnome.org/show_bug.cgi?id=617053
This broke the build on Interix-6.0
2012-05-11 10:09:32 +08:00
Christian Dywan
040dcb5995 Remove git error message during configure
For https://bugzilla.gnome.org/show_bug.cgi?id=635531
If git is not installed but .git was found configure would emit an
error message
2012-05-10 22:55:07 +08:00
Patrick R. Gansterer
023206fc08 xmllint: Build fix for endTimer if !defined(HAVE_GETTIMEOFDAY)
For https://bugzilla.gnome.org/show_bug.cgi?id=638649
code was broken !
2012-05-10 22:17:51 +08:00
John Hein
a4fe9b26d3 Remove a bashism in configure.in
Not portable, broke on old FreeBSD
2012-05-10 22:12:46 +08:00
Shaun McCance
4cf7325e1f xinclude with parse="text" does not use the entity loader
For https://bugzilla.gnome.org/show_bug.cgi?id=552479

The code for xinclude parse="text" was not using the registered
entity loader, defeating attempts to control loading of files.
2012-05-10 20:59:33 +08:00
Denis Pauk
fdf990c2ef Allow to parse 1 byte HTML files
For https://bugzilla.gnome.org/show_bug.cgi?id=605740

Files 1 byte long were not accepted by the HTML push parser
2012-05-10 20:40:49 +08:00
Patrick R. Gansterer
204f1f144c undef ERROR if already defined 2012-05-10 20:24:00 +08:00
Martin Schröder
b91111b475 Patch that fixes the skipping of the HTML_PARSE_NOIMPLIED flag
For https://bugzilla.gnome.org/show_bug.cgi?id=642916

I just noticed that the HTML_PARSE_NOIMPLIED flag that you can pass to the
HTML-Parser methods doesn't do anything. Its intended purpose is to stop the
HTML-parser from forcibly adding a pair of html/body tags if the stream does
not contain any.

This is highly useful when you don't need this level of strictness.
Unfortunately, specifying it doesn't work, because the option is not
copied into the parsing context.
2012-05-10 18:52:37 +08:00
Lin Yi-Li
24464be639 Avoid memory leak if xmlParserInputBufferCreateIO fails
For https://bugzilla.gnome.org/show_bug.cgi?id=643949

In case of error on an IO creation input the given context
is terminated with the given close function, except if the
error happened in xmlParserInputBufferCreateIO. This can
lead to a resource leak which is fixed by this patch.
2012-05-10 16:14:55 +08:00
Denis Pauk
868d92da89 Add HTML parser support for HTML5 meta charset encoding declaration
For https://bugzilla.gnome.org/show_bug.cgi?id=655218

http://www.w3.org/TR/2011/WD-html5-20110525/semantics.html#the-meta-element

"""
The charset attribute specifies the character encoding used by the document.
This is a character encoding declaration. If the attribute is present in an XML
document, its value must be an ASCII case-insensitive match for the string
"UTF-8" (and the document is therefore forced to use UTF-8 as its
encoding).
"""

However, while <meta http-equiv="Content-Type" content="text/html;
charset=utf8"> works, <meta charset="utf8"> does not.

While libxml2 HTML parser is not tuned for HTML5, this is a simple
addition

Also added a testcase
2012-05-10 15:34:57 +08:00