Fix many params in internal functions (not really necessary but Doxygen
warns about that in XML mode).
Fix formatting in a few corner cases that automatic conversion can't
handle.
Rearrange some DOC_DISABLE blocks.
Always use what the old implementation called the "IO" allocation
scheme, allowing to move the content pointer past the initial
allocation. This is inexpensive and allows efficient shrinking.
Optimize xmlBufGrow, reusing shrunken memory as much as possible.
Simplify xmlBufAdd.
Make xmlBufBackToBuffer return an error on overflow.
Make "size" exclude the terminating NULL byte.
Always provide an initial size.
Reintroduce static buffers.
Remove xmlBufResize and several other functions.
Make xmlOpenCharEncodingHandler call xmlParseCharEncoding first so we
prefer our own handlers for names like "UTF8". Only UTF-16 needs an
exception.
Make callers check the return value. For UTF-8, a NULL encoding doesn't
mean an error.
Remove unnecessary UTF-8 check from htmlFindOutputEncoder. Don't try to
look up ASCII handler since the HTML handler is always available.
Fix return code of xmlParseCharEncoding.
Should fix#744.
Handle malloc failrue from xmlRaiseError.
Use xmlRaiseMemoryError.
Stop using xmlGenericError.
Remove argument from memory error handler.
Remove TODO macro.
Remove explicit integer casts as final operation
- in assignments
- when passing arguments
- when returning values
Remove casts
- to the same type
- from certain range-bound values
The main motivation is that these explicit casts don't change the result
of operations and only render UBSan's implicit-conversion checks
useless. Removing these casts allows UBSan to detect cases where
truncation or sign-changes occur unexpectedly.
Document some explicit casts as truncating and add a few missing ones.
Private functions were previously declared
- in header files in the root directory
- in public headers guarded with IN_LIBXML
- in libxml.h
- redundantly in source files that used them.
Consolidate all private header files in include/private.
Patch by J Pascoe of Apple.
* HTMLtree.c:
(htmlDocContentDumpFormatOutput):
- Prior to commit b79ab6e6d9, xmlDoc.type was set to
XML_HTML_DOCUMENT_NODE before dumping the HTML output, then
restored before returning.
Similar to 8f5710379, mark more static data structures with
`const` keyword.
Also fix placement of `const` in encoding.c.
Original patch by Sarah Wilkin.
The old, non-recursive HTML serialization code would always terminate
the output with a newline. The new implementation omitted the newline
if the document node had no children. Readd the newline when
serializing empty documents.
Fixes#266.
Make xmlNodeDumpOutput and htmlNodeDumpFormatOutput work with corrupted
parent pointers. This used to work with the old recursive code but the
non-recursive rewrite required parent pointers to be set correctly.
Unfortunately, lxml relies on the old behavior and passes subtrees with
a corrupted structure. Fall back to a recursive function call if an
invalid parent pointer is detected.
Fixes#255.
Check parent pointers for NULL after the non-recursive rewrite of the
serialization code. This avoids segfaults with corrupted documents
which can apparently be seen with lxml, see issue #187.
This reverts commit 960f0e2756.
This commit introduced
- an infinite loop, found by OSS-Fuzz, which could be easily fixed.
- an algorithm with quadratic runtime
- a security issue, see
https://bugzilla.gnome.org/show_bug.cgi?id=769760
A better approach is to add an option not to escape URLs at all
which libxml2 should have possibly done in the first place.
For https://bugzilla.gnome.org/show_bug.cgi?id=747301
Use simple HTML5 DOCTYPE for about:legacy-compat
HTML5 uses a DOCTYPE without a PUBLIC or SYSTEM identifier. It looks
like this:
<!DOCTYPE html>
I can't use XSLT to output this, because to get a DOCTYPE I have to
provide a PUBLIC or SYSTEM identifier. Luckily, the standards folks
recognized this and provided this semantically equivalent form for the
HTML DOCTYPE:
<!DOCTYPE html SYSTEM "about:legacy-compat">
But people don't like seeing the "legacy" identifier in their output.
They'd rather see the shiny new DOCTYPE. Since we know that
about:legacy-compat is defined by the W3C to be semantically equivalent
to the sans-SYSTEM DOCTYPE, we could just special-case it in the HTML
serializer in libxml2. So if you set the SYSTEM identifier to
"about:legacy-compat", you get an HTML5 short-form DOCTYPE.
Handle special cases of &{...} constructs as hinted in the spec
http://www.w3.org/TR/html401/appendix/notes.html#h-B.7.1
and special values as comment <!-- ... --> used for server side includes
This is limited to attribute values in HTML content.
For https://bugzilla.gnome.org/show_bug.cgi?id=630682
The python tests were reporting errors, some of it was due to
a small change in case encoding, but the main one was about
htmlSetMetaEncoding(doc, NULL) being broken by not removing
the associated meta tag anymore
For both XML and HTML, the document can provide an encoding
either in XMLDecl in XML, or as a meta element in HTML head.
This adds options to ignore those encodings if the encoding
is known in advace for example if the content had been converted
before being passed to the parser.
* parser.c include/libxml/parser.h: add XML_PARSE_IGNORE_ENC option
for XML parsing
* include/libxml/HTMLparser.h HTMLparser.c: adds the
HTML_PARSE_IGNORE_ENC for HTML parsing
* HTMLtree.c: fix the handling of saving when an unknown encoding is
defined in meta document header
* xmllint.c: add a --noenc option to activate the new parser options