It seems that the specification of the HTML output method in XSLT 1.0
had a lot of influence on how the HTML serializer in libxml2 ended up:
https://www.w3.org/TR/xslt-10/#section-HTML-Output-Method
There are two remaining behaviors suggested by XSLT 1.0 that don't match
the HTML5 fragment serialization algorithm:
We escape non-ASCII characters in URI attributes (the list of which is
probably outdated). This was originally recommended in appendix B of the
HTML 4.01 spec, but only for user agents:
https://www.w3.org/TR/html401/appendix/notes.html#h-B.2.1
From my experience, any tool that processes HTML should escape as little
as possible. For example, we used to escape many more characters which
are invalid in URIs, but often used in template languages. (Note that we
still escape whitespace and control chars.) Nevertheless, I guess that
some libxslt users continue to expect this behavior from libxml2.
Then we collapse Boolean attributes using an outdated list. This is
mostly a cosmetic issue, but a somewhat important one for libxslt users.
We probably need a serialization option for the xmlsave module that
enables fully HTML5-conformant output.
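
The URI attribute escaping described above can be illustrated with a
small sketch. It is not the libxml2 implementation: the function name
`escape_uri_attr` and the exact byte ranges are assumptions chosen to
match the description (non-ASCII bytes, whitespace and control
characters are percent-encoded, everything else is copied unchanged).

```c
#include <assert.h>
#include <stddef.h>

/* Assumed rule: escape space, control characters and non-ASCII bytes. */
static int needs_escape(unsigned char c) {
    return c <= 0x20 || c >= 0x7F;
}

/*
 * Hypothetical helper: percent-encode `in` into `out`.
 * Returns 0 on success, -1 if `out` is too small.
 */
static int escape_uri_attr(const char *in, char *out, size_t out_size) {
    static const char hex[] = "0123456789ABCDEF";
    size_t n = 0;

    assert(in != NULL && out != NULL);
    if (out_size == 0)
        return -1;

    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char) *in;
        size_t need = needs_escape(c) ? 3 : 1;

        /* Always keep one byte free for the terminator. */
        if (out_size - n <= need)
            return -1;

        if (need == 3) {
            out[n++] = '%';
            out[n++] = hex[c >> 4];
            out[n++] = hex[c & 0x0F];
        } else {
            out[n++] = (char) c;
        }
    }

    out[n] = '\0';
    return 0;
}
```

Characters such as `{` and `}` pass through untouched in this sketch,
which is the property that matters for template languages.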
Don't use encoding from meta tags when serializing. Only use the value
in `doc->encoding`, matching the XML serializer. This is the actual
encoding used when parsing.
Stop modifying the input document by setting meta tags before
serializing. Meta tags are now injected during serialization.
Add full support for <meta charset=""> which is also used when adding
meta tags.
Align with HTML5 and implement the "algorithm for extracting a character
encoding from a meta element". Only modify the encoding substring in
Content-Type meta tags.
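
As a rough illustration of the HTML5 extraction algorithm mentioned
above, the following sketch locates the encoding label inside the value
of a Content-Type meta tag. The function name `find_meta_charset` and
its interface are hypothetical and not taken from libxml2. Returning a
pointer and a length rather than a copy makes it possible to replace
only the encoding substring.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* ASCII whitespace as defined by HTML: TAB, LF, FF, CR, SPACE. */
static int is_html_space(char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
}

static char ascii_lower(char c) {
    return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
}

/* Case-insensitive prefix match against a lowercase ASCII keyword. */
static int starts_with_ci(const char *s, const char *kw) {
    while (*kw != '\0') {
        if (ascii_lower(*s) != *kw)
            return 0;
        s++;
        kw++;
    }
    return 1;
}

/*
 * Hypothetical helper: returns a pointer into `s` at the start of the
 * encoding label and stores its length in *len, or returns NULL if no
 * label is found.
 */
static const char *find_meta_charset(const char *s, size_t *len) {
    const char *p = s;
    const char *start;

    assert(s != NULL && len != NULL);

    for (;;) {
        /* Find the next case-insensitive occurrence of "charset". */
        while (*p != '\0' && !starts_with_ci(p, "charset"))
            p++;
        if (*p == '\0')
            return NULL;
        p += 7;

        while (is_html_space(*p))
            p++;
        if (*p != '=')
            continue;       /* not an assignment, keep searching */
        p++;
        break;
    }

    while (is_html_space(*p))
        p++;

    if (*p == '"' || *p == '\'') {
        /* Quoted label: needs a matching closing quote. */
        const char *end = strchr(p + 1, *p);
        if (end == NULL)
            return NULL;
        *len = (size_t) (end - (p + 1));
        return p + 1;
    }
    if (*p == '\0')
        return NULL;

    /* Unquoted label: runs until whitespace, ';' or end of string. */
    start = p;
    while (*p != '\0' && *p != ';' && !is_html_space(*p))
        p++;
    *len = (size_t) (p - start);
    return start;
}
```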
Only switch encoding once when parsing.
Fix htmlSaveFileFormat with a NULL encoding so that it doesn't declare a
misleading UTF-8 charset.
Fixes #909.
Fix the documentation of many params in internal functions (not really
necessary but Doxygen warns about that in XML mode).
Fix formatting in a few corner cases that automatic conversion can't
handle.
Rearrange some DOC_DISABLE blocks.
Always use what the old implementation called the "IO" allocation
scheme, which allows moving the content pointer past the initial
allocation. This is inexpensive and allows efficient shrinking.
Optimize xmlBufGrow, reusing shrunken memory as much as possible.
Simplify xmlBufAdd.
Make xmlBufBackToBuffer return an error on overflow.
Make "size" exclude the terminating null byte.
Always provide an initial size.
Reintroduce static buffers.
Remove xmlBufResize and several other functions.
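
The allocation scheme described above can be sketched with a miniature
buffer. `MiniBuf` and its functions are hypothetical names for
illustration only and do not mirror the real xmlBuf layout. The sketch
assumes that shrinking advances a content pointer inside the allocation,
that growing first tries to reuse the space freed by shrinking, and that
"size" excludes the terminating null byte.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature buffer, not the real xmlBuf. */
typedef struct {
    char *mem;      /* start of the allocation */
    char *content;  /* start of live data, may point past mem */
    size_t use;     /* bytes of live data */
    size_t size;    /* usable bytes from mem, excluding the terminator */
} MiniBuf;

static int minibuf_init(MiniBuf *b, size_t size) {
    if (size > SIZE_MAX - 1)
        return -1;
    b->mem = malloc(size + 1);
    if (b->mem == NULL)
        return -1;
    b->content = b->mem;
    b->use = 0;
    b->size = size;
    b->mem[0] = '\0';
    return 0;
}

/* Drop n bytes from the front: O(1), only the pointer moves. */
static size_t minibuf_shrink(MiniBuf *b, size_t n) {
    if (n > b->use)
        n = b->use;
    b->content += n;
    b->use -= n;
    return n;
}

/* Make room for `extra` more bytes after the live data. */
static int minibuf_grow(MiniBuf *b, size_t extra) {
    size_t gap = (size_t) (b->content - b->mem);  /* shrunken space */
    char *mem;

    assert(gap + b->use <= b->size);

    if (extra <= b->size - gap - b->use)
        return 0;                   /* enough room at the tail */

    if (extra <= b->size - b->use) {
        /* Reuse shrunken space: slide live data back to the start. */
        memmove(b->mem, b->content, b->use + 1);
        b->content = b->mem;
        return 0;
    }

    /* Allocate exactly what is needed and copy only the live data. */
    if (extra > SIZE_MAX - 1 - b->use)
        return -1;
    mem = malloc(b->use + extra + 1);
    if (mem == NULL)
        return -1;
    memcpy(mem, b->content, b->use + 1);
    free(b->mem);
    b->mem = mem;
    b->content = mem;
    b->size = b->use + extra;
    return 0;
}

static int minibuf_add(MiniBuf *b, const char *data, size_t len) {
    if (minibuf_grow(b, len) != 0)
        return -1;
    memcpy(b->content + b->use, data, len);
    b->use += len;
    b->content[b->use] = '\0';
    return 0;
}
```

A real implementation would likely over-allocate when it has to
reallocate. The sketch grows to the exact size to keep the arithmetic
easy to follow.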
Make xmlOpenCharEncodingHandler call xmlParseCharEncoding first so we
prefer our own handlers for names like "UTF8". Only UTF-16 needs an
exception.
Make callers check the return value. For UTF-8, a NULL encoding doesn't
mean an error.
Remove unnecessary UTF-8 check from htmlFindOutputEncoder. Don't try to
look up the ASCII handler since the HTML handler is always available.
Fix return code of xmlParseCharEncoding.
Should fix #744.
Handle malloc failure from xmlRaiseError.
Use xmlRaiseMemoryError.
Stop using xmlGenericError.
Remove argument from memory error handler.
Remove TODO macro.
Remove explicit integer casts as final operation
- in assignments
- when passing arguments
- when returning values
Remove casts
- to the same type
- from certain range-bound values
The main motivation is that these explicit casts don't change the result
of operations and only render UBSan's implicit-conversion checks
useless. Removing these casts allows UBSan to detect cases where
truncation or sign-changes occur unexpectedly.
Document some explicit casts as truncating and add a few missing ones.
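
To illustrate the motivation, here is a minimal sketch with hypothetical
function names. It assumes a build with Clang's
`-fsanitize=implicit-conversion`. An explicit cast as the final
operation yields the same value as the implicit conversion but hides it
from the sanitizer, while a range check makes the later conversion
provably lossless.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Explicit cast as the final operation: truncates silently for large
 * values and gives the implicit-conversion checks nothing to report.
 */
static int len_with_cast(size_t len) {
    return (int) len;
}

/*
 * No cast: the result is the same, but the conversion is implicit, so
 * the sanitizer can flag unexpected truncation at runtime.
 */
static int len_without_cast(size_t len) {
    return len;
}

/*
 * Range-bound value: after the check the conversion can't change the
 * value, so no cast is needed and the sanitizer stays quiet.
 */
static int len_checked(size_t len, int *out) {
    assert(out != NULL);
    if (len > INT_MAX)
        return -1;
    *out = len;
    return 0;
}
```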