Remove explicit integer casts as final operation
- in assignments
- when passing arguments
- when returning values
Remove casts
- to the same type
- from certain range-bound values
The main motivation is that these explicit casts don't change the result
of operations and only render UBSan's implicit-conversion checks
useless. Removing these casts allows UBSan to detect cases where
truncation or sign-changes occur unexpectedly.
Document some explicit casts as truncating and add a few missing ones.
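A minimal sketch of why a trailing cast hides the problem, assuming a
Clang build with -fsanitize=implicit-conversion. The function names are
hypothetical and are not taken from libxml2:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* With the cast the narrowing is explicit, so the sanitizer stays silent
 * even when n does not fit in an int. */
static int demo_len_cast(size_t n)
{
    return (int) n;
}

/* Without the cast the conversion is implicit, so an instrumented build
 * can report it at run time when the value actually changes. */
static int demo_len_implicit(size_t n)
{
    return n;
}

/* Where truncation is never intended, check the range before narrowing. */
static int demo_len_checked(size_t n)
{
    assert(n <= INT_MAX);
    return (int) n;
}
```

For values that fit in an int, all three variants return the same
result. They differ only in what the sanitizer can observe.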
Private functions were previously declared
- in header files in the root directory
- in public headers guarded with IN_LIBXML
- in libxml.h
- redundantly in source files that used them.
Consolidate all private header files in include/private.
* buf.c:
(xmlBufAvail):
- Return the number of bytes available in the buffer, but do not
include a byte for the NUL terminator so that it is reserved.
* encoding.c:
(xmlCharEncFirstLineInput):
(xmlCharEncInput):
(xmlCharEncOutput):
* xmlIO.c:
(xmlOutputBufferWriteEscape):
- Remove code that subtracts 1 from the return value of
xmlBufAvail(). It was implemented inconsistently anyway.
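The reservation can be sketched as follows. The struct and function
below are hypothetical stand-ins, not the real xmlBuf layout:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffer bookkeeping, for illustration only. */
typedef struct {
    size_t size;  /* total bytes allocated */
    size_t use;   /* bytes already filled, not counting the terminator */
} demo_buf;

/* Bytes a caller may still write, keeping one byte back for the NUL. */
static size_t demo_buf_avail(const demo_buf *buf)
{
    assert(buf != NULL);
    return (buf->size > buf->use) ? buf->size - buf->use - 1 : 0;
}
```

Because the reservation lives in one place, callers no longer need to
subtract 1 themselves.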
Similar to 8f5710379, mark more static data structures with the
`const` keyword.
Also fix placement of `const` in encoding.c.
Original patch by Sarah Wilkin.
These functions shouldn't be part of the public API. Most init
functions are only thread-safe when called from xmlInitParser. Global
variables should only be cleaned up by calling xmlCleanupParser.
Fix a bug in the return value of xmlCharEncOutput that causes
xmlNodeDumpOutput to drop characters seemingly at random.
xmlCharEncOutput returns zero if the length of the input buffer is
zero. It ignores the case where it has already encoded the input
buffer and the input's length is zero only because xmlEncOutputChunk
returned a -2 error and the code that follows recovered from the
error by encoding the input.
xmlCharEncOutput keeps a count of the bytes written to the output
buffer but returns zero instead of that total in this situation. This
commit fixes the issue by returning the total number of bytes
instead. As a result, xmlNodeDumpOutput continues writing rather than
stopping because it mistakenly thinks the output buffer did not
change in that iteration.
Fixes #314
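A simplified sketch of the return-value rule, using a hypothetical toy
encoder rather than the real xmlCharEncOutput. Non-ASCII bytes stand in
for the -2 recovery path:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical input cursor, for illustration only. */
typedef struct {
    const unsigned char *data;
    size_t len;
} demo_input;

/* Copies ASCII bytes and replaces anything else with '?'.
 * Returns the total number of bytes written to out. */
static int demo_enc_output(demo_input *in, char *out, size_t outlen)
{
    size_t total = 0;

    assert(in != NULL && out != NULL);
    for (;;) {
        /* Input drained or output full: report everything written so
         * far. The bug described above corresponds to returning 0 here. */
        if (in->len == 0 || total == outlen)
            return (int) total;
        if (*in->data < 0x80) {
            out[total++] = (char) *in->data;
        } else {
            /* Stand-in for the -2 recovery path: emit a replacement. */
            out[total++] = '?';
        }
        in->data++;
        in->len--;
    }
}
```

A caller that treats 0 as "nothing changed" keeps going here, because
the function reports the bytes it wrote even though the input is empty
when it returns.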
This makes the logic in UTF16BEToUTF8() match UTF16LEToUTF8().
* encoding.c:
(UTF16LEToUTF8):
- Fix comment to describe what the code does.
(UTF16BEToUTF8):
- Fix undefined behavior; the same fix was applied to UTF16LEToUTF8() in
2f9382033e.
- Add bounds check to while() loop; the same check was applied to
UTF16LEToUTF8() in be803967db.
- Do not return -2 when (in >= inend) to fix the bug. This was
applied to UTF16LEToUTF8() in 496a1cf592.
- Inline (<< 8) statements to match UTF16LEToUTF8().
Add the following tests and results:
test/text-4-byte-UTF-16-BE-offset.xml
test/text-4-byte-UTF-16-BE.xml
test/text-4-byte-UTF-16-LE-offset.xml
test/text-4-byte-UTF-16-LE.xml
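A standalone sketch of the decoding rules listed above, under the
assumption that a surrogate pair cut off at the end of the input is left
unconsumed rather than reported as an error. The function name and
signature are hypothetical and differ from UTF16BEToUTF8():

```c
#include <assert.h>
#include <stddef.h>

/* Converts UTF-16BE bytes to UTF-8. Returns the number of bytes written,
 * or -2 on a malformed surrogate. *consumed receives the number of input
 * bytes processed. Stops early when input or output space runs out. */
static int demo_utf16be_to_utf8(const unsigned char *in, size_t inlen,
                                unsigned char *out, size_t outlen,
                                size_t *consumed)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL && consumed != NULL);
    while (inlen - i >= 2) {
        unsigned long c = ((unsigned long) in[i] << 8) | in[i + 1];
        size_t used = 2, need;

        if ((c & 0xFC00) == 0xD800) {            /* high surrogate */
            unsigned long d;
            if (inlen - i < 4)
                break;                           /* pair incomplete: not an error */
            d = ((unsigned long) in[i + 2] << 8) | in[i + 3];
            if ((d & 0xFC00) != 0xDC00) {
                *consumed = i;
                return -2;
            }
            c = 0x10000 + (((c & 0x3FF) << 10) | (d & 0x3FF));
            used = 4;
        } else if ((c & 0xFC00) == 0xDC00) {     /* lone low surrogate */
            *consumed = i;
            return -2;
        }

        need = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
        if (outlen - o < need)
            break;                               /* output full */
        if (need == 1) {
            out[o++] = (unsigned char) c;
        } else if (need == 2) {
            out[o++] = (unsigned char) (0xC0 | (c >> 6));
            out[o++] = (unsigned char) (0x80 | (c & 0x3F));
        } else if (need == 3) {
            out[o++] = (unsigned char) (0xE0 | (c >> 12));
            out[o++] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char) (0x80 | (c & 0x3F));
        } else {
            out[o++] = (unsigned char) (0xF0 | (c >> 18));
            out[o++] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
            out[o++] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char) (0x80 | (c & 0x3F));
        }
        i += used;
    }
    *consumed = i;
    return (int) o;
}
```

For example, the bytes D8 3D DE 00 (U+1F600) decode to the four UTF-8
bytes F0 9F 98 80. If only D8 3D is available, the sketch writes
nothing for it and waits for the rest of the pair.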
Under certain circumstances, the HTML parser would try to guess and
switch input encodings multiple times, leading to slow processing of
documents with encoding errors. The repeated scanning of the input
buffer when guessing encodings could even lead to quadratic behavior.
The code in htmlCurrentChar probably assumed that if there's an encoding
handler, it is guaranteed to produce valid UTF-8. This holds true in
general, but if the detected encoding was "UTF-8", the UTF8ToUTF8
encoding handler simply invoked memcpy without checking for invalid
UTF-8. This still must be fixed, preferably by not using this handler
at all.
Also leave a note that switching encodings twice seems impossible to
implement correctly. Add a check when handling UTF-8 encoding errors
in htmlCurrentChar to avoid this situation, even if encoders produce
invalid UTF-8.
Found by OSS-Fuzz.
The return type of xmlRegisterCharEncodingHandler() is void, so the caller
cannot determine whether xmlRegisterCharEncodingHandler() succeeded.
When nbCharEncodingHandler >= MAX_ENCODING_HANDLERS, the "handler" is
not added to the array "handlers". As a result, the memory of "handler"
can no longer be managed or released, and it leaks.
Add "xmlFree(handler)" on the failure branch of
xmlRegisterCharEncodingHandler() to fix the leak.
Reported-by: wuqing <wuqing30@huawei.com>
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
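The ownership rule can be sketched with a hypothetical fixed-size
registry. The names below are illustrative, and unlike the real
function this one returns a status so the example can be checked:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_HANDLERS 2

/* Hypothetical handler type, for illustration only. */
typedef struct {
    char *name;
} demo_handler;

static demo_handler *demo_handlers[DEMO_MAX_HANDLERS];
static int demo_nb_handlers = 0;

/* Allocates a handler with its own copy of name, or returns NULL. */
static demo_handler *demo_new_handler(const char *name)
{
    demo_handler *h;
    size_t n;

    assert(name != NULL);
    h = malloc(sizeof *h);
    if (h == NULL)
        return NULL;
    n = strlen(name) + 1;
    h->name = malloc(n);
    if (h->name == NULL) {
        free(h);
        return NULL;
    }
    memcpy(h->name, name, n);
    return h;
}

/* Takes ownership of handler. Returns 0 on success, -1 on failure.
 * On failure the handler is freed, since no one else can release it. */
static int demo_register(demo_handler *handler)
{
    if (handler == NULL)
        return -1;
    if (demo_nb_handlers >= DEMO_MAX_HANDLERS) {
        free(handler->name);
        free(handler);
        return -1;
    }
    demo_handlers[demo_nb_handlers++] = handler;
    return 0;
}
```

Because the function takes ownership in both branches, the caller never
has to guess whether the pointer is still its responsibility.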
Make xmlEncInputChunk and xmlEncOutputChunk return 0 on success and
never a positive value.
Make xmlCharEncFirstLineInt, xmlCharEncFirstLineInt and
xmlCharEncOutFunc return the number of bytes written.
Commit 407b393d introduced a regression caused by xmlCharEncOutput
returning 0 in case of success instead of the number of bytes written.
Always use its return value for nbchars in xmlOutputBufferWrite.
Fixes #166.
Closes: https://bugzilla.gnome.org/show_bug.cgi?id=793028
It seems this line was accidentally copied over from xmlCharEncOutFunc.
In xmlCharEncOutput output is a pointer so incrementing it by ret can
point it where it wasn't supposed to be pointing. Luckily the current
implementation doesn't dereference the pointer after advancing it.
Signed-off-by: Daniel Veillard <veillard@redhat.com>
When flush=TRUE is always set across multiple reads, ICU
will not correctly handle UTF-8 chars truncated at read
boundaries.
The fix is to set flush=TRUE only on final read, and to
provide a pivot buffer which is maintained by libxml
between calls to ucnv_convertEx.
If a character can't be represented in the output encoding, it is
converted to a character reference. This used to replace the
character in the input stream by calling xmlBufAddHead or
xmlBufferAddHead. These functions shifted the entire input array
around, leading to quadratic performance when converting a run of
non-representable characters. This is most pronounced when dumping to
memory.
Output the charref directly instead.
Found with libFuzzer.
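A minimal sketch of writing the character reference straight to the
output, assuming an ASCII target encoding and a hypothetical function
name. The input is never modified, so nothing has to be shifted:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Encodes code points as ASCII, emitting "&#N;" for anything that does
 * not fit. Returns the string length, or -1 if out is too small. */
static int demo_ascii_with_charrefs(const unsigned long *cps, size_t n,
                                    char *out, size_t outlen)
{
    size_t i, o = 0;

    assert(cps != NULL && out != NULL);
    if (outlen == 0)
        return -1;
    for (i = 0; i < n; i++) {
        if (cps[i] < 0x80) {
            if (outlen - o < 2)          /* the char plus the terminator */
                return -1;
            out[o++] = (char) cps[i];
        } else {
            /* Write the charref directly into the output buffer. */
            int len = snprintf(out + o, outlen - o, "&#%lu;", cps[i]);
            if (len < 0 || (size_t) len >= outlen - o)
                return -1;
            o += (size_t) len;
        }
    }
    out[o] = '\0';
    return (int) o;
}
```

Each non-representable character costs a bounded amount of work, so a
long run of them is processed in linear time.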
For https://bugzilla.gnome.org/show_bug.cgi?id=711149
In Function:
int xmlCharEncCloseFunc(xmlCharEncodingHandler *handler)
If the freed handler is one of the entries in the handlers[] list, then
that handlers[i] entry is left dangling. This may lead to crashes
wherever handlers is read.
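One way to avoid the dangling entry is to refuse to free a handler that
the registry still references. This is a sketch under that assumption,
with hypothetical names and return values, and not necessarily how the
fix was implemented:

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_MAX_HANDLERS 4

/* Hypothetical handler type, for illustration only. */
typedef struct {
    int id;
} demo_handler;

static demo_handler *demo_handlers[DEMO_MAX_HANDLERS];

/* Returns 1 if the handler was freed, 0 if it was kept because the
 * registry still points at it. */
static int demo_close(demo_handler *handler)
{
    size_t i;

    assert(handler != NULL);
    for (i = 0; i < DEMO_MAX_HANDLERS; i++) {
        if (demo_handlers[i] == handler)
            return 0;    /* registered: freeing would leave a dangling entry */
    }
    free(handler);
    return 1;
}
```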
https://bugzilla.gnome.org/show_bug.cgi?id=692915
The new set of converting functions tried to limit the encoding
conversion of the raw buffer to the consumption one to work in
a more progressive fashion. Unfortunately this was bad for
performance and led to errors on progressive parsing when
a very large chunk was close to the end of the document. Fix
the new internal function and switch back to the old way of
converting. Fix another bug in the process.
Various cleanups
* configure.in: force regeneration of APIs in my environment
* buf.c buf.h enc.h encoding.c include/libxml/tree.h
include/libxml/xmlerror.h save.h tree.c: various comment cleanups
pointed out by apibuild
* doc/apibuild.py: added the 3 new internal headers in the excludes
* doc/libxml2-api.xml doc/libxml2-refs.xml: regenerated the API
* doc/symbols.xml: listing new entry points for 2.9.0
* doc/devhelp/*: regenerated
* encoding.c: adds xmlCharEncFirstLineInput, xmlCharEncInput and
xmlCharEncOutput
* enc.h: the functions are not made public but added to this new header
When a node is dumped with a new encoding, we may encounter characters
that are not supported in the new encoding. libxml2 handles this by
replacing the character with character references, but in some encodings
this can result in an infinite loop when the character references
themselves contain unsupported characters.
This fixes the infinite loop by undoing a character reference substitution
when it cannot be inserted, and returning an encoder error.
This bug was noticed when looking into an infinite loop bug report for
the Ruby Nokogiri project. The original bug report, "nokogiri process
hangs on call to inner_html" is here:
https://github.com/tenderlove/nokogiri/issues/400
* encoding.c parser.c parserInternals.c: when we autodetect an encoding
but it's actually not completely compatible with the one declared
great care must be taken to not convert more than just the first line.
Led to some refactoring, more private functions and a bit of cleanup.