Don't create a copy of the whole input buffer. Read the data chunk by
chunk to save memory.
Historically, it was probably envisioned to read data from memory
without additional copying. This doesn't work reliably with the current
design of the XML parser which requires a terminating null byte at the
end of input buffers. This lead to xmlReadMemory interfaces, which
expect pointer and size arguments, being changed to make a
zero-terminated copy of the input buffer. Interfaces based on
xmlReadDoc, which actually expect a zero-terminated string and
would make zero-copy operation work, were then simplified to rely on
xmlReadMemoryi, resulting in an unnecessary copy.
To avoid copying (possibly gigabytes) of memory temporarily, we now
stream in-memory input just like content read from files in a
chunk-by-chunk fashion (using a somewhat outdated INPUT_CHUNK size of
250 bytes). As a side effect, we also avoid another copy of the whole
input when handling non-UTF-8 data which was made possible by some
earlier commits.
Interfaces expecting zero-terminated strings now make use of strnlen
which unfortunately isn't part of the standard C library and only
mandated since POSIX 2008.
XIncludes involve XPath processing which can still lead to timeouts when
fuzzing. This will probably take a while to fix. The rest of the XML
parsing code should hopefully run without timeouts now. OSS-Fuzz only
shows a single timeout test case, so separate the XInclude from the core
XML fuzzer.
Now that entity expansion issues should be fixed, we should get more
interesting timeout errors from OSS-Fuzz. Disable XInclude for now,
since it often timeouts in XPath computations. The XInclude tests should
be moved to a separate fuzz target.
With very few exceptions, utilities and test programs don't require any
external libraries.
- xmllint and xmlcatalog need libreadline
- runtest and testThreads need pthreads
Brown paper bag release, some recently added sources were missing from
the 2.9.11 tarball:
- configure.ac: bump version
- fuzz/Makefile.am: add fuzz.h and seed/regexp to EXTRA_DIST
OSS-Fuzz has been fuzzing the HTML parser with inputs up to 1 MB for
several hundred hours without hitting the 20s timeout. It seems that
most timeouts resulting from accidentally quadratic behavior in the
HTML parser have been fixed. Start to gradually reduce the timeout to
find new performance issues.
htmlDocDumpMemory uses the "HTML" encoding if no other encoding was
specified in the source HTML. This encoding can be extremely slow
because of an inefficiency in htmlEntityValueLookup. Stop encoding
the output for now.
Remove the libfuzzer max_len option which doesn't apply to other
fuzzing engines. Enforce the maximum length directly in the fuzz
targets. For the xml target, lower the maximum when expanding entities
to avoid timeout and OOM errors.
Always limit nested functions calls to 5000. This avoids call stack
overflows with deeply nested expressions.
The expression parser produces about 10 nested function calls when
parsing a subexpression in parentheses, so the effective nesting limit
is about 500 which should be more than enough.
Use a lower limit when fuzzing to account for increased memory usage
when using sanitizers.