If DllMain is used, rely on it working as expected. The old code seemed
to attempt to free global state of other threads if, for some reason,
the DllMain mechanism didn't work.
In a static build, register a destructor with
RegisterWaitForSingleObject.
Make public functions xmlGetGlobalState and xmlInitializeGlobalState
no-ops.
Move initialization and registration of global state objects to
xmlInitGlobalState. Lookup global state with xmlGetThreadLocalStorage
which can be inlined nicely.
Also cleanup global state when using TLS. xmlLastError must be reset.
Update hash function from classic Jenkins OAAT (dict.c) and a variant of
DJB2 (hash.c) to "GoodOAAT" taken from the SMHasher repo. This hash
function passes all SMHasher tests.
When legacy support is requested, always enable stubs for FTP and
XPointer location modules which were removed from the standard
configuration. Going forward, the --with-legacy configuration option
should be used to provide maximum ABI compatibility.
Fixes#433.
Even with flush set to true, xmlCharEncInput didn't guarantee to decode
all data. This complicated the push parser.
Remove the flush flag and always decode all available data.
Also fix ICU code where the flush flag has a different meaning. Always
set flush to false and retry even with empty input buffers.
Don't create a copy of the whole input buffer. Read the data chunk by
chunk to save memory.
Historically, it was probably envisioned to read data from memory
without additional copying. This doesn't work reliably with the current
design of the XML parser which requires a terminating null byte at the
end of input buffers. This lead to xmlReadMemory interfaces, which
expect pointer and size arguments, being changed to make a
zero-terminated copy of the input buffer. Interfaces based on
xmlReadDoc, which actually expect a zero-terminated string and
would make zero-copy operation work, were then simplified to rely on
xmlReadMemoryi, resulting in an unnecessary copy.
To avoid copying (possibly gigabytes) of memory temporarily, we now
stream in-memory input just like content read from files in a
chunk-by-chunk fashion (using a somewhat outdated INPUT_CHUNK size of
250 bytes). As a side effect, we also avoid another copy of the whole
input when handling non-UTF-8 data which was made possible by some
earlier commits.
Interfaces expecting zero-terminated strings now make use of strnlen
which unfortunately isn't part of the standard C library and only
mandated since POSIX 2008.
Introduce XML_INPUT_HAS_ENCODING flag for xmlParserInput which is set
when xmlSwitchEncoding is called. The parser can use the flag to
reliably detect whether an encoding was already set via user override,
BOM or other auto-detection. In this case, the encoding declaration
won't be used to switch the encoding.
Before, an inscrutable mix of ctxt->charset, ctxt->input->encoding
and ctxt->input->buf->encoder was used.
Introduce private helper functions to switch encodings used by both the
XML and HTML parser:
- xmlDetectEncoding which skips over the BOM, allowing to remove the
BOM checks from other encoding functions.
- xmlSetDeclaredEncoding, replacing htmlCheckEncodingDirect, which warns
about encoding mismatches.
If users override the encoding, store the declared instead of the actual
encoding in xmlDoc. In this case, the actual encoding is known and the
raw value from the doc is more useful.
Also use the input flags to store the ISO-8859-1 fallback state.
Restrict the fallback to cases where no encoding was specified. (The
fallback is only useful in recovery mode and these days broken UTF-8 is
probably more likely than ISO-8859-1, so it might eventually be removed
completely.)
The 'charset' member of xmlParserCtxt is now unused. The 'encoding'
member of xmlParserInput is now unused.
The 'standalone' member of xmlParserInput is renamed to 'flags'.
A new parser state XML_PARSER_XML_DECL is added for the push parser.
To detect EBCDIC code pages, we used to switch the encoding twice and
had to be very careful not to decode data after the XML declaration
before the second switch. This relied on a hard-coded expected size of
the XML declaration and was complicated and unreliable.
Now we convert the first 200 bytes to EBCDIC-US and parse the encoding
declaration manually.
Don't try to grow the input buffer in xmlParserShrink. This makes sure
that no memory allocations are made and the function always succeeds.
Remove unnecessary invocations of SHRINK. Invoke SHRINK at the end of
DTD parsing loops.
Shrink before growing.
Note that this changes the return value of public function
xmlSchemaInitTypes from void to int. This shouldn't break the ABI on
most platforms.
Found when investigating #500.
There's too much code which assumes that if ctxt->value is non-null,
a value can be successfully popped off the stack. This assumption can
break with stack frames when malloc fails.
Instead of trying to fix all call sites, remove the stack frame logic.
It only offered very little protection against misbehaving extension
functions. We already check the stack size after a function call which
should be enough.
Found by OSS-Fuzz.
In xpath.c there's a lot of code like:
valuePush(ctxt, xmlCacheNewX());
...
valuePop(ctxt);
If xmlCacheNewX fails, no value will be pushed on the stack. If there's
no error check in between, valuePop will pop an unrelated value which
can lead to use-after-free errors.
Instead of trying to fix all call sites, we simply stop popping values
if an error was signaled. This requires to change the CHECK_TYPE macro
which is often used to determine whether a value can be safely popped.
Found with libFuzzer, see #344.
Reporting errors is expensive and some abusive test cases can generate
an error for each invalid input byte. This causes the parser to spend
most of the time with error handling. Limit the number of errors and
warnings to 100.
This commit implements robust detection of entity amplification attacks,
better known as the "billion laughs" attack.
We now limit the size of the document after substitution of entities to
10 times the size before expansion. This guarantees linear behavior by
definition. There already was a similar check before, but the accounting
of "sizeentities" (size of external entities) and "sizeentcopy" (size of
all copies created by entity references) wasn't accurate.
We also need saturation arithmetic since we're historically limited to
"unsigned long" which is 32-bit on many platforms.
A maximum of 10 MB of substitutions is always allowed. This should make
use cases like DITA work which have caused problems in the past.
The old checks based on the number of entities were removed. This is
accounted for by adding a fixed cost to each entity reference.
Entity amplification checks are now enabled even if XML_PARSE_HUGE is
set. This option is mainly used to allow larger text nodes. Most users
were unaware that it also disabled entity expansion checks.
Some of the limits might be adjusted later. If this change turns out to
affect legitimate use cases, we can add a separate parser option to
disable the checks.
Fixes#294.
Fixes#345.
Instead of abusing the LSB of the "checked" member, store the result
of testing for occurrence of '<' character in "flags".
Also use the flags in xmlParseStringEntityRef instead of rescanning
every time.
To check whether an entity was already parsed, the code previously
tested whether "checked" was non-zero or "children" was non-null. The
"children" check could be unreliable because an empty entity also
results in an empty (NULL) node list. Use a separate flag to make this
check more reliable.