Introduce XML_INPUT_HAS_ENCODING flag for xmlParserInput which is set
when xmlSwitchEncoding is called. The parser can use the flag to
reliably detect whether an encoding was already set via user override,
BOM or other auto-detection. In this case, the encoding declaration
won't be used to switch the encoding.
Before, an inscrutable mix of ctxt->charset, ctxt->input->encoding
and ctxt->input->buf->encoder was used.
Introduce private helper functions to switch encodings used by both the
XML and HTML parser:
- xmlDetectEncoding which skips over the BOM, allowing to remove the
BOM checks from other encoding functions.
- xmlSetDeclaredEncoding, replacing htmlCheckEncodingDirect, which warns
about encoding mismatches.
If users override the encoding, store the declared instead of the actual
encoding in xmlDoc. In this case, the actual encoding is known and the
raw value from the doc is more useful.
Also use the input flags to store the ISO-8859-1 fallback state.
Restrict the fallback to cases where no encoding was specified. (The
fallback is only useful in recovery mode and these days broken UTF-8 is
probably more likely than ISO-8859-1, so it might eventually be removed
completely.)
The 'charset' member of xmlParserCtxt is now unused. The 'encoding'
member of xmlParserInput is now unused.
The 'standalone' member of xmlParserInput is renamed to 'flags'.
A new parser state XML_PARSER_XML_DECL is added for the push parser.
It seems that this code path could only be triggered after an encoding
error in recovery mode. Creating char-ref nodes is unnecessary and
typically unexpected.
Follow-up to commit d0c3f01e. A parser context will be initialized to
SAX version 2, but this can be overridden with XML_PARSE_SAX1 later,
so we must initialize the SAX1 element handlers as well.
Change the check in xmlDetectSAX2 to only look for XML_SAX2_MAGIC, so
we don't switch to SAX1 if the SAX2 element handlers are NULL.
For some reason, xmlCtxtUseOptionsInternal set the start and end element
SAX handlers to the internal DOM builder functions when XML_PARSE_SAX1
was specified. This means that custom SAX handlers could never work with
that flag because these functions would receive the wrong user data
argument and crash immediately.
Fixes#535.
Make sure that xmlCharEncInput, xmlParserInputBufferPush and
xmlParserInputBufferGrow set the correct error code in the
xmlParserInputBuffer. Handle errors when calling these functions.
Revert another change from commit 98840d40.
Decode the whole buffer when reading from memory and switching to the
initial encoding. Add some comments about potential improvements.
* parser.c:
(xmlParseCharData):
- Check if the parser has stopped before advancing
`ctxt->input->cur`. This only occurs if a custom SAX error
handler calls xmlStopParser() on fatal errors.
Fixes#518.
To detect EBCDIC code pages, we used to switch the encoding twice and
had to be very careful not to decode data after the XML declaration
before the second switch. This relied on a hard-coded expected size of
the XML declaration and was complicated and unreliable.
Now we convert the first 200 bytes to EBCDIC-US and parse the encoding
declaration manually.
Don't try to grow the input buffer in xmlParserShrink. This makes sure
that no memory allocations are made and the function always succeeds.
Remove unnecessary invocations of SHRINK. Invoke SHRINK at the end of
DTD parsing loops.
Shrink before growing.
The input buffer is now grown reliably when calling CUR_CHAR
(xmlCurrentChar) or NEXT (xmlNextChar). This allows to remove many
other invocations of GROW.
- Lower the amount of expansion which is always allowed from
10MB to 1MB.
- Lower the maximum amplification factor from 10 to 5.
- Lower the "fixed cost" from 50 to 20.
If we try to continue parsing after an error in the internal or external
subset, entity expansion accounting gets more complicated. Simply halt
the parser.
Found with libFuzzer.
Don't set the "checked" flag when checking entities in default attribute
values. These entities could reference other entities which weren't
defined yet, so the check isn't reliable.
This fixes a short-lived regression which could lead to a call stack
overflow later in xmlStringGetNodeList.
Reporting errors is expensive and some abusive test cases can generate
an error for each invalid input byte. This causes the parser to spend
most of the time with error handling. Limit the number of errors and
warnings to 100.
Only add consumed bytes if
- we're not parsing an entity
- we're parsing external parameter entities for the first time.
Always ignore internal parameter entities.