When push parsing, we want to convert as much of the input as possible.
When pull parsing memory buffers, we want to convert data chunk by chunk
to save memory.
This was only used by Chromium/WebKit to detect whether xmlParseContent
really succeeded. It's a horrible, overcomplicated hack.
See 8c5848bd and #767.
Reuse some of the old members.
The "input" and "output" function pointers are actually of type
xmlCharEncConvFunc, accepting an additional argument. For default
handlers, this argument is unused, so this should work with most ABIs.
For iconv handlers, these function pointers used to be NULL but now
point to a function which requires the extra argument.
"iconv_in" and "iconv_out" are made void pointers. "uconv_in" and
"uconv_out" are renamed and made void pointers. This is unlikely to
cause issues.
We now expect that the built-in conversion functions correctly report
XML_ENC_ERR_SPACE. For UTF8ToHtml and the ISO-8859-X code, this will be
done in the following commits.
Add missing xmlCharEncoding enum values.
Simplify and speed up encoding lookup by using a table mapping names to
xmlCharEncoding enums and binary search. Rearrange the default handler
table to match the enum layout.
For some encodings we now only lookup the provided or most canonical
name instead of trying several names, expecting that iconv or ICU handle
aliases:
- IBM037 (EBCDIC)
- UCS-2
- UCS-4
- Shift_JIS
When looking up encodings with xmlLookupCharEncodingHandler, the
returned handler can have a different name than requested
(capitalization, internal aliases). This should eventually be fixed.
For now we revert part of commit 5b893fa9, start the lookup with
xmlFindHandler and add an explicit check for UTF-8.
Should fix the encoding name issue mentioned in #749.
Make xmlOpenCharEncodingHandler call xmlParseCharEncoding first so we
prefer our own handlers for names like "UTF8". Only UTF-16 needs an
exception.
Make callers check the return value. For UTF-8, a NULL encoding doesn't
mean an error.
Remove unnecessary UTF-8 check from htmlFindOutputEncoder. Don't try to
look up ASCII handler since the HTML handler is always available.
Fix return code of xmlParseCharEncoding.
Should fix#744.
Some exotic encodings like ISO646-FR don't support '#' characters, so
encoding a character reference can actually fail. Don't skip the
offending input in this case so the error will be reported on the next
call.
Introduce new API functions that return a separate error code if a
memory allocation fails.
- xmlOpenCharEncodingHandler
- xmlLookupCharEncodingHandler
Fix a few places where malloc failures weren't reported.
Even with flush set to true, xmlCharEncInput didn't guarantee to decode
all data. This complicated the push parser.
Remove the flush flag and always decode all available data.
Also fix ICU code where the flush flag has a different meaning. Always
set flush to false and retry even with empty input buffers.
Fix short-lived regression from previous commit.
It might be safer to make xmlBufSetInputBaseCur use the original buffer
even in case of errors.
Found by OSS-Fuzz.
Make sure that xmlCharEncInput, xmlParserInputBufferPush and
xmlParserInputBufferGrow set the correct error code in the
xmlParserInputBuffer. Handle errors when calling these functions.