Remove inaccurate xmlParseCheckTransition check.
Remove non-incremental xmlParseGetLasts check.
Add functions that check for several boundary constructs more
accurately, keeping track of progress in ctxt->checkIndex.
Fixes#439.
Remove explicit integer casts as final operation
- in assignments
- when passing arguments
- when returning values
Remove casts
- to the same type
- from certain range-bound values
The main motivation is that these explicit casts don't change the result
of operations and only render UBSan's implicit-conversion checks
useless. Removing these casts allows UBSan to detect cases where
truncation or sign-changes occur unexpectedly.
Document some explicit casts as truncating and add a few missing ones.
Private functions were previously declared
- in header files in the root directory
- in public headers guarded with IN_LIBXML
- in libxml.h
- redundantly in source files that used them.
Consolidate all private header files in include/private.
If the legacy functions are disabled, the default "V1" HTML SAX handler
isn't initialized in threads other than the main thread.
htmlInitParserCtxt would later use the empty V1 SAX handler, resulting
in NULL documents.
Change htmlInitParserCtxt to initialize the HTML SAX handler by calling
xmlSAX2InitHtmlDefaultSAXHandler. This removes the ability to change the
default handler but is more in line with the XML parser which
initializes the SAX handler by calling xmlSAXVersion, ignoring the V1
default handler.
Fixes#399.
xmlCtxtReadDoc used to create an input stream involving
xmlNewStringInputStream. This would create a stream without an input
buffer, causing problems with encodings (see #34).
After commit aab584dc3, an error was returned even with UTF-8 encodings
which happened to work before.
Make xmlCtxtReadDoc call xmlCtxtReadMemory which doesn't suffer from
these issues. Also fix htmlCtxtReadDoc.
Fixes#397.
Commit 4fd69f3e fixed handling of '<' characters not followed by an
ASCII letter. But a '<!' sequence followed by invalid characters should
be treated as bogus comment and skipped.
Fixes#380.
* HTMLparser.c:
(htmlSkipBlankChars):
* parser.c:
(xmlSkipBlankChars):
- Cap the return value at INT_MAX.
- The commit range that OSS-Fuzz listed for the fix didn't make
any changes to xmlSkipBlankChars(), so it seems like this
issue may still exist.
Found by OSS-Fuzz Issue 44803.
These functions shouldn't be part of the public API. Most init
functions are only thread-safe when called from xmlInitParser. Global
variables should only be cleaned up by calling xmlCleanupParser.
Only try to parse a start tag if there's a '<' followed by an ASCII
letter. This is more in line with HTML5 and the old behavior in
recovery mode. Emit a literal '<' if the following character is
invalid.
Fixes#101.
Fixes#339.
Use a bitmask instead of magic values to
- keep track whether the validation context is part of a parser context
- keep track whether xmlValidateDtdFinal was called
This allows to add addtional flags later.
Note that this deliberately changes the name of a public struct member,
assuming that this was always private data never to be used by client
code.
The old approach introduced a regression, see issue #312 and the
previous commit. Disable code that tries to recover from invalid start
tags. This only affects "recovery" mode.
Add a comment outlining a better fix in accordance with the HTML5 spec.
Revert part of commit 173a0830 that changed behavior when parsing
malformed start tags with the push parser. This reintroduces quadratic
behavior in recovery mode which will be worked around in the next
commit.
Fixes#312.
Fix regression introduced with b25acce8. Some users like libxslt may
call the HTML output functions on documents with uppercase tag names,
so we must keep case-insensitive string comparison.
Fixes#248.
Switch to binary search. This is the first time bsearch is used in the
libxml2 code base. But it's a standard library function since C89 and
should be portable.
Under certain circumstances, the HTML parser would try to guess and
switch input encodings multiple times, leading to slow processing of
documents with encoding errors. The repeated scanning of the input
buffer when guessing encodings could even lead to quadratic behavior.
The code htmlCurrentChar probably assumed that if there's an encoding
handler, it is guaranteed to produce valid UTF-8. This holds true in
general, but if the detected encoding was "UTF-8", the UTF8ToUTF8
encoding handler simply invoked memcpy without checking for invalid
UTF-8. This still must be fixed, preferably by not using this handler
at all.
Also leave a note that switching encodings twice seems impossible to
implement correctly. Add a check when handling UTF-8 encoding errors
in htmlCurrentChar to avoid this situation, even if encoders produce
invalid UTF-8.
Found by OSS-Fuzz.
Null bytes in the input stream do not necessarily signal an EOF
condition. Check the stream pointers for EOF to avoid quadratic
rescanning of input data.
Note that the CUR_CHAR macro used in functions like htmlParseCharData
calls htmlCurrentChar which translates null bytes.
Found by OSS-Fuzz.