Nick Wellnhofer
8873a49846
html: Fix areBlanks check
Short-lived regression from 71122421.
2025-03-09 16:21:13 +01:00
Nick Wellnhofer
5f0b1378d7
parser: Add more parser context accessors
Fixes #763.
2025-03-08 22:36:06 +01:00
Nick Wellnhofer
5237d90fae
html: Process data before switching encoding
This reduces the amount of data to convert and avoids issues with EOF
detection.
Also reset EOF flag after switching encoding as a precaution.
2025-03-07 21:19:16 +01:00
Nick Wellnhofer
0b27097a92
encoding: Rename unprefixed public functions
2025-03-04 16:46:53 +01:00
Nick Wellnhofer
5ed4eafd8a
html: Don't invoke SAX callbacks if parser was stopped
2025-02-22 14:52:47 +01:00
Nick Wellnhofer
63dfcca670
fuzz: Reduce initial array size
2025-02-20 12:22:12 +01:00
Nick Wellnhofer
b8234e8c73
html: Fix check for partial named character references
Digits are allowed after the first character.
2025-02-19 12:53:32 +01:00
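The rule stated in b8234e8c73 (digits are allowed after the first character) can be illustrated with a small standalone check. This is only a sketch, not libxml2's code: the function name could_be_partial_named_ref and the decision to treat a lone '&' as "needs more data" are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical helpers, not part of libxml2. */
static int is_ascii_alpha(unsigned char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

static int is_ascii_digit(unsigned char c) {
    return c >= '0' && c <= '9';
}

/*
 * Returns 1 if buf[0..len) could be the start of a named character
 * reference that is still waiting for more input: '&', then a letter,
 * then any mix of letters and digits, with no terminator seen yet.
 * Returns 0 otherwise.
 */
int could_be_partial_named_ref(const char *buf, size_t len) {
    size_t i;

    if (len == 0 || buf[0] != '&')
        return 0;
    if (len == 1)
        return 1; /* assumption: a lone '&' needs more data */
    /* The first character of the name must be a letter. */
    if (!is_ascii_alpha((unsigned char) buf[1]))
        return 0;
    /* Later characters may be letters or digits (e.g. "frac12"). */
    for (i = 2; i < len; i++) {
        unsigned char c = (unsigned char) buf[i];
        if (!is_ascii_alpha(c) && !is_ascii_digit(c))
            return 0;
    }
    return 1;
}
```

With this check, a chunk ending in "&frac1" is treated as a possibly incomplete reference, while "&1" is not.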
Nick Wellnhofer
7a61c32bfa
html: Use enum instead of magic values for insertion modes
2025-02-17 11:41:57 +01:00
Nick Wellnhofer
8cf6129bbd
html: Stop implying <p> start tags
Only <html>, <head> or <body> should be implied. Opening extra <p> tags
has always been a libxml2 quirk.
2025-02-13 20:20:17 +01:00
Nick Wellnhofer
71122421a1
html: Make implied <p> tags more deterministic
libxml2's HTML parser adds <p> start tags in some situations. This
behavior, which doesn't follow any standard, was added in 2000, see
here: http://veillard.com/XML/messages/0655.html
Text nodes that only contain whitespace don't imply a <p> tag, but the
whitespace check cannot work reliably if we're parsing partial text data,
which can happen with both the pull and the push parser.
The logic in `areBlanks` is hard to follow. The checks involving `CUR`
depend on the position of the input pointer and seem dubious. It's also
possible that the behavior changed inadvertently with a later commit.
As a result, it's hard to come up with good test cases.
We now process leading whitespace before creating implied tags. This is
more in line with HTML5 and should avoid at least some issues with
partial text data.
For example, parsing the string "<head> x" used to result in:
<html>
<head></head>
<body><p> x</p></body>
</html>
And now results in:
<html>
<head> </head>
<body><p>x</p></body>
</html>
Except for the implied <p> tag, this matches HTML5.
2025-02-13 14:31:44 +01:00
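The idea in 71122421a1 of handling leading whitespace before any implied tags are created can be sketched as a split of the text run into a whitespace prefix and a remainder. This is a minimal illustration, not the parser's actual code: the helper names are hypothetical, and the whitespace set (space, tab, LF, FF, CR) is assumed from HTML's definition of ASCII whitespace.

```c
#include <stddef.h>

/* Assumed whitespace set: space, tab, LF, FF, CR. */
static int is_html_space(unsigned char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
}

/*
 * Hypothetical helper: returns the length of the whitespace prefix of
 * text[0..len). A parser following the commit's approach would emit
 * this prefix into the current element first, and only then create
 * implied tags for the remaining text[prefix..len), if any is left.
 */
size_t leading_space_len(const char *text, size_t len) {
    size_t n = 0;

    while (n < len && is_html_space((unsigned char) text[n]))
        n++;
    return n;
}
```

For the text " x" that follows "<head>" in the commit's example, the prefix length is 1: the single space stays inside <head>, and only "x" goes on to trigger the implied <body> and <p>.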
Nick Wellnhofer
8d7e38d536
fuzz: Ignore encodings when fuzzing on Apple
Not long ago, Apple decided to replace GNU libiconv with a patched-up
version of FreeBSD's iconv implementation in their operating systems.
Unfortunately, the quality of both the original implementation and
Apple's patches is so abysmal that you routinely find issues when
fuzzing your own code.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
68be036f29
fuzz: Disable HTML encoding detection for now
This doesn't work with the push parser.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
c13fcc1910
html: Chunk text data in push parser
Follow the logic of the XML parser and chunk large text nodes.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
080285724b
html: Make data parsing modes work with push parser
This can't be solved with a simple scan for a terminator. Instead, we
make htmlParseCharData handle incomplete data if the "partial" flag is
set.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
4be1e8befb
html: Simplify htmlParseTryOrFinish a little
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
12732592ef
html: Remove unused epilog state
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
70bf754e24
html: Fix pull-parsing of incomplete end tags
Handle this HTML5 quirk in htmlParseEndTag.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
4a776c78ec
html: Use htmlParseElementInternal in push parser
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
ba1537374b
html: Fix corner case when push-parsing HTML5 comments
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
e48fb5e4f2
html: Handle incomplete UTF-8 when push-parsing
For now, incomplete UTF-8 is always an error in push mode.
Eventually, we could pass chunked data to the character handler when
push-parsing. Then we'd have to handle incomplete sequences.
2025-02-02 11:15:45 +01:00
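Commit e48fb5e4f2 notes that passing chunked data to the character handler would require handling incomplete sequences. One way to detect a chunk that ends in the middle of a UTF-8 sequence is sketched below. The function utf8_incomplete_tail is hypothetical and not taken from libxml2. It only inspects the last few bytes and does not validate the rest of the buffer.

```c
#include <stddef.h>

/*
 * Returns the number of bytes at the end of buf[0..len) that begin a
 * multi-byte UTF-8 sequence whose continuation bytes have not all
 * arrived yet. Returns 0 if the buffer does not end mid-sequence.
 */
size_t utf8_incomplete_tail(const unsigned char *buf, size_t len) {
    size_t back;

    /* A sequence has at most 4 bytes, so the lead byte of an
     * incomplete one lies within the last 3 bytes. */
    for (back = 1; back <= 3 && back <= len; back++) {
        unsigned char c = buf[len - back];
        size_t need;

        if ((c & 0xC0) == 0x80)
            continue; /* continuation byte: keep looking for the lead */
        if ((c & 0xE0) == 0xC0)
            need = 2;
        else if ((c & 0xF0) == 0xE0)
            need = 3;
        else if ((c & 0xF8) == 0xF0)
            need = 4;
        else
            return 0; /* ASCII or invalid lead byte: nothing pending */
        return (back < need) ? back : 0;
    }
    return 0;
}
```

A caller could hold back the reported number of bytes and prepend them to the next chunk, instead of treating the truncated sequence as an error.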
Nick Wellnhofer
6bb2ea8e70
html: Adjust xmlDetectEncoding for HTML
Don't check for UTF-32 or EBCDIC.
We now perform BOM sniffing and the first step of the HTML5 prescan
algorithm (detect UTF-16 XML declarations). The rest of the algorithm
still has to be implemented.
2025-02-02 11:15:44 +01:00
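The two detection steps named in 6bb2ea8e70 (BOM sniffing, then the UTF-16 XML declaration check from the HTML5 prescan) can be sketched as follows. This is an illustration under assumptions, not the code in xmlDetectEncoding: the enum and function names are hypothetical. The byte patterns are the standard BOMs and the UTF-16 encodings of "<?x". In line with the commit, there are no UTF-32 or EBCDIC checks.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
typedef enum {
    SNIFF_NONE = 0,
    SNIFF_UTF8,
    SNIFF_UTF16LE,
    SNIFF_UTF16BE
} SniffedEncoding;

SniffedEncoding sniff_html_encoding(const unsigned char *in, size_t len) {
    /* "<?x" encoded as UTF-16 without a BOM. */
    static const unsigned char decl_le[6] = {0x3C, 0x00, 0x3F, 0x00, 0x78, 0x00};
    static const unsigned char decl_be[6] = {0x00, 0x3C, 0x00, 0x3F, 0x00, 0x78};

    /* Step 1: byte order marks. */
    if (len >= 3 && in[0] == 0xEF && in[1] == 0xBB && in[2] == 0xBF)
        return SNIFF_UTF8;
    if (len >= 2 && in[0] == 0xFE && in[1] == 0xFF)
        return SNIFF_UTF16BE;
    if (len >= 2 && in[0] == 0xFF && in[1] == 0xFE)
        return SNIFF_UTF16LE;

    /* Step 2: start of a UTF-16 XML declaration. */
    if (len >= 6) {
        if (memcmp(in, decl_le, 6) == 0)
            return SNIFF_UTF16LE;
        if (memcmp(in, decl_be, 6) == 0)
            return SNIFF_UTF16BE;
    }

    /* The rest of the prescan (meta elements etc.) is not covered. */
    return SNIFF_NONE;
}
```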
Nick Wellnhofer
227d8f739b
html: Support encoding auto-detection in push parser
Align with pull parser.
2025-02-02 11:15:44 +01:00
Nick Wellnhofer
641fb1acf5
html: Fix state update in push parser
2025-02-02 11:15:44 +01:00
Nick Wellnhofer
a86a8ae922
html: Fix push-parsing of empty documents
Also simplify end-of-document handling in push parser.
Align with pull parser.
2025-02-02 11:15:44 +01:00
Nick Wellnhofer
ca81916023
include: Use intptr_t to cast between pointers and ints
2025-01-03 20:59:10 +01:00
Nick Wellnhofer
53c131f667
doc: Make apibuild.py work again
2024-12-26 20:29:58 +01:00
Nick Wellnhofer
0447275ef8
html: Check reallocations for overflow
2024-12-21 19:37:37 +01:00
Nick Wellnhofer
6548ba11b8
parser: Fix argument checks in xmlCtxtParse*
- Raise invalid argument error.
- Free input stream if ctxt is NULL.
2024-12-13 17:57:11 +01:00
Nick Wellnhofer
497081baab
parser: Remove remaining calls to xml{Push|Pop}Input
2024-11-19 00:25:23 +01:00
Nick Wellnhofer
0f4f89005d
parser: Rename inputPush to xmlCtxtPushInput
2024-11-19 00:25:23 +01:00
Nick Wellnhofer
225ed70737
html: Accelerate htmlParseCharData
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
207999793f
html: Handle numeric character references directly
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
0bc4608c50
html: Use hash table to check for duplicate attributes
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
24a6149fc4
html: Make sure that character data mode is reset
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
c32397d51f
html: Improve character class macros
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
e840655414
html: Rewrite parsing of most data
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
f77ec16db0
html: Optimize htmlParseCharData
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
440bd64c69
html: Optimize htmlParseHTMLName
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
6040785ac4
html: Deprecate AutoClose API
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
188cad68a4
html: Remove obsolete content model
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
0144f662d7
html: Remove obsolete code
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
575be6c1f1
html: Fix line numbers with CRs
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
be874d7831
html: Ignore unexpected DOCTYPE declarations
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
462bf0b7a5
html: Rework options
Introduce htmlCtxtSetOptions, see similar changes made to XML parser.
Add HTML_PARSE_HUGE alias. Support HTML_PARSE_BIG_LINES.
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
42c3823df0
html: Update comment
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
9f04cce695
html: Remove unused or useless return codes
htmlParseStartTag should always succeed (except for malloc failures).
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
e179f3ec0e
html: Stop reporting syntax errors
It doesn't make much sense to keep the old syntax error handling which
doesn't conform to HTML5.
Handling HTML5 parser errors is rather involved and not essential for
parsers.
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
27752f75ca
html: Fix EOF handling in start tags
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
b19d353970
html: Fix EOF handling in comments
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
17e56ac54a
html: Fix parsing of end tags
2024-10-06 18:13:05 +02:00