12732592ef
html: Remove unused epilog state
2025-02-02 11:15:45 +01:00
70bf754e24
html: Fix pull-parsing of incomplete end tags
...
Handle this HTML5 quirk in htmlParseEndTag.
2025-02-02 11:15:45 +01:00
4a776c78ec
html: Use htmlParseElementInternal in push parser
2025-02-02 11:15:45 +01:00
ba1537374b
html: Fix corner case when push-parsing HTML5 comments
2025-02-02 11:15:45 +01:00
e48fb5e4f2
html: Handle incomplete UTF-8 when push-parsing
...
For now, incomplete UTF-8 is always an error in push mode.
Eventually, we could pass chunked data to the character handler when
push-parsing. Then we'd have to handle incomplete sequences.
2025-02-02 11:15:45 +01:00
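The "incomplete sequences" mentioned above are multi-byte UTF-8 characters cut off at a chunk boundary. The following is a minimal sketch of how such a truncated tail could be detected. It is not taken from the libxml2 sources, and the helper name `incompleteTailLen` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not libxml2 API): return the number of bytes at
 * the end of buf that start a UTF-8 sequence but do not complete it.
 * Returns 0 if the buffer ends on a character boundary or if the tail
 * is malformed rather than merely truncated.
 */
size_t
incompleteTailLen(const unsigned char *buf, size_t len) {
    size_t i;

    assert(buf != NULL || len == 0);

    /* A lead byte can be at most three bytes before the end. */
    for (i = 1; i <= 3 && i <= len; i++) {
        unsigned char c = buf[len - i];
        size_t need;

        if ((c & 0xC0) == 0x80)
            continue;               /* continuation byte, look further back */

        if (c < 0x80)
            need = 1;               /* ASCII */
        else if ((c & 0xE0) == 0xC0)
            need = 2;
        else if ((c & 0xF0) == 0xE0)
            need = 3;
        else if ((c & 0xF8) == 0xF0)
            need = 4;
        else
            return 0;               /* invalid lead byte, not a truncation */

        /* i bytes are available, starting at the lead byte. */
        return (need > i) ? i : 0;
    }

    return 0;
}
```

A caller that wanted to pass chunked data on could hold back the returned number of bytes and prepend them to the next chunk. According to the commit message, the parser currently treats this case as an error instead.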
6bb2ea8e70
html: Adjust xmlDetectEncoding for HTML
...
Don't check for UTF-32 or EBCDIC.
We now perform BOM sniffing and the first step of the HTML5 prescan
algorithm (detect UTF-16 XML declarations). The rest of the algorithm
still has to be implemented.
2025-02-02 11:15:44 +01:00
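As an illustration of the BOM sniffing step only, here is a minimal sketch that checks for the three byte order marks relevant to HTML and deliberately skips UTF-32 and EBCDIC. The enum and the function name `sniffBom` are hypothetical and not part of libxml2. The UTF-16 XML declaration check from the prescan algorithm is not covered.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch. */
typedef enum {
    ENC_NONE,
    ENC_UTF8,
    ENC_UTF16LE,
    ENC_UTF16BE
} SniffedEncoding;

/*
 * Hypothetical helper (not libxml2 API): detect a byte order mark at
 * the start of the input. Only UTF-8 and UTF-16 are considered.
 */
SniffedEncoding
sniffBom(const unsigned char *in, size_t len) {
    assert(in != NULL || len == 0);

    if (len >= 3 && in[0] == 0xEF && in[1] == 0xBB && in[2] == 0xBF)
        return ENC_UTF8;
    if (len >= 2 && in[0] == 0xFE && in[1] == 0xFF)
        return ENC_UTF16BE;
    if (len >= 2 && in[0] == 0xFF && in[1] == 0xFE)
        return ENC_UTF16LE;

    return ENC_NONE;
}
```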
227d8f739b
html: Support encoding auto-detection in push parser
...
Align with pull parser.
2025-02-02 11:15:44 +01:00
641fb1acf5
html: Fix state update in push parser
2025-02-02 11:15:44 +01:00
a86a8ae922
html: Fix push-parsing of empty documents
...
Also simplify end-of-document handling in push parser.
Align with pull parser.
2025-02-02 11:15:44 +01:00
ca81916023
include: Use intptr_t to cast between pointers and ints
2025-01-03 20:59:10 +01:00
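For readers unfamiliar with the idiom, a minimal sketch of casting through `intptr_t` follows. The helper names are hypothetical and do not come from the libxml2 headers.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helpers (not libxml2 API): round-trip a small integer
 * through a void pointer, for example in a callback's user-data slot.
 */
void *
intToPtr(int value) {
    /* Convert to the pointer-sized integer type before casting to a pointer. */
    return (void *) (intptr_t) value;
}

int
ptrToInt(void *ptr) {
    intptr_t v = (intptr_t) ptr;

    /* Only values that were stored with intToPtr are expected here. */
    assert(v >= INT_MIN && v <= INT_MAX);
    return (int) v;
}
```

Going through `intptr_t` avoids compiler warnings about casts between a pointer and an integer of a different size, which a direct cast between `int` and `void *` triggers on typical 64-bit platforms.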
53c131f667
doc: Make apibuild.py work again
2024-12-26 20:29:58 +01:00
0447275ef8
html: Check reallocations for overflow
2024-12-21 19:37:37 +01:00
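The general technique of checking a buffer growth for overflow can be sketched as follows. This is an illustration under assumptions, not the code from this commit. The helper name `growArray` and the initial capacity of 16 are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not libxml2 API): double the capacity of an
 * array whose elements are elemSize bytes each.
 * Returns 0 on success, -1 on overflow or allocation failure.
 * On failure, *buf and *cap are left unchanged.
 */
int
growArray(void **buf, size_t *cap, size_t elemSize) {
    size_t newCap;
    void *tmp;

    assert(buf != NULL && cap != NULL && elemSize > 0);

    if (*cap == 0) {
        newCap = 16;                    /* arbitrary initial capacity */
    } else {
        if (*cap > SIZE_MAX / 2)
            return -1;                  /* doubling would overflow */
        newCap = *cap * 2;
    }

    if (newCap > SIZE_MAX / elemSize)
        return -1;                      /* byte count would overflow */

    tmp = realloc(*buf, newCap * elemSize);
    if (tmp == NULL)
        return -1;

    *buf = tmp;
    *cap = newCap;
    return 0;
}
```

Both checks run before any multiplication, so the arithmetic never wraps around.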
6548ba11b8
parser: Fix argument checks in xmlCtxtParse*
...
- Raise invalid argument error.
- Free input stream if ctxt is NULL.
2024-12-13 17:57:11 +01:00
497081baab
parser: Remove remaining calls to xml{Push|Pop}Input
2024-11-19 00:25:23 +01:00
0f4f89005d
parser: Rename inputPush to xmlCtxtPushInput
2024-11-19 00:25:23 +01:00
225ed70737
html: Accelerate htmlParseCharData
2024-10-06 20:04:00 +02:00
207999793f
html: Handle numeric character references directly
2024-10-06 20:04:00 +02:00
0bc4608c50
html: Use hash table to check for duplicate attributes
2024-10-06 20:04:00 +02:00
24a6149fc4
html: Make sure that character data mode is reset
2024-10-06 20:04:00 +02:00
c32397d51f
html: Improve character class macros
2024-10-06 20:04:00 +02:00
e840655414
html: Rewrite parsing of most data
2024-10-06 20:04:00 +02:00
f77ec16db0
html: Optimize htmlParseCharData
2024-10-06 20:04:00 +02:00
440bd64c69
html: Optimize htmlParseHTMLName
2024-10-06 20:04:00 +02:00
6040785ac4
html: Deprecate AutoClose API
2024-10-06 20:04:00 +02:00
188cad68a4
html: Remove obsolete content model
2024-10-06 20:04:00 +02:00
0144f662d7
html: Remove obsolete code
2024-10-06 20:04:00 +02:00
575be6c1f1
html: Fix line numbers with CRs
2024-10-06 20:04:00 +02:00
be874d7831
html: Ignore unexpected DOCTYPE declarations
2024-10-06 20:04:00 +02:00
462bf0b7a5
html: Rework options
...
Introduce htmlCtxtSetOptions, see similar changes made to XML parser.
Add HTML_PARSE_HUGE alias. Support HTML_PARSE_BIG_LINES.
2024-10-06 20:04:00 +02:00
42c3823df0
html: Update comment
2024-10-06 20:04:00 +02:00
9f04cce695
html: Remove unused or useless return codes
...
htmlParseStartTag should always succeed (except for malloc failures).
2024-10-06 20:04:00 +02:00
e179f3ec0e
html: Stop reporting syntax errors
...
It doesn't make much sense to keep the old syntax error handling which
doesn't conform to HTML5.
Handling HTML5 parser errors is rather involved and not essential for
parsers.
2024-10-06 20:04:00 +02:00
27752f75ca
html: Fix EOF handling in start tags
2024-10-06 18:13:05 +02:00
b19d353970
html: Fix EOF handling in comments
2024-10-06 18:13:05 +02:00
17e56ac54a
html: Fix parsing of end tags
2024-10-06 18:13:05 +02:00
24a09033c9
html: Fix bogus end tags
2024-10-06 18:13:05 +02:00
bca6485476
html: Allow U+000C FORM FEED as whitespace
2024-10-06 18:13:05 +02:00
6edf1a645e
html: Fix DOCTYPE parsing
2024-10-06 18:13:05 +02:00
9678163f54
html: Don't check for valid XML characters
2024-10-06 18:13:05 +02:00
a6955c13c7
html: Parse numeric character references according to HTML5
2024-10-06 18:13:05 +02:00
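For context, the HTML5 rules for the value of a numeric character reference can be sketched as below: zero, surrogates, and values above U+10FFFF become U+FFFD, and most code points in the range 0x80 to 0x9F are remapped as if they were Windows-1252 bytes. This is a standalone illustration, not the code from this commit, and the function name `fixupCharRef` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Replacements for numeric references 0x80..0x9F as listed in the
 * HTML5 tokenizer specification. 0 means "keep the value unchanged".
 */
static const unsigned short c1Replacements[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/*
 * Hypothetical helper (not libxml2 API): map the parsed value of a
 * numeric character reference to the code point HTML5 emits for it.
 */
unsigned long
fixupCharRef(unsigned long val) {
    if (val == 0 || val > 0x10FFFF)
        return 0xFFFD;                  /* null or out of range */
    if (val >= 0xD800 && val <= 0xDFFF)
        return 0xFFFD;                  /* surrogate */

    if (val >= 0x80 && val <= 0x9F) {
        size_t idx = (size_t) (val - 0x80);
        unsigned short repl;

        assert(idx < sizeof(c1Replacements) / sizeof(c1Replacements[0]));
        repl = c1Replacements[idx];
        if (repl != 0)
            return repl;
    }

    return val;
}
```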
4eeac30944
html: Start to fix EOF and U+0000 handling
2024-10-06 18:13:05 +02:00
e062a4a9b3
html: Add HTML5 parser option
...
This option passes tokenizer output directly to the SAX callbacks,
making it possible to test the tokenizer against the html5lib test
suite.
This will produce unbalanced calls to the startElement and endElement
callbacks, but it's the only way to support a SAX-like interface for
HTML5. It can be used for filtering or rewriting HTML5, for example.
An HTML5 tree builder could then be implemented on top of the SAX
callbacks.
2024-10-06 18:13:05 +02:00
17da54c522
html: Normalize newlines
2024-10-06 18:13:05 +02:00
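HTML5 newline normalization replaces each CR LF pair and each lone CR with a single LF. A minimal in-place sketch over a complete buffer follows. It is not the code from this commit, and the function name `normalizeNewlines` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not libxml2 API): rewrite CR LF and lone CR to
 * LF in place. Returns the new length, which never exceeds len.
 * Assumes the whole input is available. A chunked parser would also
 * have to handle a CR at the very end of one chunk.
 */
size_t
normalizeNewlines(char *buf, size_t len) {
    size_t in, out = 0;

    assert(buf != NULL || len == 0);

    for (in = 0; in < len; in++) {
        if (buf[in] == '\r') {
            buf[out++] = '\n';
            /* Swallow the LF of a CR LF pair. */
            if (in + 1 < len && buf[in + 1] == '\n')
                in++;
        } else {
            buf[out++] = buf[in];
        }
    }

    return out;
}
```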
341dc78f24
html: Deduplicate code in htmlCurrentChar
2024-10-06 18:13:05 +02:00
3adb396d87
html: Parse bogus comments instead of ignoring them
...
Also treat XML processing instructions as bogus comments.
2024-10-06 18:13:05 +02:00
8444017578
html: Add missing calls to htmlCheckParagraph()
2024-10-06 18:13:05 +02:00
86d6b9b051
html: Deduplicate some code
2024-10-06 18:13:05 +02:00
0d324bde36
html: Simplify node info accounting
2024-10-06 18:13:05 +02:00
ccb61f599e
html: Remove duplicate calls to htmlAutoClose
2024-10-06 18:13:05 +02:00
f9ed30e972
html: HTML5 character data states
2024-10-06 18:13:05 +02:00