mirror of https://gitlab.gnome.org/GNOME/libxml2.git synced 2025-07-08 23:22:04 +03:00

553 Commits

Author SHA1 Message Date
12732592ef html: Remove unused epilog state 2025-02-02 11:15:45 +01:00
70bf754e24 html: Fix pull-parsing of incomplete end tags
Handle this HTML5 quirk in htmlParseEndTag.
2025-02-02 11:15:45 +01:00
4a776c78ec html: Use htmlParseElementInternal in push parser 2025-02-02 11:15:45 +01:00
ba1537374b html: Fix corner case when push-parsing HTML5 comments 2025-02-02 11:15:45 +01:00
e48fb5e4f2 html: Handle incomplete UTF-8 when push-parsing
For now, incomplete UTF-8 is always an error in push mode.

Eventually, we could pass chunked data to the character handler when
push-parsing. Then we'd have to handle incomplete sequences.
2025-02-02 11:15:45 +01:00
6bb2ea8e70 html: Adjust xmlDetectEncoding for HTML
Don't check for UTF-32 or EBCDIC.

We now perform BOM sniffing and the first step of the HTML5 prescan
algorithm (detect UTF-16 XML declarations). The rest of the algorithm
still has to be implemented.
2025-02-02 11:15:44 +01:00
227d8f739b html: Support encoding auto-detection in push parser
Align with pull parser.
2025-02-02 11:15:44 +01:00
641fb1acf5 html: Fix state update in push parser 2025-02-02 11:15:44 +01:00
a86a8ae922 html: Fix push-parsing of empty documents
Also simplify end-of-document handling in push parser.

Align with pull parser.
2025-02-02 11:15:44 +01:00
ca81916023 include: Use intptr_t to cast between pointers and ints 2025-01-03 20:59:10 +01:00
53c131f667 doc: Make apibuild.py work again 2024-12-26 20:29:58 +01:00
0447275ef8 html: Check reallocations for overflow 2024-12-21 19:37:37 +01:00
6548ba11b8 parser: Fix argument checks in xmlCtxtParse*
- Raise invalid argument error.
- Free input stream if ctxt is NULL.
2024-12-13 17:57:11 +01:00
497081baab parser: Remove remaining calls to xml{Push|Pop}Input 2024-11-19 00:25:23 +01:00
0f4f89005d parser: Rename inputPush to xmlCtxtPushInput 2024-11-19 00:25:23 +01:00
225ed70737 html: Accelerate htmlParseCharData 2024-10-06 20:04:00 +02:00
207999793f html: Handle numeric character references directly 2024-10-06 20:04:00 +02:00
0bc4608c50 html: Use hash table to check for duplicate attributes 2024-10-06 20:04:00 +02:00
24a6149fc4 html: Make sure that character data mode is reset 2024-10-06 20:04:00 +02:00
c32397d51f html: Improve character class macros 2024-10-06 20:04:00 +02:00
e840655414 html: Rewrite parsing of most data 2024-10-06 20:04:00 +02:00
f77ec16db0 html: Optimize htmlParseCharData 2024-10-06 20:04:00 +02:00
440bd64c69 html: Optimize htmlParseHTMLName 2024-10-06 20:04:00 +02:00
6040785ac4 html: Deprecate AutoClose API 2024-10-06 20:04:00 +02:00
188cad68a4 html: Remove obsolete content model 2024-10-06 20:04:00 +02:00
0144f662d7 html: Remove obsolete code 2024-10-06 20:04:00 +02:00
575be6c1f1 html: Fix line numbers with CRs 2024-10-06 20:04:00 +02:00
be874d7831 html: Ignore unexpected DOCTYPE declarations 2024-10-06 20:04:00 +02:00
462bf0b7a5 html: Rework options
Introduce htmlCtxtSetOptions, see similar changes made to XML parser.

Add HTML_PARSE_HUGE alias. Support HTML_PARSE_BIG_LINES.
2024-10-06 20:04:00 +02:00
42c3823df0 html: Update comment 2024-10-06 20:04:00 +02:00
9f04cce695 html: Remove unused or useless return codes
htmlParseStartTag should always succeed (except for malloc failures).
2024-10-06 20:04:00 +02:00
e179f3ec0e html: Stop reporting syntax errors
It doesn't make much sense to keep the old syntax error handling which
doesn't conform to HTML5.

Handling HTML5 parser errors is rather involved and not essential for
parsers.
2024-10-06 20:04:00 +02:00
27752f75ca html: Fix EOF handling in start tags 2024-10-06 18:13:05 +02:00
b19d353970 html: Fix EOF handling in comments 2024-10-06 18:13:05 +02:00
17e56ac54a html: Fix parsing of end tags 2024-10-06 18:13:05 +02:00
24a09033c9 html: Fix bogus end tags 2024-10-06 18:13:05 +02:00
bca6485476 html: Allow U+000C FORM FEED as whitespace 2024-10-06 18:13:05 +02:00
6edf1a645e html: Fix DOCTYPE parsing 2024-10-06 18:13:05 +02:00
9678163f54 html: Don't check for valid XML characters 2024-10-06 18:13:05 +02:00
a6955c13c7 html: Parse numeric character references according to HTML5 2024-10-06 18:13:05 +02:00
4eeac30944 html: Start to fix EOF and U+0000 handling 2024-10-06 18:13:05 +02:00
e062a4a9b3 html: Add HTML5 parser option
This option passes tokenizer output directly to the SAX callbacks,
making it possible to test the tokenizer against the html5lib test
suite.

This will produce unbalanced calls to the startElement and endElement
callbacks, but it's the only way to support a SAX-like interface for
HTML5. It can be used for filtering or rewriting HTML5, for example.

An HTML5 tree builder could then be implemented on top of the SAX
callbacks.
2024-10-06 18:13:05 +02:00
17da54c522 html: Normalize newlines 2024-10-06 18:13:05 +02:00
341dc78f24 html: Deduplicate code in htmlCurrentChar 2024-10-06 18:13:05 +02:00
3adb396d87 html: Parse bogus comments instead of ignoring them
Also treat XML processing instructions as bogus comments.
2024-10-06 18:13:05 +02:00
8444017578 html: Add missing calls to htmlCheckParagraph() 2024-10-06 18:13:05 +02:00
86d6b9b051 html: Deduplicate some code 2024-10-06 18:13:05 +02:00
0d324bde36 html: Simplify node info accounting 2024-10-06 18:13:05 +02:00
ccb61f599e html: Remove duplicate calls to htmlAutoClose 2024-10-06 18:13:05 +02:00
f9ed30e972 html: HTML5 character data states 2024-10-06 18:13:05 +02:00
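
Note on e48fb5e4f2 (incomplete UTF-8 when push-parsing): the commit message says incomplete sequences are an error for now, and that passing chunked data to the character handler would require handling them. As a rough illustration of what that handling might involve, the sketch below counts how many trailing bytes of a chunk belong to a UTF-8 sequence that has not finished yet, so a caller could hold them back until the next chunk arrives. The function name incompleteUtf8Tail is hypothetical and is not part of libxml2; this is not the parser's actual code.

```c
#include <stddef.h>

/*
 * Hypothetical helper, NOT a libxml2 function.
 *
 * Returns the number of bytes at the end of buf that start a UTF-8
 * sequence which is still missing continuation bytes, or 0 if the
 * buffer ends on a sequence boundary. Invalid input is not diagnosed
 * here; it is left for the decoder to reject.
 */
static size_t
incompleteUtf8Tail(const unsigned char *buf, size_t len) {
    size_t i;

    /* A lead byte is followed by at most 3 continuation bytes,
     * so looking back 3 bytes is enough. */
    for (i = 1; i <= 3 && i <= len; i++) {
        unsigned char c = buf[len - i];
        size_t need;

        if (c < 0x80)
            return 0;               /* ASCII byte: clean boundary */
        if ((c & 0xC0) == 0x80)
            continue;               /* continuation byte: look further back */

        /* Lead byte: work out the total sequence length it announces. */
        if ((c & 0xE0) == 0xC0)
            need = 2;
        else if ((c & 0xF0) == 0xE0)
            need = 3;
        else if ((c & 0xF8) == 0xF0)
            need = 4;
        else
            return 0;               /* invalid lead byte: leave to decoder */

        /* i bytes of the sequence are present; incomplete if fewer
         * than the lead byte asks for. */
        return (i < need) ? i : 0;
    }

    return 0;
}
```

A caller feeding chunks to a character handler could pass on len minus this value and prepend the held-back bytes to the next chunk. Whether libxml2 will take this approach is not stated in the log.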