libxml2

Mirror of https://gitlab.gnome.org/GNOME/libxml2.git, synced 2026-01-26 21:41:34 +03:00

Author	SHA1	Message	Date
Nick Wellnhofer	8873a49846	html: Fix areBlanks check Short-lived regression from `71122421`.	2025-03-09 16:21:13 +01:00
Nick Wellnhofer	5f0b1378d7	parser: Add more parser context accessors Fixes #763.	2025-03-08 22:36:06 +01:00
Nick Wellnhofer	5237d90fae	html: Process data before switching encoding This reduces the amount of data to convert and avoids issues with EOF detection. Also reset EOF flag after switching encoding as a precaution.	2025-03-07 21:19:16 +01:00
Nick Wellnhofer	0b27097a92	encoding: Rename unprefixed public functions	2025-03-04 16:46:53 +01:00
Nick Wellnhofer	5ed4eafd8a	html: Don't invoke SAX callbacks if parser was stopped	2025-02-22 14:52:47 +01:00
Nick Wellnhofer	63dfcca670	fuzz: Reduce initial array size	2025-02-20 12:22:12 +01:00
Nick Wellnhofer	b8234e8c73	html: Fix check for partial named character references Digits are allowed after the first character.	2025-02-19 12:53:32 +01:00
Nick Wellnhofer	7a61c32bfa	html: Use enum instead of magic values for insertion modes	2025-02-17 11:41:57 +01:00
Nick Wellnhofer	8cf6129bbd	html: Stop implying <p> start tags Only <html>, <head> or <body> should be implied. Opening extra <p> tags has always been a libxml2 quirk.	2025-02-13 20:20:17 +01:00
Nick Wellnhofer	71122421a1	html: Make implied <p> tags more deterministic libxml2's HTML parser adds <p> start tags in some situations. This behavior, which doesn't follow any standard, was added in 2000, see here: http://veillard.com/XML/messages/0655.html Text nodes that only contain whitespace don't imply a <p> tag, but the whitespace check cannot work reliably if we're parsing partial text data which can happen with both pull and push parser. The logic in `areBlanks` is hard to follow. The checks involving `CUR` depend on the position of the input pointer and seem dubious. It's also possible that the behavior changed inadvertently with a later commit. As a result, it's hard to come up with good test cases. We now process leading whitespace before creating implied tags. This is more in line with HTML5 and should avoid at least some issues with partial text data. For example, parsing the string "<head> x" used to result in: <html> <head></head> <body><p> x</p></body> </html> And now results in: <html> <head> </head> <body><p>x</p></body> </html> Except for the implied <p> tag, this matches HTML5.	2025-02-13 14:31:44 +01:00
Nick Wellnhofer	8d7e38d536	fuzz: Ignore encodings when fuzzing on Apple Not long ago, Apple decided to replace GNU libiconv with a patched up version of FreeBSD's iconv implementation in their operating systems. Unfortunately, the quality of both the original implementation as well as Apple's patches is so abysmal that you routinely find issues when fuzzing your own code.	2025-02-02 11:15:45 +01:00
Nick Wellnhofer	68be036f29	fuzz: Disable HTML encoding detection for now This doesn't work with the push parser.	2025-02-02 11:15:45 +01:00
Nick Wellnhofer	c13fcc1910	html: Chunk text data in push parser Follow the logic of the XML parser and chunk large text nodes.	2025-02-02 11:15:45 +01:00
Nick Wellnhofer	080285724b	html: Make data parsing modes work with push parser This can't be solved with a simple scan for a terminator. Instead, we make htmlParseCharData handle incomplete data if the "partial" flag is set.	2025-02-02 11:15:45 +01:00
Nick Wellnhofer	4be1e8befb	html: Simplify htmlParseTryOrFinish a little	2025-02-02 11:15:45 +01:00
Nick Wellnhofer	12732592ef	html: Remove unused epilog state	2025-02-02 11:15:45 +01:00
Nick Wellnhofer	70bf754e24	html: Fix pull-parsing of incomplete end tags Handle this HTML5 quirk in htmlParseEndTag.	2025-02-02 11:15:45 +01:00
Nick Wellnhofer	4a776c78ec	html: Use htmlParseElementInternal in push parser	2025-02-02 11:15:45 +01:00
Nick Wellnhofer	ba1537374b	html: Fix corner case when push-parsing HTML5 comments	2025-02-02 11:15:45 +01:00
Nick Wellnhofer	e48fb5e4f2	html: Handle incomplete UTF-8 when push-parsing For now, incomplete UTF-8 is always an error in push mode. Eventually, we could pass chunked data to the character handler when push-parsing. Then we'd have to handle incomplete sequences.	2025-02-02 11:15:45 +01:00
Nick Wellnhofer	6bb2ea8e70	html: Adjust xmlDetectEncoding for HTML Don't check for UTF-32 or EBCDIC. We now perform BOM sniffing and the first step of the HTML5 prescan algorithm (detect UTF-16 XML declarations). The rest of the algorithm still has to be implemented.	2025-02-02 11:15:44 +01:00
Nick Wellnhofer	227d8f739b	html: Support encoding auto-detection in push parser Align with pull parser.	2025-02-02 11:15:44 +01:00
Nick Wellnhofer	641fb1acf5	html: Fix state update in push parser	2025-02-02 11:15:44 +01:00
Nick Wellnhofer	a86a8ae922	html: Fix push-parsing of empty documents Also simplify end-of-document handling in push parser. Align with pull parser.	2025-02-02 11:15:44 +01:00
Nick Wellnhofer	ca81916023	include: Use intptr_t to cast between pointers and ints	2025-01-03 20:59:10 +01:00
Nick Wellnhofer	53c131f667	doc: Make apibuild.py work again	2024-12-26 20:29:58 +01:00
Nick Wellnhofer	0447275ef8	html: Check reallocations for overflow	2024-12-21 19:37:37 +01:00
Nick Wellnhofer	6548ba11b8	parser: Fix argument checks in xmlCtxtParse* - Raise invalid argument error. - Free input stream if ctxt is NULL.	2024-12-13 17:57:11 +01:00
Nick Wellnhofer	497081baab	parser: Remove remaining calls to xml{Push|Pop}Input	2024-11-19 00:25:23 +01:00
Nick Wellnhofer	0f4f89005d	parser: Rename inputPush to xmlCtxtPushInput	2024-11-19 00:25:23 +01:00
Nick Wellnhofer	225ed70737	html: Accelerate htmlParseCharData	2024-10-06 20:04:00 +02:00
Nick Wellnhofer	207999793f	html: Handle numeric character references directly	2024-10-06 20:04:00 +02:00
Nick Wellnhofer	0bc4608c50	html: Use hash table to check for duplicate attributes	2024-10-06 20:04:00 +02:00
Nick Wellnhofer	24a6149fc4	html: Make sure that character data mode is reset	2024-10-06 20:04:00 +02:00
Nick Wellnhofer	c32397d51f	html: Improve character class macros	2024-10-06 20:04:00 +02:00
Nick Wellnhofer	e840655414	html: Rewrite parsing of most data	2024-10-06 20:04:00 +02:00
Nick Wellnhofer	f77ec16db0	html: Optimize htmlParseCharData	2024-10-06 20:04:00 +02:00
Nick Wellnhofer	440bd64c69	html: Optimize htmlParseHTMLName	2024-10-06 20:04:00 +02:00
Nick Wellnhofer	6040785ac4	html: Deprecate AutoClose API	2024-10-06 20:04:00 +02:00
Nick Wellnhofer	188cad68a4	html: Remove obsolete content model	2024-10-06 20:04:00 +02:00
Nick Wellnhofer	0144f662d7	html: Remove obsolete code	2024-10-06 20:04:00 +02:00
Nick Wellnhofer	575be6c1f1	html: Fix line numbers with CRs	2024-10-06 20:04:00 +02:00
Nick Wellnhofer	be874d7831	html: Ignore unexpected DOCTYPE declarations	2024-10-06 20:04:00 +02:00
Nick Wellnhofer	462bf0b7a5	html: Rework options Introduce htmlCtxtSetOptions, see similar changes made to XML parser. Add HTML_PARSE_HUGE alias. Support HTML_PARSE_BIG_LINES.	2024-10-06 20:04:00 +02:00
Nick Wellnhofer	42c3823df0	html: Update comment	2024-10-06 20:04:00 +02:00
Nick Wellnhofer	9f04cce695	html: Remove unused or useless return codes htmlParseStartTag should always succeed (except for malloc failures).	2024-10-06 20:04:00 +02:00
Nick Wellnhofer	e179f3ec0e	html: Stop reporting syntax errors It doesn't make much sense to keep the old syntax error handling which doesn't conform to HTML5. Handling HTML5 parser errors is rather involved and not essential for parsers.	2024-10-06 20:04:00 +02:00
Nick Wellnhofer	27752f75ca	html: Fix EOF handling in start tags	2024-10-06 18:13:05 +02:00
Nick Wellnhofer	b19d353970	html: Fix EOF handling in comments	2024-10-06 18:13:05 +02:00
Nick Wellnhofer	17e56ac54a	html: Fix parsing of end tags	2024-10-06 18:13:05 +02:00

518 Commits