libxml2

mirror of https://gitlab.gnome.org/GNOME/libxml2.git synced 2025-10-21 14:53:44 +03:00

Author	SHA1	Message	Date
Nick Wellnhofer	416da89d0b	html: Make htmlCtxtReset call xmlCtxtReset The two implementations shouldn't diverge.	2025-06-08 14:22:32 +02:00
Nick Wellnhofer	c6206c9387	html: Ignore ASCII-incompatible encoding in meta tag After successfully parsing an ASCII-encoded meta tag, switching to an encoding that isn't ASCII-compatible cannot work.	2025-06-05 22:24:50 +02:00
Nick Wellnhofer	6a6a46f017	doc: Fix autolink errors Fix links, remove links to internal functions.	2025-05-28 16:02:41 +02:00
Nick Wellnhofer	7bd8d1d9cc	doc: Prefix autolinks with '#' Use `#func` instead of `func()` to ignore parameters and make all autolinks work.	2025-05-28 16:01:52 +02:00
Nick Wellnhofer	c5b45fbc07	doc: Misc fixes	2025-05-16 19:04:20 +02:00
Nick Wellnhofer	6f4b452742	parser: Stop using ctxt->linenumbers I think this was used to avoid setting the `line` member before it was added (20+ years ago).	2025-05-16 18:03:12 +02:00
Nick Wellnhofer	258d870629	codegen: Consolidate tools for code generation Move tools, source files and output tables into codegen directory. Rename some files. Adjust tools to match modified files. Remove generation date and source files from output. Distribute all tools and sources.	2025-05-16 18:03:12 +02:00
Nick Wellnhofer	adfbeb7e08	doc: Stop using *Ptr typedefs in documentation	2025-05-16 18:03:12 +02:00
Nick Wellnhofer	a40f36e7f2	include: Stop using *Ptr typedefs in public headers	2025-05-16 18:03:12 +02:00
Nick Wellnhofer	2d83a84ca6	doc: Misc improvements	2025-05-16 18:03:12 +02:00
Nick Wellnhofer	f0983199e8	html: Map some encodings according to HTML5 Windows-1252 is a superset of ISO-8859-1 and should be used instead. Same for ASCII. Also map UCS-2 and UTF-16 to UTF-16LE.	2025-05-12 14:04:30 +02:00
Nick Wellnhofer	05b8fe0a06	html: Don't escape RAWTEXT and PLAINTEXT Align with HTML5.	2025-05-11 20:57:07 +02:00
Nick Wellnhofer	809ded586b	html: Add more empty elements Add empty HTML5 elements <bgsound>, <keygen>, <source>, <track> and <wbr>. Make <embed> an empty element.	2025-05-11 20:46:50 +02:00
Nick Wellnhofer	c7c4964342	html: Move DTD creation to endDocument SAX callback	2025-05-11 20:29:25 +02:00
Nick Wellnhofer	46f05ea4d5	html: Rework meta charset handling Don't use encoding from meta tags when serializing. Only use the value in `doc->encoding`, matching the XML serializer. This is the actual encoding used when parsing. Stop modifying the input document by setting meta tags before serializing. Meta tags are now injected during serialization. Add full support for <meta charset=""> which is also used when adding meta tags. Align with HTML5 and implement the "algorithm for extracting a character encoding from a meta element". Only modify the encoding substring in Content-Type meta tags. Only switch encoding once when parsing. Fix htmlSaveFileFormat with a NULL encoding not to declare a misleading UTF-8 charset. Fixes #909.	2025-05-11 20:29:25 +02:00
Nick Wellnhofer	f3a080bc48	html: Ignore U+0000 in body text Align with HTML5. Fixes #908.	2025-05-11 20:29:25 +02:00
Nick Wellnhofer	9bbffec568	doc: Move brief to top, params to bottom of doc comments	2025-05-06 19:51:38 +02:00
Nick Wellnhofer	b7274fb02f	doc: Misc fixes to HTML parser docs	2025-05-06 19:51:38 +02:00
Nick Wellnhofer	4a01087585	doc: Move parser option docs to enum	2025-05-06 19:51:38 +02:00
Nick Wellnhofer	cb1635a642	doc: Use @since command	2025-05-02 19:05:25 +02:00
Nick Wellnhofer	e78e05c990	doc: Fix autolinks to functions Unfortunately, autolinks in .c files aren't converted by Doxygen for some reason.	2025-05-02 17:45:31 +02:00
Nick Wellnhofer	f7c412874b	doc: Remove more comment block headers	2025-05-02 17:41:26 +02:00
Nick Wellnhofer	e525564f65	doc: Remove empty lines at start of block These lines were left over after automatic conversion.	2025-05-02 11:42:05 +02:00
Nick Wellnhofer	e549622bc5	doc: Convert documentation to Doxygen Automated conversion based on a few regexes.	2025-05-01 23:23:42 +02:00
Nick Wellnhofer	69879da88f	doc: Remove email addresses from documentation Also remove authorship information from generated files, hash.c and globals.c which were rewritten.	2025-05-01 23:23:42 +02:00
Nick Wellnhofer	61890e399d	doc: Prepare for conversion to Doxygen Fix many params in internal functions (not really necessary but Doxygen warns about that in XML mode). Fix formatting in a few corner cases that automatic conversion can't handle. Rearrange some DOC_DISABLE blocks.	2025-05-01 23:23:42 +02:00
Nick Wellnhofer	4ba1f9238a	html: Avoid HTML_PARSE_HTML5 clashing with XML_PARSE_NOENT There are several users that pass invalid XML parser options to the HTML parser. Choose a value that is less likely to clash.	2025-04-18 18:48:25 +02:00
Nick Wellnhofer	b8018afa4c	html: Fix documentation of parser options	2025-04-10 16:36:03 +02:00
Nick Wellnhofer	2ecc08f6dc	html: Deprecate more functions	2025-04-10 16:36:03 +02:00
Nick Wellnhofer	69b83bb68e	encoding: Detect truncated multi-byte sequences with ICU Unlike iconv or the internal converters, ICU consumes truncated multi-byte sequences at the end of an input buffer. We currently check for a non-empty raw input buffer to detect truncated sequences, so this fails with ICU. It might be possible to inspect the pivot buffer pointers, but it seems cleaner to implement a `flush` flag for some encoding and I/O functions. After flushing, we can check for U_TRUNCATED_CHAR_FOUND with ICU, or detect remaining input with other converters. Also fix detection of truncated sequences for HTML, XML content and DTDs with iconv.	2025-03-13 22:15:10 +01:00
Nick Wellnhofer	8873a49846	html: Fix areBlanks check Short-lived regression from `71122421`.	2025-03-09 16:21:13 +01:00
Nick Wellnhofer	5f0b1378d7	parser: Add more parser context accessors Fixes #763.	2025-03-08 22:36:06 +01:00
Nick Wellnhofer	5237d90fae	html: Process data before switching encoding This reduces the amount of data to convert and avoids issues with EOF detection. Also reset EOF flag after switching encoding as a precaution.	2025-03-07 21:19:16 +01:00
Nick Wellnhofer	0b27097a92	encoding: Rename unprefixed public functions	2025-03-04 16:46:53 +01:00
Nick Wellnhofer	5ed4eafd8a	html: Don't invoke SAX callbacks if parser was stopped	2025-02-22 14:52:47 +01:00
Nick Wellnhofer	63dfcca670	fuzz: Reduce initial array size	2025-02-20 12:22:12 +01:00
Nick Wellnhofer	b8234e8c73	html: Fix check for partial named character references Digits are allowed after the first character.	2025-02-19 12:53:32 +01:00
Nick Wellnhofer	7a61c32bfa	html: Use enum instead of magic values for insertion modes	2025-02-17 11:41:57 +01:00
Nick Wellnhofer	8cf6129bbd	html: Stop implying <p> start tags Only <html>, <head> or <body> should be implied. Opening extra <p> tags has always been a libxml2 quirk.	2025-02-13 20:20:17 +01:00
Nick Wellnhofer	71122421a1	html: Make implied <p> tags more deterministic libxml2's HTML parser adds <p> start tags in some situations. This behavior, which doesn't follow any standard, was added in 2000, see here: http://veillard.com/XML/messages/0655.html Text nodes that only contain whitespace don't imply a <p> tag, but the whitespace check cannot work reliably if we're parsing partial text data which can happen with both pull and push parser. The logic in `areBlanks` is hard to follow. The checks involving `CUR` depend on the position of the input pointer and seem dubious. It's also possible that the behavior changed inadvertently with a later commit. As a result, it's hard to come up with good test cases. We now process leading whitespace before creating implied tags. This is more in line with HTML5 and should avoid at least some issues with partial text data. For example, parsing the string "<head> x" used to result in: <html> <head></head> <body><p> x</p></body> </html> And now results in: <html> <head> </head> <body><p>x</p></body> </html> Except for the implied <p> tag, this matches HTML5.	2025-02-13 14:31:44 +01:00
Nick Wellnhofer	8d7e38d536	fuzz: Ignore encodings when fuzzing on Apple Not long ago, Apple decided to replace GNU libiconv with a patched up version of FreeBSD's iconv implementation in their operating systems. Unfortunately, the quality of both the original implementation as well as Apple's patches is so abysmal that you routinely find issues when fuzzing your own code.	2025-02-02 11:15:45 +01:00
Nick Wellnhofer	68be036f29	fuzz: Disable HTML encoding detection for now This doesn't work with the push parser.	2025-02-02 11:15:45 +01:00
Nick Wellnhofer	c13fcc1910	html: Chunk text data in push parser Follow the logic of the XML parser and chunk large text nodes.	2025-02-02 11:15:45 +01:00
Nick Wellnhofer	080285724b	html: Make data parsing modes work with push parser This can't be solved with a simple scan for a terminator. Instead, we make htmlParseCharData handle incomplete data if the "partial" flag is set.	2025-02-02 11:15:45 +01:00
Nick Wellnhofer	4be1e8befb	html: Simplify htmlParseTryOrFinish a little	2025-02-02 11:15:45 +01:00
Nick Wellnhofer	12732592ef	html: Remove unused epilog state	2025-02-02 11:15:45 +01:00
Nick Wellnhofer	70bf754e24	html: Fix pull-parsing of incomplete end tags Handle this HTML5 quirk in htmlParseEndTag.	2025-02-02 11:15:45 +01:00
Nick Wellnhofer	4a776c78ec	html: Use htmlParseElementInternal in push parser	2025-02-02 11:15:45 +01:00
Nick Wellnhofer	ba1537374b	html: Fix corner case when push-parsing HTML5 comments	2025-02-02 11:15:45 +01:00
Nick Wellnhofer	e48fb5e4f2	html: Handle incomplete UTF-8 when push-parsing For now, incomplete UTF-8 is always an error in push mode. Eventually, we could pass chunked data to the character handler when push-parsing. Then we'd have to handle incomplete sequences.	2025-02-02 11:15:45 +01:00

548 Commits