1
0
mirror of https://gitlab.gnome.org/GNOME/libxml2.git synced 2026-01-26 21:41:34 +03:00
Commit Graph

659 Commits

Author SHA1 Message Date
Nick Wellnhofer
ab06bfa1f6 parser: Fix error return in xmlParseElementContentDecl
Avoid internal error later in xmlValidBuildAContentModel after
2a60ca06c.

Also avoids some unnecessary error messages.
2025-05-26 16:51:59 +02:00
Nick Wellnhofer
5ec83f7741 valid: Remove duplicate #FIXED check for namespaces
Unlike the comment indicates, this is already checked.
2025-05-25 14:26:30 +02:00
Nick Wellnhofer
7c10fff265 valid: Don't validate twice in xmlAddAttributeDecl
This should only be done in xmlValidateAttributeDecl.
2025-05-25 14:26:30 +02:00
Nick Wellnhofer
2f3655c9c3 parser: Pop PEs that start markup declarations explicitly
We currently only handle "Validity constraint: Proper Declaration/PE
Nesting", but we must detect "Well-formedness constraint: PE Between
Declarations" separately:

> The replacement text of a parameter entity reference in a DeclSep must
> match the production extSubsetDecl.

PEs in DeclSeps are PEs that start with a full markup declaration (or
another PE). These are handled in xmParse{Internal|External}Subset. We
set a flag on these PEs and don't close them implicitly in
xmlSkipBlankCharsPE. This will make unterminated declarations in such
PEs cause a parser error. The PEs are closed explicitly in
xmParse{Internal|External}Subset, the only location where they are
allowed to end.
2025-05-25 14:26:30 +02:00
Nick Wellnhofer
dd1961e0d8 valid: Skip more validity checks if not validating 2025-05-25 14:26:30 +02:00
Nick Wellnhofer
3a68d0b7a8 SAX2: Handle xml:id errors separately 2025-05-19 20:07:54 +02:00
Nick Wellnhofer
87087def4e tests: Remove result files committed by accident 2025-05-13 23:00:51 +02:00
Nick Wellnhofer
f0983199e8 html: Map some encodings according to HTML5
Windows-1252 is a superset of ISO-8859-1 and should be used instead.
Same for ASCII.

Also map UCS-2 and UTF-16 to UTF-16LE.
2025-05-12 14:04:30 +02:00
Nick Wellnhofer
825f3a9d0c html: Always serialize attributes with double quotes
Align with HTML5.
2025-05-11 21:42:51 +02:00
Nick Wellnhofer
cdaf657ffb html: Don't escape < and > when serializing attribute values
Align with HTML5.

This will break some test suites.
2025-05-11 20:29:25 +02:00
Nick Wellnhofer
c8cea39d8a save: Fix serialization of attribute defaults containing &lt;
Long-standing bug that produced invalid XML.
2025-05-11 20:29:25 +02:00
Nick Wellnhofer
46f05ea4d5 html: Rework meta charset handling
Don't use encoding from meta tags when serializing. Only use the value
in `doc->encoding`, matching the XML serializer. This is the actual
encoding used when parsing.

Stop modifying the input document by setting meta tags before
serializing. Meta tags are now injected during serialization.

Add full support for <meta charset=""> which is also used when adding
meta tags.

Align with HTML5 and implement the "algorithm for extracting a character
encoding from a meta element". Only modify the encoding substring in
Content-Type meta tags.

Only switch encoding once when parsing.

Fix htmlSaveFileFormat with a NULL encoding not to declare a misleading
UTF-8 charset.

Fixes #909.
2025-05-11 20:29:25 +02:00
Nick Wellnhofer
f3a080bc48 html: Ignore U+0000 in body text
Align with HTML5. Fixes #908.
2025-05-11 20:29:25 +02:00
Nick Wellnhofer
6896f478d4 Revert "valid: Remove duplicate error messages when streaming"
This reverts commit cd220b93d8.

This commit broke the xmstarlet tests.
2025-04-18 17:24:45 +02:00
Nick Wellnhofer
69b83bb68e encoding: Detect truncated multi-byte sequences with ICU
Unlike iconv or the internal converters, ICU consumes truncated multi-
byte sequences at the end of an input buffer. We currently check for a
non-empty raw input buffer to detect truncated sequences, so this fails
with ICU.

It might be possible to inspect the pivot buffer pointers, but it seems
cleaner to implement a `flush` flag for some encoding and I/O functions.
After flushing, we can check for U_TRUNCATED_CHAR_FOUND with ICU, or
detect remaining input with other converters.

Also fix detection of truncated sequences for HTML, XML content and
DTDs with iconv.
2025-03-13 22:15:10 +01:00
Nick Wellnhofer
05bd1720ce parser: Fix parsing of DTD content
Regressed in 2.11. Fixes #868.
2025-03-01 15:18:20 +01:00
Nick Wellnhofer
9f86dae989 test: Add test case for UAF in xmlSchemaIDCFillNodeTables 2025-02-20 11:35:47 +01:00
Nick Wellnhofer
8cf6129bbd html: Stop implying <p> start tags
Only <html>, <head> or <body> should be implied. Opening extra <p> tags
has always been a libxml2 quirk.
2025-02-13 20:20:17 +01:00
Nick Wellnhofer
71122421a1 html: Make implied <p> tags more deterministic
libxml2's HTML parser adds <p> start tags in some situations. This
behavior, which doesn't follow any standard, was added in 2000, see
here: http://veillard.com/XML/messages/0655.html

Text nodes that only contain whitespace don't imply a <p> tag, but the
whitespace check cannot work reliably if we're parsing partial text data
which can happen with both pull and push parser.

The logic in `areBlanks` is hard to follow. The checks involving `CUR`
depend on the position of the input pointer and seem dubious. It's also
possible that the behavior changed inadvertently with a later commit.
As a result, it's hard to come up with good test cases.

We now process leading whitespace before creating implied tags. This is
more in line with HTML5 and should avoid at least some issues with
partial text data.

For example, parsing the string "<head>   x" used to result in:

<html>
<head></head>
<body><p>   x</p></body>
</html>

And now results in:

<html>
<head>   </head>
<body><p>x</p></body>
</html>

Except for the implied <p> tag, this matches HTML5.
2025-02-13 14:31:44 +01:00
Nick Wellnhofer
b4d3d87ed2 parser: Fix parsing of doctype declarations
Fix some long-standing issues.

Fixes #504.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
080285724b html: Make data parsing modes work with push parser
This can't be solved with a simple scan for a terminator. Instead, we
make htmlParseCharData handle incomplete data if the "partial" flag is
set.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
cd220b93d8 valid: Remove duplicate error messages when streaming 2024-12-28 11:55:24 +01:00
Nick Wellnhofer
459146140a xpath: Fix parsing of non-ASCII names
Fix a long-standing issue where QNames starting with a non-ASCII
character would be rejected. This became more visible after "streaming"
XPath evaluation was disabled since the latter handled non-ASCII names
correctly.

Fixes #818.
2024-11-05 12:30:44 +01:00
Nick Wellnhofer
ffb058f484 parser: Fix detection of duplicate attributes
We really need a second scan if more than one namespace clash was
detected.
2024-10-28 20:26:55 +01:00
Nick Wellnhofer
f77ec16db0 html: Optimize htmlParseCharData 2024-10-06 20:04:00 +02:00
Nick Wellnhofer
575be6c1f1 html: Fix line numbers with CRs 2024-10-06 20:04:00 +02:00
Nick Wellnhofer
e179f3ec0e html: Stop reporting syntax errors
It doesn't make much sense to keep the old syntax error handling which
doesn't conform to HTML5.

Handling HTML5 parser errors is rather involved and not essential for
parsers.
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
c6af101728 html: Test tokenizer against html5lib test suite 2024-10-06 18:13:05 +02:00
Nick Wellnhofer
9678163f54 html: Don't check for valid XML characters 2024-10-06 18:13:05 +02:00
Nick Wellnhofer
4eeac30944 html: Start to fix EOF and U+0000 handling 2024-10-06 18:13:05 +02:00
Nick Wellnhofer
17da54c522 html: Normalize newlines 2024-10-06 18:13:05 +02:00
Nick Wellnhofer
3adb396d87 html: Parse bogus comments instead of ignoring them
Also treat XML processing instructions as bogus comments.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
e1834745e0 html: Add character data tests 2024-10-06 18:13:05 +02:00
Nick Wellnhofer
f9ed30e972 html: HTML5 character data states 2024-10-06 18:13:05 +02:00
Nick Wellnhofer
5951179239 html: Parse named character references according to HTML5 2024-10-06 18:13:05 +02:00
Nick Wellnhofer
a80f8b64a9 html: Allow attributes in end tags
Attribute are syntactically allowed in HTML5 end tags but otherwise
ignored.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
dcb2abb2fe html: Parse tag and attribute names according to HTML5
HTML5 allows bascially all characters in tag and attribute names.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
bd9eed4694 parser: Make unsupported encodings an error in declarations
This was changed in 45157261, but in encoding declarations, unsupported
encodings should raise a fatal error.

Fixes #794.
2024-09-02 19:29:39 +02:00
Nick Wellnhofer
8ae06d5223 SAX2: Don't merge CDATA sections
The Document Object Model (DOM) Level 3 Core Specification says:

> Adjacent CDATASection nodes are not merged by use of the normalize
> method of the Node interface.

Fixes #412.
2024-08-29 01:31:19 +02:00
Nick Wellnhofer
322e733b84 xinclude: Fix fallback for text includes
Fixes #772.
2024-07-18 19:32:23 +02:00
Nick Wellnhofer
842a044831 valid: Restore ID lookup
Revert a change from d025cfbb and don't overwrite ID table entries, so
that the first attribute will be returned if there are duplicate IDs.

This requires two other changes:

- Attributes in entity content are never added to the ID table. This
  seems reasonable.

- Remove the optimization to skip ID lookup when copying and the target
  document has an empty ID table. This also seems more correct since the
  document could have ID declarations nevertheless or we could be
  copying xml:ids into the document for the first time.

Fixes #757.
2024-07-03 11:46:06 +02:00
Nick Wellnhofer
30be984a0f encoding: Rework ISO-8859-X conversion
Optimize code. Pass tables as context parameter. Check for
XML_ENC_ERR_SPACE.
2024-07-01 18:05:40 +02:00
Nick Wellnhofer
7c11da2d98 tests: Clarify licence of test/intsubset2.xml 2024-06-27 12:49:06 +02:00
Nick Wellnhofer
b8903b9e0d runtest: Remove result handling from schemasOneTest
We only care about errors.
2024-06-22 21:59:03 +02:00
Nick Wellnhofer
e68ccfa988 tests: Port Schematron tests to C 2024-06-22 21:59:03 +02:00
Nick Wellnhofer
1dd5e76a69 xinclude: Don't remove root element
Don't replace include element at root with empty nodeset.
2024-06-18 20:12:03 +02:00
Nick Wellnhofer
52ce0d70f9 tests: Add XInclude test for issue #733 2024-06-17 17:35:12 +02:00
Nick Wellnhofer
2608baaf92 parser: Make failure to load main document a warning
Revert the change that made failures to load the main document an error.

This fixes the --path option of xmllint and xsltproc.

Should fix #733.
2024-06-14 20:06:07 +02:00
Nick Wellnhofer
669bd34993 xpointer: Remove support for XPointer locations
The latest spec for what it essentially an XPath extension seems to be
this working draft from 2002:

    https://www.w3.org/TR/xptr-xpointer/

The xpointer() scheme is listed as "being reviewed" in the XPointer
registry since at least 2006. libxml2 seems to be the only modern
software that tries to implement this spec, but the code has many bugs
and quality issues.

If you configure --with-legacy, old symbols are retained for ABI
compatibility.
2024-06-12 18:20:01 +02:00
Nick Wellnhofer
4fefba4cf6 parser: Rework handling of undeclared entities
Throw an error if entity substitution was requested.

Now we only downgrade to a warning if

- XML_PARSE_DTDLOAD wasn't specified, and
- entity aren't substituted or XML_PARSE_NO_XXE was specified.

Should fix #724.
2024-05-15 17:58:48 +02:00