mirror of https://gitlab.gnome.org/GNOME/libxml2.git synced 2026-01-26 21:41:34 +03:00
668 Commits

Author SHA1 Message Date
Nick Wellnhofer
149c04c02d html: Escape < and > when serializing attributes
This reverts the change in cdaf657f. Coincidentally, the HTML spec just
changed to mandate the old escaping behavior:

https://github.com/whatwg/html/issues/6235

Fixes #957.
2025-08-02 15:03:18 +02:00
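
As a rough illustration of the escaping behavior this commit restores, the sketch below escapes an attribute value in Python. It is not libxml2's code: the function names `escape_html_attr` and `serialize_attr` are hypothetical, and the set of escaped characters (`&`, U+00A0, `"`, `<`, `>`) is assumed from the HTML serialization rules as amended following the linked issue.

```python
def escape_html_attr(value: str) -> str:
    """Escape a string for use inside a double-quoted HTML attribute."""
    # '&' is replaced first so the entities added below are not
    # escaped a second time.
    replacements = [
        ("&", "&amp;"),
        ("\u00a0", "&nbsp;"),
        ('"', "&quot;"),
        ("<", "&lt;"),
        (">", "&gt;"),
    ]
    for char, entity in replacements:
        value = value.replace(char, entity)
    return value


def serialize_attr(name: str, value: str) -> str:
    """Serialize one attribute, always using double quotes."""
    return f'{name}="{escape_html_attr(value)}"'
```

With the change from cdaf657f in place, the `<` and `>` replacements would have been skipped; this commit brings them back.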
Nick Wellnhofer
0c948334a8 html: Add newline to error message 2025-07-10 12:46:40 +02:00
Nick Wellnhofer
bc0bb67b57 html: Don't abort on encoding errors
Always enable recovery mode when parsing HTML, so we don't raise fatal
errors.

Regressed with 462bf0b7. Fixes #947.
2025-07-10 12:46:22 +02:00
Nick Wellnhofer
71e1e8af5e schematron: Fix memory safety issues in xmlSchematronReportOutput
Fix use-after-free (CVE-2025-49794) and type confusion (CVE-2025-49796)
in xmlSchematronReportOutput.

Fixes #931.
Fixes #933.
2025-07-04 14:44:54 +02:00
Nick Wellnhofer
24d7e15914 schematron: Complete fix for CVE-2025-49795
- Fix memory leaks
- Fix tests
2025-07-04 12:46:29 +02:00
Michael Mann
499bcb78ab Schematron: Fix null pointer dereference leading to DoS
(CVE-2025-49795)

Fixes #932
2025-07-04 09:35:14 +00:00
Michael Mann
069bcda17d Fix potential buffer overflows of interactive shell
CVE-2025-6170

Fixes #941
2025-07-02 13:29:19 -04:00
Omar Siam
9760a14fb9 relaxng: In the simplification step also unlink notAllowed refs from choice
This fixes false reports of disallowed content that occurred when notAllowed was referenced through a ref, unlike when it appeared as a tag directly within the choice tag.
2025-06-30 13:47:33 +00:00
Nick Wellnhofer
ad0f5d27c4 tree: Fix xmlGetNodePath
- Fix quadratic behavior
- Don't truncate names

Fixes #715.
2025-06-24 13:57:20 +02:00
Nick Wellnhofer
ab06bfa1f6 parser: Fix error return in xmlParseElementContentDecl
Avoid internal error later in xmlValidBuildAContentModel after
2a60ca06c.

Also avoids some unnecessary error messages.
2025-05-26 16:51:59 +02:00
Nick Wellnhofer
5ec83f7741 valid: Remove duplicate #FIXED check for namespaces
Unlike the comment indicates, this is already checked.
2025-05-25 14:26:30 +02:00
Nick Wellnhofer
7c10fff265 valid: Don't validate twice in xmlAddAttributeDecl
This should only be done in xmlValidateAttributeDecl.
2025-05-25 14:26:30 +02:00
Nick Wellnhofer
2f3655c9c3 parser: Pop PEs that start markup declarations explicitly
We currently only handle "Validity constraint: Proper Declaration/PE
Nesting", but we must detect "Well-formedness constraint: PE Between
Declarations" separately:

> The replacement text of a parameter entity reference in a DeclSep must
> match the production extSubsetDecl.

PEs in DeclSeps are PEs that start with a full markup declaration (or
another PE). These are handled in xmlParse{Internal|External}Subset. We
set a flag on these PEs and don't close them implicitly in
xmlSkipBlankCharsPE. This will make unterminated declarations in such
PEs cause a parser error. The PEs are closed explicitly in
xmlParse{Internal|External}Subset, the only location where they are
allowed to end.
2025-05-25 14:26:30 +02:00
Nick Wellnhofer
dd1961e0d8 valid: Skip more validity checks if not validating 2025-05-25 14:26:30 +02:00
Nick Wellnhofer
3a68d0b7a8 SAX2: Handle xml:id errors separately 2025-05-19 20:07:54 +02:00
Nick Wellnhofer
87087def4e tests: Remove result files committed by accident 2025-05-13 23:00:51 +02:00
Nick Wellnhofer
f0983199e8 html: Map some encodings according to HTML5
Windows-1252 is a superset of ISO-8859-1 and should be used instead.
Same for ASCII.

Also map UCS-2 and UTF-16 to UTF-16LE.
2025-05-12 14:04:30 +02:00
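
A minimal sketch of the mapping this commit describes, assuming a simple lookup table keyed by lowercased label. The table name `HTML5_ENCODING_OVERRIDES` and the function `map_html_encoding` are hypothetical and only cover the labels named in the commit message; libxml2's actual alias handling is not reproduced here.

```python
# Hypothetical override table, built only from the commit message.
HTML5_ENCODING_OVERRIDES = {
    "iso-8859-1": "windows-1252",
    "ascii": "windows-1252",
    "ucs-2": "UTF-16LE",
    "utf-16": "UTF-16LE",
}


def map_html_encoding(label: str) -> str:
    """Return the encoding to use for an HTML document's declared label."""
    stripped = label.strip()
    # Labels are compared case-insensitively; unknown labels pass through.
    return HTML5_ENCODING_OVERRIDES.get(stripped.lower(), stripped)
```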
Nick Wellnhofer
825f3a9d0c html: Always serialize attributes with double quotes
Align with HTML5.
2025-05-11 21:42:51 +02:00
Nick Wellnhofer
cdaf657ffb html: Don't escape < and > when serializing attribute values
Align with HTML5.

This will break some test suites.
2025-05-11 20:29:25 +02:00
Nick Wellnhofer
c8cea39d8a save: Fix serialization of attribute defaults containing &lt;
Long-standing bug that produced invalid XML.
2025-05-11 20:29:25 +02:00
Nick Wellnhofer
46f05ea4d5 html: Rework meta charset handling
Don't use encoding from meta tags when serializing. Only use the value
in `doc->encoding`, matching the XML serializer. This is the actual
encoding used when parsing.

Stop modifying the input document by setting meta tags before
serializing. Meta tags are now injected during serialization.

Add full support for <meta charset=""> which is also used when adding
meta tags.

Align with HTML5 and implement the "algorithm for extracting a character
encoding from a meta element". Only modify the encoding substring in
Content-Type meta tags.

Only switch encoding once when parsing.

Fix htmlSaveFileFormat with a NULL encoding not to declare a misleading
UTF-8 charset.

Fixes #909.
2025-05-11 20:29:25 +02:00
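
The "algorithm for extracting a character encoding from a meta element" mentioned above can be sketched roughly as follows. This is an illustrative Python reading of the HTML5 algorithm, not libxml2's implementation: the function name `extract_charset_from_meta` is hypothetical, and the sketch returns the raw label instead of resolving it to a known encoding.

```python
import re
from typing import Optional

ASCII_WHITESPACE = "\t\n\f\r "
# re.ASCII restricts case-insensitive matching to ASCII letters.
CHARSET_RE = re.compile("charset", re.IGNORECASE | re.ASCII)


def extract_charset_from_meta(content: str) -> Optional[str]:
    """Return the raw charset label from a Content-Type value, or None."""
    pos = 0
    while True:
        match = CHARSET_RE.search(content, pos)
        if match is None:
            return None
        pos = match.end()
        # Skip whitespace between "charset" and "=".
        while pos < len(content) and content[pos] in ASCII_WHITESPACE:
            pos += 1
        if pos < len(content) and content[pos] == "=":
            pos += 1
            break
        # No "=" follows: search again from the current position.
    # Skip whitespace after "=".
    while pos < len(content) and content[pos] in ASCII_WHITESPACE:
        pos += 1
    if pos >= len(content):
        return None
    quote = content[pos]
    if quote in "\"'":
        end = content.find(quote, pos + 1)
        # An unmatched quote yields no result.
        return content[pos + 1:end] if end != -1 else None
    # Unquoted value: runs up to whitespace, ";" or the end of the string.
    end = pos
    while end < len(content) and content[end] not in ASCII_WHITESPACE + ";":
        end += 1
    return content[pos:end]
```

Because the function reports where the label sits only implicitly, a serializer that wants to rewrite just the encoding substring would additionally need the start and end offsets; that part is left out of this sketch.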
Nick Wellnhofer
f3a080bc48 html: Ignore U+0000 in body text
Align with HTML5. Fixes #908.
2025-05-11 20:29:25 +02:00
Nick Wellnhofer
6896f478d4 Revert "valid: Remove duplicate error messages when streaming"
This reverts commit cd220b93d8.

This commit broke the xmlstarlet tests.
2025-04-18 17:24:45 +02:00
Nick Wellnhofer
69b83bb68e encoding: Detect truncated multi-byte sequences with ICU
Unlike iconv or the internal converters, ICU consumes truncated multi-
byte sequences at the end of an input buffer. We currently check for a
non-empty raw input buffer to detect truncated sequences, so this fails
with ICU.

It might be possible to inspect the pivot buffer pointers, but it seems
cleaner to implement a `flush` flag for some encoding and I/O functions.
After flushing, we can check for U_TRUNCATED_CHAR_FOUND with ICU, or
detect remaining input with other converters.

Also fix detection of truncated sequences for HTML, XML content and
DTDs with iconv.
2025-03-13 22:15:10 +01:00
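
The idea of a `flush` flag can be illustrated with Python's incremental decoders, whose `final` argument plays a similar role. This is only an analogy under that assumption: the helper `decode_chunks` is hypothetical, and neither ICU nor libxml2's own I/O functions are involved.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode byte chunks, flushing at the end to catch truncation."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = []
    for chunk in chunks:
        # final=False: an incomplete sequence at the end of a chunk is
        # buffered, since the next chunk may complete it.
        parts.append(decoder.decode(chunk, final=False))
    # final=True acts as the flush: any bytes still buffered now form a
    # truncated sequence and raise UnicodeDecodeError.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)
```

A multi-byte sequence split across two chunks decodes normally, while input that ends in the middle of a sequence is only reported as an error once the final flush happens.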
Nick Wellnhofer
05bd1720ce parser: Fix parsing of DTD content
Regressed in 2.11. Fixes #868.
2025-03-01 15:18:20 +01:00
Nick Wellnhofer
9f86dae989 test: Add test case for UAF in xmlSchemaIDCFillNodeTables 2025-02-20 11:35:47 +01:00
Nick Wellnhofer
8cf6129bbd html: Stop implying <p> start tags
Only <html>, <head> or <body> should be implied. Opening extra <p> tags
has always been a libxml2 quirk.
2025-02-13 20:20:17 +01:00
Nick Wellnhofer
71122421a1 html: Make implied <p> tags more deterministic
libxml2's HTML parser adds <p> start tags in some situations. This
behavior, which doesn't follow any standard, was added in 2000, see
here: http://veillard.com/XML/messages/0655.html

Text nodes that only contain whitespace don't imply a <p> tag, but the
whitespace check cannot work reliably if we're parsing partial text data,
which can happen with both the pull and push parsers.

The logic in `areBlanks` is hard to follow. The checks involving `CUR`
depend on the position of the input pointer and seem dubious. It's also
possible that the behavior changed inadvertently with a later commit.
As a result, it's hard to come up with good test cases.

We now process leading whitespace before creating implied tags. This is
more in line with HTML5 and should avoid at least some issues with
partial text data.

For example, parsing the string "<head>   x" used to result in:

<html>
<head></head>
<body><p>   x</p></body>
</html>

And now results in:

<html>
<head>   </head>
<body><p>x</p></body>
</html>

Except for the implied <p> tag, this matches HTML5.
2025-02-13 14:31:44 +01:00
Nick Wellnhofer
b4d3d87ed2 parser: Fix parsing of doctype declarations
Fix some long-standing issues.

Fixes #504.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
080285724b html: Make data parsing modes work with push parser
This can't be solved with a simple scan for a terminator. Instead, we
make htmlParseCharData handle incomplete data if the "partial" flag is
set.
2025-02-02 11:15:45 +01:00
Nick Wellnhofer
cd220b93d8 valid: Remove duplicate error messages when streaming 2024-12-28 11:55:24 +01:00
Nick Wellnhofer
459146140a xpath: Fix parsing of non-ASCII names
Fix a long-standing issue where QNames starting with a non-ASCII
character would be rejected. This became more visible after "streaming"
XPath evaluation was disabled since the latter handled non-ASCII names
correctly.

Fixes #818.
2024-11-05 12:30:44 +01:00
Nick Wellnhofer
ffb058f484 parser: Fix detection of duplicate attributes
We really need a second scan if more than one namespace clash was
detected.
2024-10-28 20:26:55 +01:00
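
For context, attributes count as duplicates when they resolve to the same namespace URI and local name, even if their prefixes differ. The sketch below shows that rule in Python with a plain set; it does not reproduce the scan inside libxml2 that this commit fixes, and `find_duplicate_attrs` with its argument shapes is a hypothetical helper.

```python
def find_duplicate_attrs(attrs, ns_bindings):
    """Return (uri, local name) pairs that appear more than once.

    attrs: list of (prefix, local_name) tuples; prefix may be None.
    ns_bindings: dict mapping prefix -> namespace URI.
    """
    seen = set()
    duplicates = []
    for prefix, local in attrs:
        # Unprefixed attributes are in no namespace.
        uri = ns_bindings[prefix] if prefix is not None else None
        key = (uri, local)
        if key in seen:
            duplicates.append(key)
        else:
            seen.add(key)
    return duplicates
```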
Nick Wellnhofer
f77ec16db0 html: Optimize htmlParseCharData 2024-10-06 20:04:00 +02:00
Nick Wellnhofer
575be6c1f1 html: Fix line numbers with CRs 2024-10-06 20:04:00 +02:00
Nick Wellnhofer
e179f3ec0e html: Stop reporting syntax errors
It doesn't make much sense to keep the old syntax error handling which
doesn't conform to HTML5.

Handling HTML5 parser errors is rather involved and not essential for
parsers.
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
c6af101728 html: Test tokenizer against html5lib test suite 2024-10-06 18:13:05 +02:00
Nick Wellnhofer
9678163f54 html: Don't check for valid XML characters 2024-10-06 18:13:05 +02:00
Nick Wellnhofer
4eeac30944 html: Start to fix EOF and U+0000 handling 2024-10-06 18:13:05 +02:00
Nick Wellnhofer
17da54c522 html: Normalize newlines 2024-10-06 18:13:05 +02:00
Nick Wellnhofer
3adb396d87 html: Parse bogus comments instead of ignoring them
Also treat XML processing instructions as bogus comments.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
e1834745e0 html: Add character data tests 2024-10-06 18:13:05 +02:00
Nick Wellnhofer
f9ed30e972 html: HTML5 character data states 2024-10-06 18:13:05 +02:00
Nick Wellnhofer
5951179239 html: Parse named character references according to HTML5 2024-10-06 18:13:05 +02:00
Nick Wellnhofer
a80f8b64a9 html: Allow attributes in end tags
Attributes are syntactically allowed in HTML5 end tags but otherwise
ignored.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
dcb2abb2fe html: Parse tag and attribute names according to HTML5
HTML5 allows basically all characters in tag and attribute names.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
bd9eed4694 parser: Make unsupported encodings an error in declarations
This was changed in 45157261, but in encoding declarations, unsupported
encodings should raise a fatal error.

Fixes #794.
2024-09-02 19:29:39 +02:00
Nick Wellnhofer
8ae06d5223 SAX2: Don't merge CDATA sections
The Document Object Model (DOM) Level 3 Core Specification says:

> Adjacent CDATASection nodes are not merged by use of the normalize
> method of the Node interface.

Fixes #412.
2024-08-29 01:31:19 +02:00
Nick Wellnhofer
322e733b84 xinclude: Fix fallback for text includes
Fixes #772.
2024-07-18 19:32:23 +02:00
Nick Wellnhofer
842a044831 valid: Restore ID lookup
Revert a change from d025cfbb and don't overwrite ID table entries, so
that the first attribute will be returned if there are duplicate IDs.

This requires two other changes:

- Attributes in entity content are never added to the ID table. This
  seems reasonable.

- Remove the optimization to skip ID lookup when copying and the target
  document has an empty ID table. This also seems more correct since the
  document could have ID declarations nevertheless or we could be
  copying xml:ids into the document for the first time.

Fixes #757.
2024-07-03 11:46:06 +02:00
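
The "first attribute wins" lookup rule described in this commit can be sketched as a table that never overwrites existing entries. This is a simplified Python model, assuming IDs arrive in document order; `build_id_table` is a hypothetical name and the real ID table in libxml2 is a hash table of attribute nodes.

```python
def build_id_table(attrs):
    """Build an ID table from (id_value, owner) pairs in document order."""
    table = {}
    for id_value, owner in attrs:
        # setdefault keeps an existing entry, so the first attribute
        # with a given ID wins and later duplicates are ignored.
        table.setdefault(id_value, owner)
    return table
```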