Nick Wellnhofer
e179f3ec0e
html: Stop reporting syntax errors
...
It doesn't make much sense to keep the old syntax error handling which
doesn't conform to HTML5.
Handling HTML5 parser errors is rather involved and not essential for
parsers.
2024-10-06 20:04:00 +02:00
Nick Wellnhofer
27752f75ca
html: Fix EOF handling in start tags
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
b19d353970
html: Fix EOF handling in comments
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
17e56ac54a
html: Fix parsing of end tags
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
24a09033c9
html: Fix bogus end tags
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
bca6485476
html: Allow U+000C FORM FEED as whitespace
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
6edf1a645e
html: Fix DOCTYPE parsing
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
9678163f54
html: Don't check for valid XML characters
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
a6955c13c7
html: Parse numeric character references according to HTML5
2024-10-06 18:13:05 +02:00
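Note on a6955c13c7: the HTML5 specification replaces NUL, surrogates, and values above U+10FFFF in numeric character references with U+FFFD, and remaps most of the range 0x80 to 0x9F through a Windows-1252 table. A minimal sketch of that mapping follows. The helper name `html5_fix_char_ref` and the table `c1_map` are hypothetical and are not libxml2 API; the actual commit may structure this differently.

```c
/* Hypothetical helper, not part of libxml2: map the numeric value of a
 * character reference to the code point the HTML5 spec says to emit. */

/* Replacements for 0x80..0x9F. Entries without a Windows-1252 mapping
 * (0x81, 0x8D, 0x8F, 0x90, 0x9D) map to themselves. */
static const unsigned short c1_map[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

unsigned long html5_fix_char_ref(unsigned long val) {
    /* NUL and out-of-range values become U+FFFD. */
    if (val == 0 || val > 0x10FFFF)
        return 0xFFFD;
    /* Surrogates are not valid scalar values. */
    if (val >= 0xD800 && val <= 0xDFFF)
        return 0xFFFD;
    /* C1 range: apply the Windows-1252 remapping. */
    if (val >= 0x80 && val <= 0x9F)
        return c1_map[val - 0x80];
    return val;
}
```

Under this sketch, `&#x80;` yields U+20AC (the euro sign) and `&#0;` yields U+FFFD.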
Nick Wellnhofer
4eeac30944
html: Start to fix EOF and U+0000 handling
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
e062a4a9b3
html: Add HTML5 parser option
...
This option passes tokenizer output directly to the SAX callbacks,
making it possible to test the tokenizer against the html5lib test
suite.
This will produce unbalanced calls to the startElement and endElement
callbacks, but it's the only way to support a SAX-like interface for
HTML5. It can be used for filtering or rewriting HTML5, for example.
An HTML5 tree builder could then be implemented on top of the SAX
callbacks.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
17da54c522
html: Normalize newlines
2024-10-06 18:13:05 +02:00
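Note on 17da54c522: HTML5 input preprocessing turns every CR LF pair and every lone CR into a single LF. A minimal in-place sketch of that rule follows. The function name `normalize_newlines` is hypothetical and is not libxml2 API; the sketch also assumes the whole input is in one buffer, so it ignores the case where a CR is the last byte of a push-parser chunk.

```c
#include <stddef.h>

/* Hypothetical helper, not part of libxml2: rewrite buf in place so that
 * CR LF and lone CR both become LF. Returns the new length. */
size_t normalize_newlines(char *buf, size_t len) {
    size_t in = 0, out = 0;

    while (in < len) {
        char c = buf[in++];
        if (c == '\r') {
            /* Swallow the LF of a CR LF pair. */
            if (in < len && buf[in] == '\n')
                in++;
            c = '\n';
        }
        /* out never overtakes in, so writing in place is safe. */
        buf[out++] = c;
    }
    return out;
}
```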
Nick Wellnhofer
341dc78f24
html: Deduplicate code in htmlCurrentChar
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
3adb396d87
html: Parse bogus comments instead of ignoring them
...
Also treat XML processing instructions as bogus comments.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
8444017578
html: Add missing calls to htmlCheckParagraph()
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
86d6b9b051
html: Deduplicate some code
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
0d324bde36
html: Simplify node info accounting
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
ccb61f599e
html: Remove duplicate calls to htmlAutoClose
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
f9ed30e972
html: HTML5 character data states
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
5951179239
html: Parse named character references according to HTML5
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
d5cd0f07f8
html: Prefer SKIP(1) over NEXT in HTML parser
...
Use SKIP(1) where it's safe to avoid a function call.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
dc2d498318
html: Rework htmlLookupSequence
...
Rename to htmlLookupString and use strstr for increased performance.
2024-10-06 18:13:05 +02:00
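Note on dc2d498318: a strstr-based lookup in a push parser can remember where the previous search stopped, so that appending data does not rescan the whole buffer. A minimal sketch of that idea follows. The function `lookup_string` and its `resume` parameter are hypothetical; the real htmlLookupString may keep its state elsewhere and use a different signature.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libxml2: find needle in the
 * NUL-terminated buf, starting at *resume. Returns the match offset,
 * or -1 if more data is needed. On failure, *resume is set so that the
 * next call only rescans the tail that could hold a partial match. */
long lookup_string(const char *buf, size_t *resume, const char *needle) {
    size_t nlen = strlen(needle);
    size_t len = strlen(buf);
    size_t start = (*resume <= len) ? *resume : len;
    const char *hit = strstr(buf + start, needle);

    if (hit != NULL) {
        *resume = 0;
        return (long)(hit - buf);
    }
    /* Keep the last nlen - 1 bytes: a match may straddle the boundary
     * between this chunk and the next one. */
    *resume = (len >= nlen) ? len - nlen + 1 : 0;
    return -1;
}
```

Because each byte is scanned a bounded number of times across calls, the total work stays linear in the input size under these assumptions.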
Nick Wellnhofer
637215a4de
html: Always terminate doctype declarations on '>'
...
Align with the HTML5 spec. This allows removing the old quote handling
in htmlLookupSequence.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
72e29f9a3d
html: Fix quadratic behavior in push parser
...
Fix quadratic behavior related to unquoted attribute values. We really
have to replicate parts of the HTML5 state machine to find the end of
tags reliably.
Fixes #533.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
a80f8b64a9
html: Allow attributes in end tags
...
Attributes are syntactically allowed in HTML5 end tags but otherwise
ignored.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
f2272c231b
html: Handle unexpected-solidus-in-tag according to HTML5
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
939b53ee12
html: Stop skipping tag content
...
Tag and attribute names should always be parsed successfully now.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
dcb2abb2fe
html: Parse tag and attribute names according to HTML5
...
HTML5 allows basically all characters in tag and attribute names.
2024-10-06 18:13:05 +02:00
Nick Wellnhofer
5d36664fc9
memory: Deprecate xmlGcMemSetup
2024-07-16 17:42:10 +02:00
Nick Wellnhofer
8af55c8d20
parser: Rename new input API functions
...
These weren't made public yet.
2024-07-11 01:33:29 +02:00
Nick Wellnhofer
d74ca59491
parser: Rename internal xmlNewInput functions
2024-07-11 01:31:50 +02:00
Nick Wellnhofer
4f329dc524
parser: Implement xmlCtxtParseContent
...
This implements xmlCtxtParseContent, a better alternative to
xmlParseInNodeContext or xmlParseBalancedChunkMemory. It accepts a
parser context and a parser input, making it a lot more versatile.
xmlParseInNodeContext is now implemented in terms of
xmlCtxtParseContent. This makes sure that xmlParseInNodeContext never
modifies the target document, improving thread safety.
xmlParseInNodeContext is also more lenient now with regard to undeclared
entities.
Fixes #727.
2024-07-11 01:26:32 +02:00
Nick Wellnhofer
2e63656ec6
parser: Check return value of inputPush
...
inputPush typically doesn't fail because we pre-allocate the input
table. The return value should be checked nevertheless.
2024-07-08 11:27:52 +02:00
Nick Wellnhofer
fdfeecfe5e
parser: Reenable ctxt->directory
...
Unused internally, but used in downstream code.
Should fix #753.
2024-07-02 22:06:53 +02:00
Nick Wellnhofer
30ef77554b
parser: Don't use deprecated xmlCopyChar
2024-07-02 13:34:11 +02:00
Nick Wellnhofer
dd8e378513
HTML: Rework UTF8ToHtml
...
Optimize code. Check for XML_ENC_ERR_SPACE. Use error macros.
2024-07-01 18:05:40 +02:00
Nick Wellnhofer
f505dcaea0
tree: Remove underscores from xmlRegisterCallbacks
2024-06-27 14:45:35 +02:00
Nick Wellnhofer
1112699cfa
legacy: Remove most legacy functions from public headers
...
Also remove warning messages.
2024-06-17 15:47:42 +02:00
Nick Wellnhofer
039ce1e821
parser: Pass global object to sax->setDocumentLocator
...
Revert part of commit c011e760.
Fixes #732.
2024-06-14 16:41:43 +02:00
Nick Wellnhofer
89fcae4dfd
parser: Don't report malloc failures when creating context
...
We don't want messages written to stderr before an error handler can be
set on a parser context.
2024-06-12 16:36:12 +02:00
Nick Wellnhofer
e75e878e02
doc: Update and fix documentation
2024-05-20 14:23:39 +02:00
Nick Wellnhofer
a4c2b7233f
io: Don't set close callback in xmlParserInputBufferCreateFd
2024-05-05 17:27:12 +02:00
Nick Wellnhofer
05654cfe00
html: Deprecate htmlHandleOmittedElem
2024-04-28 18:58:27 +02:00
Nick Wellnhofer
aa04838eab
html: Use binary search in htmlEntityValueLookup
2024-03-26 14:21:11 +01:00
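Note on aa04838eab: looking up an entity by code point can use binary search when the table is sorted by value. A minimal sketch follows. The struct `entity_desc`, the small sample table, and the function `entity_name_for_value` are hypothetical stand-ins and do not reproduce libxml2's actual table or the signature of htmlEntityValueLookup.

```c
#include <stddef.h>

/* Hypothetical entry type; the table must stay sorted by value. */
typedef struct {
    unsigned int value;
    const char *name;
} entity_desc;

/* Small sample table for illustration only. */
static const entity_desc entities[] = {
    { 34, "quot" }, { 38, "amp" }, { 60, "lt" }, { 62, "gt" },
    { 160, "nbsp" }, { 169, "copy" }, { 233, "eacute" }, { 8364, "euro" }
};

/* Returns the entity name for a code point, or NULL if there is none. */
const char *entity_name_for_value(unsigned int value) {
    size_t lo = 0;
    size_t hi = sizeof(entities) / sizeof(entities[0]);

    /* Search the half-open range [lo, hi). */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (entities[mid].value == value)
            return entities[mid].name;
        if (entities[mid].value < value)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}
```

Compared with a linear scan, this needs O(log n) comparisons per lookup, at the cost of requiring the table to be kept in sorted order.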
Nick Wellnhofer
3efbe916a1
parser: Mark 'token' member as unused in xmlParserCtxt
2024-01-05 20:39:40 +01:00
Nick Wellnhofer
b82fd81d06
parser: Rework xmlCtxtParseDocument
...
Make xmlCtxtParseDocument take a parser input which can be popped after
parsing.
2024-01-05 20:39:40 +01:00
Nick Wellnhofer
7e0bbbc143
parser: New input API
...
Provide a new set of functions to create xmlParserInputs. These can be
used for the document entity or from external entity loaders.
- Don't require xmlParserInputBuffer.
- All functions take a base URI.
- All functions take an encoding as string.
- xmlNewInputURL also takes a public ID.
- xmlNewInputMemory takes a size_t.
- Optimization hints for memory buffers.
Improve documentation.
Only call xmlInitParser before allocating a new parser context.
Call xmlCtxtUseOptions as early as possible.
2023-12-29 01:22:13 +01:00
Nick Wellnhofer
6a9a88a17f
parser: Move progressive flag into input struct
2023-12-29 01:20:08 +01:00
Nick Wellnhofer
d944a41515
parser: Fix in-parameter-entity and in-external-dtd checks
...
Use ctxt->input->entity instead of ctxt->inputNr to determine whether
we are inside a parameter entity.
Stop using ctxt->external to check whether we're in an external DTD.
This is signaled by ctxt->inSubset == 2.
2023-12-29 01:19:56 +01:00
Nick Wellnhofer
477a7ed82c
html: Abort earlier on fatal errors
2023-12-28 19:43:48 +01:00