mirror of https://gitlab.gnome.org/GNOME/libxml2.git synced 2025-10-24 13:33:01 +03:00
214 Commits

Author SHA1 Message Date
Nick Wellnhofer
3ff6abbf58 encoding: Rework error codes
Use an enum instead of magic numbers. Fix a few error codes. Simplify
handling of "space" and "partial" errors.

See #506.
2023-04-30 16:43:29 +02:00
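
The idea of replacing magic return values with an enum can be sketched as follows. The enum and function names here are hypothetical and chosen for illustration only; the actual identifiers and values in libxml2 may differ.

```c
#include <stddef.h>

/* Hypothetical status codes; not the names used by libxml2. */
typedef enum {
    SKETCH_ENC_OK = 0,
    SKETCH_ENC_ERR_SPACE = -1,  /* output buffer is full */
    SKETCH_ENC_ERR_INPUT = -2   /* invalid input byte */
} sketch_enc_status;

/*
 * Copy ASCII bytes from in to out. On return, *inlen holds the number
 * of bytes consumed and *outlen the number of bytes produced.
 */
static sketch_enc_status
sketch_ascii_copy(unsigned char *out, size_t *outlen,
                  const unsigned char *in, size_t *inlen)
{
    size_t i = 0, o = 0;
    sketch_enc_status ret = SKETCH_ENC_OK;

    while (i < *inlen) {
        if (in[i] >= 0x80) {
            ret = SKETCH_ENC_ERR_INPUT;
            break;
        }
        if (o >= *outlen) {
            ret = SKETCH_ENC_ERR_SPACE;
            break;
        }
        out[o++] = in[i++];
    }

    *inlen = i;
    *outlen = o;
    return ret;
}
```

A caller can then compare the result against named constants instead of bare -1 or -2.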
Nick Wellnhofer
33fb297b36 encoding: Fix compiler warning in ICU build 2023-04-17 14:59:47 +02:00
Nick Wellnhofer
a6b9e55a9e encoding: Fix error code in asciiToUTF8
Use correct error code when invalid ASCII bytes are encountered.

Found by OSS-Fuzz.
2023-03-26 15:42:02 +02:00
Nick Wellnhofer
98840d40da parser: Rework EBCDIC code page detection
To detect EBCDIC code pages, we used to switch the encoding twice and
had to be very careful not to decode data after the XML declaration
before the second switch. This relied on a hard-coded expected size of
the XML declaration and was complicated and unreliable.

Now we convert the first 200 bytes to EBCDIC-US and parse the encoding
declaration manually.
2023-03-21 21:35:15 +01:00
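
Parsing the encoding declaration manually might look roughly like the sketch below. It assumes the start of the document has already been converted to an ASCII-compatible, NUL-terminated string; the function name is hypothetical and the parser's real logic is likely stricter.

```c
#include <stddef.h>
#include <string.h>

static const char *
sketch_skip_ws(const char *p)
{
    while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n')
        p++;
    return p;
}

/*
 * Extract the value of the encoding pseudo-attribute from an XML
 * declaration. Returns 0 on success and -1 if no usable value is found.
 */
static int
sketch_find_encoding(const char *decl, char *name, size_t size)
{
    const char *p = strstr(decl, "encoding");
    const char *end;
    char quote;
    size_t len;

    if (p == NULL)
        return -1;
    p = sketch_skip_ws(p + strlen("encoding"));
    if (*p != '=')
        return -1;
    p = sketch_skip_ws(p + 1);

    quote = *p;
    if (quote != '"' && quote != '\'')
        return -1;
    p++;

    end = strchr(p, quote);
    if (end == NULL)
        return -1;
    len = (size_t) (end - p);
    if (len == 0 || len >= size)
        return -1;

    memcpy(name, p, len);
    name[len] = '\0';
    return 0;
}
```

The extracted name could then be used to select the actual EBCDIC code page with a single encoding switch.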
Nick Wellnhofer
1c5e1fc194 malloc-fail: Check for malloc failure in xmlFindCharEncodingHandler
Don't return encoding handlers with a NULL name.

Found with libFuzzer, see #344.
2023-02-17 17:16:50 +01:00
Nick Wellnhofer
d18f9c1102 malloc-fail: Fix leak of xmlCharEncodingHandler
Also free handler if its name is NULL.

Found with libFuzzer, see #344.
2023-02-17 17:16:50 +01:00
Nick Wellnhofer
3cc900f098 encoding: Cast toupper argument to unsigned char
Fixes undefined behavior.

Also cast return value explicitly to fix implicit-integer-sign-change
checks.
2023-02-17 17:16:50 +01:00
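
A minimal sketch of the cast pattern described above. Passing a negative char value to toupper is undefined behavior, so the argument is cast to unsigned char first, and the int result is cast back explicitly. The helper name is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Copy src to dst in upper case, writing at most size bytes including
 * the terminating NUL.
 */
static void
sketch_upper_copy(char *dst, const char *src, size_t size)
{
    size_t i;

    if (size == 0)
        return;
    for (i = 0; i + 1 < size && src[i] != '\0'; i++) {
        /* Cast the argument: plain char may be signed. */
        dst[i] = (char) toupper((unsigned char) src[i]);
    }
    dst[i] = '\0';
}
```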
Nick Wellnhofer
2355eac59e malloc-fail: Fix null deref if growing input buffer fails
Also add some error checks.

Found with libFuzzer, see #344.
2023-01-24 11:32:15 +01:00
Nick Wellnhofer
0f54af7494 encoding.c: Fix for documentation generator
Top-level macro invocations throw off the documentation parser.
2022-12-08 18:40:58 +01:00
Nick Wellnhofer
53ab38408d encoding: Make init function private 2022-11-27 02:11:07 +01:00
Nick Wellnhofer
3e9d5e4f7f encoding: Remove unused variable xmlDefaultCharEncodingHandler 2022-11-27 02:11:07 +01:00
Nick Wellnhofer
1406b20fe9 encoding: Allocate default handlers statically 2022-11-24 19:21:01 +01:00
Nick Wellnhofer
2059df5358 buf: Deprecate static/immutable buffers 2022-11-20 21:16:03 +01:00
Nick Wellnhofer
ad338ca737 Remove explicit integer casts
Remove explicit integer casts as final operation

- in assignments
- when passing arguments
- when returning values

Remove casts

- to the same type
- from certain range-bound values

The main motivation is that these explicit casts don't change the result
of operations and only render UBSan's implicit-conversion checks
useless. Removing these casts allows UBSan to detect cases where
truncation or sign-changes occur unexpectedly.

Document some explicit casts as truncating and add a few missing ones.
2022-09-01 02:33:57 +02:00
Nick Wellnhofer
0f568c0b73 Consolidate private header files
Private functions were previously declared

- in header files in the root directory
- in public headers guarded with IN_LIBXML
- in libxml.h
- redundantly in source files that used them.

Consolidate all private header files in include/private.
2022-08-26 02:11:56 +02:00
David Kilzer
c14cac8bba xmlBufAvail() should return length without including a byte for NUL terminator
* buf.c:
(xmlBufAvail):
- Return the number of bytes available in the buffer, but do not
  include a byte for the NUL terminator so that it is reserved.

* encoding.c:
(xmlCharEncFirstLineInput):
(xmlCharEncInput):
(xmlCharEncOutput):
* xmlIO.c:
(xmlOutputBufferWriteEscape):
- Remove code that subtracts 1 from the return value of
  xmlBufAvail().  It was implemented inconsistently anyway.
2022-05-25 18:25:19 -07:00
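
The change to xmlBufAvail() can be illustrated with a simplified buffer. The struct and function below are hypothetical stand-ins, not the real xmlBuf layout: the point is only that one byte stays reserved for the NUL terminator, so callers no longer subtract 1 themselves.

```c
#include <stddef.h>

/* Hypothetical, simplified buffer; not the real xmlBuf structure. */
typedef struct {
    size_t use;   /* bytes currently stored */
    size_t size;  /* bytes allocated */
} sketch_buf;

/* Bytes that can still be written while keeping room for a NUL. */
static size_t
sketch_buf_avail(const sketch_buf *buf)
{
    if (buf->size <= buf->use)
        return 0;
    return buf->size - buf->use - 1;
}
```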
David Kilzer
21561e833a Mark more static data as const
Similar to 8f5710379, mark more static data structures with
`const` keyword.

Also fix placement of `const` in encoding.c.

Original patch by Sarah Wilkin.
2022-04-07 12:01:23 -07:00
Nick Wellnhofer
40483d0ce2 Deprecate module init and cleanup functions
These functions shouldn't be part of the public API. Most init
functions are only thread-safe when called from xmlInitParser. Global
variables should only be cleaned up by calling xmlCleanupParser.
2022-03-06 15:59:43 +01:00
Nick Wellnhofer
f2072a8b2f Fix memory leak in xmlFindCharEncodingHandler
Fix memory leak in an unlikely error condition. Thanks to Wentao Liang
for the report.

Fixes #342.
2022-03-05 18:27:12 +01:00
Nick Wellnhofer
21ddad5284 Remove ICONV_CONST test
We can simply cast the offending pointer to (void *).
2022-03-04 22:08:58 +01:00
Nick Wellnhofer
776d15d383 Don't check for standard C89 headers
Don't check for

- ctype.h
- errno.h
- float.h
- limits.h
- math.h
- signal.h
- stdarg.h
- stdlib.h
- string.h
- time.h

Stop including non-standard headers

- malloc.h
- strings.h
2022-03-02 00:43:54 +01:00
Nick Wellnhofer
b66ce0bba8 Don't include ICU headers in public headers
There's no need to make these implementation details public.
2022-03-01 13:02:49 +01:00
Nick Wellnhofer
c41bc10da3 Fix unused variable warnings with disabled features 2022-02-22 19:57:12 +01:00
Nick Wellnhofer
346c3a930c Remove elfgcchack.h
The same optimization can be enabled with -fno-semantic-interposition
since GCC 5. clang has always used this option by default.
2022-02-20 21:49:04 +01:00
Nick Wellnhofer
7abc6e6a24 Fix integer conversion warning in xmlIconvWrapper
Use size_t for return value of iconv(3) to avoid an UBSan integer
conversion warning.
2022-01-25 03:07:30 +01:00
Mohammad Razavi
eb4c1bf855 Fix random dropping of characters on dumping ASCII encoded XML
Fix a bug in xmlCharEncOutput return value which will cause
xmlNodeDumpOutput to drop characters randomly.

xmlCharEncOutput returns zero if the length of the input buffer is
zero, but ignores the fact that it may have already encoded the input
buffer: the input's length can be zero because xmlEncOutputChunk
returned a -2 error and the underlying code tried to fix the error by
encoding the input.

xmlCharEncOutput is collecting the number of bytes written to the
output buffer but is returning zero instead of the total number of
bytes in this situation. This commit will fix this issue by returning
the total number of bytes instead. So the xmlNodeDumpOutput will also
continue writing and will not stop due to the fact that it mistakenly
thinks the output buffer is not changed in that iteration.

Fixes #314
2022-01-16 15:08:44 +01:00
David Kilzer
03bb929390 Fix parse failure when 4-byte character in UTF-16 BE is split across a chunk
This makes the logic in UTF16BEToUTF8() match UTF16LEToUTF8().

* encoding.c:
(UTF16LEToUTF8):
- Fix comment to describe what the code does.
(UTF16BEToUTF8):
- Fix undefined behavior which was applied to UTF16LEToUTF8() in
  2f9382033e.
- Add bounds check to while() loop which was applied to
  UTF16LEToUTF8() in be803967db.
- Do not return -2 when (in >= inend) to fix the bug.  This was
  applied to UTF16LEToUTF8() in 496a1cf592.
- Inline (<< 8) statements to match UTF16LEToUTF8().

Add the following tests and results:

  test/text-4-byte-UTF-16-BE-offset.xml
  test/text-4-byte-UTF-16-BE.xml
  test/text-4-byte-UTF-16-LE-offset.xml
  test/text-4-byte-UTF-16-LE.xml
2022-01-16 14:07:17 +01:00
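
The chunk-boundary behavior can be sketched with a small UTF-16BE decoder. This is an illustration under assumed names and return values, not the code of UTF16BEToUTF8(): when the input ends in the middle of a code unit or surrogate pair, the function stops, reports how many bytes it consumed, and returns success so that the caller can retry once more data arrives.

```c
#include <stddef.h>

/*
 * Returns 0 on success, -1 if out is too small, -2 on an invalid
 * surrogate. *inlen and *outlen are updated to the bytes consumed
 * and produced. Incomplete trailing input is left unconsumed.
 */
static int
sketch_utf16be_to_utf8(unsigned char *out, size_t *outlen,
                       const unsigned char *in, size_t *inlen)
{
    size_t i = 0, o = 0;
    int ret = 0;

    while (*inlen - i >= 2) {
        unsigned long c = ((unsigned long) in[i] << 8) | in[i + 1];
        size_t used = 2, need;

        if (c >= 0xD800 && c <= 0xDBFF) {
            unsigned long d;

            if (*inlen - i < 4)
                break;  /* pair split across chunks: not an error */
            d = ((unsigned long) in[i + 2] << 8) | in[i + 3];
            if (d < 0xDC00 || d > 0xDFFF) {
                ret = -2;
                break;
            }
            c = 0x10000 + ((c - 0xD800) << 10) + (d - 0xDC00);
            used = 4;
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            ret = -2;  /* low surrogate without a preceding high one */
            break;
        }

        need = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
        if (*outlen - o < need) {
            ret = -1;
            break;
        }

        if (need == 1) {
            out[o++] = (unsigned char) c;
        } else if (need == 2) {
            out[o++] = (unsigned char) (0xC0 | (c >> 6));
            out[o++] = (unsigned char) (0x80 | (c & 0x3F));
        } else if (need == 3) {
            out[o++] = (unsigned char) (0xE0 | (c >> 12));
            out[o++] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char) (0x80 | (c & 0x3F));
        } else {
            out[o++] = (unsigned char) (0xF0 | (c >> 18));
            out[o++] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
            out[o++] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char) (0x80 | (c & 0x3F));
        }
        i += used;
    }

    *inlen = i;
    *outlen = o;
    return ret;
}
```

If a chunk ends after the high surrogate, the call consumes everything before it and leaves the surrogate for the next call.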
David King
b92b16f659 Remove unused variable in xmlCharEncOutFunc
Fixes a compiler warning:

encoding.c: In function 'xmlCharEncOutFunc__internal_alias':
encoding.c:2632:9: warning: unused variable 'output' [-Wunused-variable]
 2632 |     int output = 0;

https://gitlab.gnome.org/GNOME/libxml2/-/issues/254
2021-05-23 11:55:32 +02:00
Nick Wellnhofer
dcb80b92da Fix slow parsing of HTML with encoding errors
Under certain circumstances, the HTML parser would try to guess and
switch input encodings multiple times, leading to slow processing of
documents with encoding errors. The repeated scanning of the input
buffer when guessing encodings could even lead to quadratic behavior.

The code htmlCurrentChar probably assumed that if there's an encoding
handler, it is guaranteed to produce valid UTF-8. This holds true in
general, but if the detected encoding was "UTF-8", the UTF8ToUTF8
encoding handler simply invoked memcpy without checking for invalid
UTF-8. This still must be fixed, preferably by not using this handler
at all.

Also leave a note that switching encodings twice seems impossible to
implement correctly. Add a check when handling UTF-8 encoding errors
in htmlCurrentChar to avoid this situation, even if encoders produce
invalid UTF-8.

Found by OSS-Fuzz.
2021-02-20 21:28:56 +01:00
Xiaoming Ni
649d02eaa4 encoding: fix memleak in xmlRegisterCharEncodingHandler()
The return type of xmlRegisterCharEncodingHandler() is void, so the caller
cannot determine whether xmlRegisterCharEncodingHandler() executed
successfully. When nbCharEncodingHandler >= MAX_ENCODING_HANDLERS, the
"handler" is not added to the array "handlers". As a result, the memory
of "handler" can no longer be managed or released, and it leaks.

So add "xmlFree(handler)" to fix the memory leak on the failure branch of
xmlRegisterCharEncodingHandler().

Reported-by: wuqing <wuqing30@huawei.com>
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
2020-12-07 14:38:14 +01:00
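
The failure branch described above can be sketched with a fixed-size table. All names below are hypothetical, and plain malloc/free stand in for libxml2's allocator; the point is that a void register function takes ownership of the handler and must release it when the table is full.

```c
#include <stdlib.h>

#define SKETCH_MAX_HANDLERS 4

/* Hypothetical handler type; not the real xmlCharEncodingHandler. */
typedef struct {
    char *name;
} sketch_handler;

static sketch_handler *sketch_handlers[SKETCH_MAX_HANDLERS];
static int sketch_nb_handlers = 0;

static void
sketch_free_handler(sketch_handler *handler)
{
    if (handler == NULL)
        return;
    free(handler->name);
    free(handler);
}

/* Takes ownership of handler, whether or not registration succeeds. */
static void
sketch_register_handler(sketch_handler *handler)
{
    if (handler == NULL)
        return;
    if (sketch_nb_handlers >= SKETCH_MAX_HANDLERS) {
        /* The caller cannot see the failure, so free here. */
        sketch_free_handler(handler);
        return;
    }
    sketch_handlers[sketch_nb_handlers++] = handler;
}
```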
Frederik Seiffert
b516ed189e Fix building with ICU 68.
ICU 68 no longer defines the TRUE macro.

Closes #204.
2020-11-19 18:10:32 +01:00
Nick Wellnhofer
1e41e4fa8e Fix return values and documentation in encoding.c
Make xmlEncInputChunk and xmlEncOutputChunk return 0 on success and
never a positive value.

Make xmlCharEncFirstLineInt, xmlCharEncFirstLineInt and
xmlCharEncOutFunc return the number of bytes written.
2020-07-06 15:06:13 +02:00
Nick Wellnhofer
2f9382033e Fix undefined behavior in UTF16LEToUTF8
Don't perform arithmetic on null pointer.

Found with libFuzzer and UBSan.
2020-06-15 21:23:54 +02:00
Nick Wellnhofer
a697ed1e24 Fix return value of xmlCharEncOutput
Commit 407b393d introduced a regression caused by xmlCharEncOutput
returning 0 in case of success instead of the number of bytes written.
Always use its return value for nbchars in xmlOutputBufferWrite.

Fixes #166.
2020-06-15 15:23:38 +02:00
Nick Wellnhofer
20c60886e4 Fix typos
Resolves #133.
2020-03-08 17:41:53 +01:00
Jared Yanovich
2a350ee9b4 Large batch of typo fixes
Closes #109.
2019-09-30 18:04:38 +02:00
Andrey Bienkowski
d2293cdbc8 Remove a misleading line from xmlCharEncOutput
Closes: https://bugzilla.gnome.org/show_bug.cgi?id=793028

It seems this line was accidentally copied over from xmlCharEncOutFunc.
In xmlCharEncOutput output is a pointer so incrementing it by ret can
point it where it wasn't supposed to be pointing. Luckily the current
implementation doesn't dereference the pointer after advancing it.

Signed-off-by: Daniel Veillard <veillard@redhat.com>
2018-07-23 10:21:38 +08:00
Nick Wellnhofer
772c06487b Fix unused parameter warning without ICU 2017-11-09 17:56:31 +01:00
Joel Hockey
0b19f236a2 Fixed ICU to set flush correctly and provide pivot buffer.
By always setting flush=TRUE when doing multiple reads, ICU
will not correctly handle truncated utf8 chars across read
boundaries.

The fix is to set flush=TRUE only on final read, and to
provide a pivot buffer which is maintained by libxml
between calls to ucnv_convertEx.
2017-11-04 15:25:31 +01:00
Nick Wellnhofer
e5107772ff Fix pathological performance when outputting charrefs
If a character can't be represented in the output encoding, it is
converted to a character reference. This used to replace the
character in the input stream by calling xmlBufAddHead or
xmlBufferAddHead. These functions shifted the entire input array
around, leading to quadratic performance when converting a run of
non-representable characters. This is most pronounced when dumping to
memory.

Output the charref directly instead.

Found with libFuzzer.
2017-06-19 16:06:21 +02:00
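
Writing the character reference straight to the output, instead of inserting it back into the input, might look like the sketch below. It is a simplified illustration with hypothetical names: the input is an array of code points and the target encoding is plain ASCII.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Encode code points as ASCII, emitting a decimal character reference
 * for anything that is not representable. Returns the number of bytes
 * written (excluding the NUL), or -1 if out is too small.
 */
static int
sketch_ascii_with_charrefs(char *out, size_t outsize,
                           const unsigned *cps, size_t n)
{
    size_t o = 0, i;

    for (i = 0; i < n; i++) {
        if (cps[i] < 0x80) {
            if (o + 1 >= outsize)
                return -1;
            out[o++] = (char) cps[i];
        } else {
            /* Append the charref directly; the input is never shifted. */
            int len = snprintf(out + o, outsize - o, "&#%u;", cps[i]);

            if (len < 0 || (size_t) len >= outsize - o)
                return -1;
            o += (size_t) len;
        }
    }

    if (o >= outsize)
        return -1;
    out[o] = '\0';
    return (int) o;
}
```

Each non-representable character costs time proportional to the length of its reference, so a long run of them stays linear overall.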
Nick Wellnhofer
c9ccbd6a6d Deduplicate code in encoding.c
Introduce static functions xmlEncInputChunk and xmlEncOutputChunk
that handle the internal/iconv/ICU branching.
2017-06-19 16:06:21 +02:00
David Kilzer
4472c3a5a5 Fix some format string warnings with possible format string vulnerability
For https://bugzilla.gnome.org/show_bug.cgi?id=761029

Decorate every method in libxml2 with the appropriate
LIBXML_ATTR_FORMAT(fmt,args) macro and add some cleanups
following the reports.
2016-05-23 15:01:07 +08:00
Gaurav
080a22c5ea Avoid a possibility of dangling encoding handler
For https://bugzilla.gnome.org/show_bug.cgi?id=711149

In Function:
int xmlCharEncCloseFunc(xmlCharEncodingHandler *handler)

If the freed handler is one of the entries in the handlers[] list, that
handlers[i] entry becomes a dangling pointer. This may lead to crashes
wherever handlers is read.
2013-11-29 23:10:50 +08:00
Denis Pauk
e28c8a1ace #705267 - add additional defines checks for support "./configure --with-minimum"
https://bugzilla.gnome.org/show_bug.cgi?id=705267
2013-08-03 22:00:17 +08:00
Daniel Veillard
bf058dce13 Fix the flushing out of raw buffers on encoding conversions
https://bugzilla.gnome.org/show_bug.cgi?id=692915

The new set of conversion functions tried to limit the encoding
conversion of the raw buffer to the amount being consumed, to work in
a more progressive fashion. Unfortunately this was bad for
performance and led to errors on progressive parsing when
a very large chunk was close to the end of the document. Fix
the new internal function and switch back to the old way of
converting. Fix another bug in the process.
2013-02-13 18:19:42 +08:00
Petr Sumbera
6f49c73b53 Try IBM-037 when looking for EBCDIC handlers
http://en.wikipedia.org/wiki/EBCDIC_037
as it is another variant of EBCDIC
2012-12-12 15:41:30 +08:00
Daniel Veillard
f8e3db0445 Big space and tab cleanup
Remove all space before tabs and space and tabs at end of lines.
2012-09-11 13:26:36 +08:00
Daniel Veillard
28cc42d068 Regenerating docs and API files
Various cleanups
* configure.in: force regeneration of APIs in my environment
* buf.c buf.h enc.h encoding.c include/libxml/tree.h
  include/libxml/xmlerror.h save.h tree.c: various comment cleanups
  pointed by apibuild
* doc/apibuild.py: added the 3 new internal headers in the excludes
* doc/libxml2-api.xml doc/libxml2-refs.xml: regenerated the API
* doc/symbols.xml: listing new entry points for 2.9.0
* doc/devhelp/*: regenerated
2012-08-10 10:00:18 +08:00
Daniel Veillard
18d0db2503 Adding new encoding function to deal with the new structures
* encoding.c: adds xmlCharEncFirstLineInput, xmlCharEncInput and
  xmlCharEncOutput
* enc.h: the functions are not made public but added to this new header
2012-07-23 14:24:26 +08:00
Timothy Elliott
689408bd86 Prevent an infinite loop when dumping a node with encoding problems
When a node is dumped with a new encoding, we may encounter characters
that are not supported in the new encoding. libxml2 handles this by
replacing the character with character references, but in some encodings
this can result in an infinite loop when the character references
themselves contain unsupported characters.

This fixes the infinite loop by undoing a character reference substitution
when it cannot be inserted, and returning an encoder error.

This bug was noticed when looking into an infinite loop bug report for
the Ruby Nokogiri project. The original bug report, "nokogiri process
hangs on call to inner_html" is here:
https://github.com/tenderlove/nokogiri/issues/400
2012-05-08 22:03:22 +08:00