fix: rare decoding error on UTF-8 documents #1579

Merged on Dec 29, 2024 (1 commit)

Conversation

rdeltour
Member

On rare occasions, decoding a UTF-8 document failed with the fatal error RSC-016 (`Invalid byte 2 of 4-byte UTF-8 sequence.`).

The likely cause is a bug in the decoding component of the Xerces XML parser; see https://issues.apache.org/jira/browse/XERCESJ-1668

As a workaround, we now decode documents with Java's built-in UTF-8 decoder instead of Xerces's own decoder: the SAX parsers' InputSource is created from an InputStreamReader rather than from the raw InputStream.
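The workaround described above can be illustrated with a minimal sketch. This is not the code from the commit: the class name `Utf8SourceSketch` and the helpers `newUtf8InputSource` and `parseText` are hypothetical, and the sketch assumes the input is already known to be UTF-8 and uses whatever SAX parser the JDK provides by default. The point is only that an `InputSource` built from a `Reader` hands the parser already-decoded characters, so the parser's own byte decoder is not involved.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; names are not taken from the project.
class Utf8SourceSketch {

  // Wrap the raw bytes in a Reader so the JDK's UTF-8 decoder,
  // not the XML parser's, turns bytes into characters.
  // Assumption: the caller already knows the stream is UTF-8.
  static InputSource newUtf8InputSource(InputStream in) {
    Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
    return new InputSource(reader);
  }

  // Parse the document with SAX and collect all character data.
  static String parseText(byte[] xml) throws Exception {
    StringBuilder text = new StringBuilder();
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
      }
    };
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true);
    try (InputStream in = new ByteArrayInputStream(xml)) {
      factory.newSAXParser().parse(newUtf8InputSource(in), handler);
    }
    return text.toString();
  }

  public static void main(String[] args) throws Exception {
    // U+1F600 is encoded as a 4-byte sequence in UTF-8.
    byte[] xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><p>\uD83D\uDE00 ok</p>"
        .getBytes(StandardCharsets.UTF_8);
    System.out.println(parseText(xml));
  }
}
```

When an `InputSource` carries a character stream, the parser uses it in preference to any byte stream and does not apply its own encoding detection. A real implementation would therefore probably need to handle non-UTF-8 documents and a leading byte order mark separately; the sketch ignores both.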

Fixes #1548

@rdeltour added the `status: ready to merge` label (The pull request is ready to be merged) on Dec 23, 2024
@rdeltour added this to the "Next maintenance release" milestone on Dec 23, 2024
@rdeltour self-assigned this on Dec 23, 2024
Base automatically changed from fix/1546/undefined-entities to main December 23, 2024 10:03
@rdeltour
Member Author

Also fixes #1554

@rdeltour merged commit 90e87b2 into main on Dec 29, 2024
5 checks passed
@rdeltour deleted the fix/1548/invalid-utf8-sequence branch on December 29, 2024 00:00
Development

Successfully merging this pull request may close these issues.

Unjustified error message "Invalid byte 2 of 4-byte UTF-8 sequence"