Fix GH-20439: xml_set_default_handler() does not properly handle special characters in attributes when passing data to callback #20453
+74
−82
We would need to escape the attributes, but libxml2 has no built-in method we can call to do so in a way that is consistent with both the attribute escape rules and expat. The two escape functions that are exposed are `xmlEncodeEntitiesReentrant` and `xmlEncodeSpecialChars`, and they use the internal `xmlEscapeText` function. However, we can't access the right flag or that function from outside of libxml2.

In fact, expat just repeats the input, while we reconstruct it. To fix the issue, and to be consistent with expat, we repeat the input as well. This works by seeking to the start and end of the tag and passing that span to the default handler. This is fine because the parser used in ext/xml is always in non-progressive mode, so we have access to the entire input buffer. Since the grammar of XML does not allow '<' and '>' in start elements or inside self-closing elements, seeking works fine.

A self-closing tag ends its event at the solidus. Expat emits one event for it: only a start tag default. The compat layer emits two events; we keep BC by still emitting two events and replacing the solidus with a '>'.
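To illustrate the "repeat the input" idea, here is a minimal standalone sketch. It is not the code from this PR: `emit_raw_start_tag`, `default_handler_fn`, `struct capture` and `capture_handler` are hypothetical names, and the sketch assumes the caller already knows the offset of the tag's closing '>' in a fully buffered input, so only the backward seek to '<' is shown.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical callback type standing in for the user's default handler. */
typedef void (*default_handler_fn)(void *user, const char *data, size_t len);

/*
 * Hypothetical helper: `buf` is the complete input and `end` is the offset
 * of the '>' that closes a start tag or self-closing tag. The tag's source
 * text is passed to `handler` unchanged, so entities such as "&amp;" are
 * never decoded and re-escaped. For a self-closing tag the text stops at
 * the solidus, which is replaced by '>'.
 * Returns 0 on success, -1 if the span is not a tag or allocation fails.
 */
static int emit_raw_start_tag(const char *buf, size_t end,
                              default_handler_fn handler, void *user)
{
    assert(buf != NULL && handler != NULL);
    if (buf[end] != '>') {
        return -1;
    }

    /* Seek backwards to the '<' that opens this tag. */
    size_t start = end;
    while (start > 0 && buf[start] != '<') {
        start--;
    }
    if (buf[start] != '<') {
        return -1;
    }

    size_t len = end - start + 1;
    if (len >= 3 && buf[end - 1] == '/') {
        /* Self-closing: copy up to and including the solidus, then
         * overwrite the solidus with '>' so it reads as a start tag. */
        size_t out_len = len - 1;
        char *copy = malloc(out_len);
        if (copy == NULL) {
            return -1;
        }
        memcpy(copy, buf + start, out_len);
        copy[out_len - 1] = '>';
        handler(user, copy, out_len);
        free(copy);
    } else {
        handler(user, buf + start, len);
    }
    return 0;
}

/* Hypothetical handler that records what it receives, for demonstration. */
struct capture {
    char text[256];
    size_t len;
    int calls;
};

static void capture_handler(void *user, const char *data, size_t len)
{
    struct capture *c = user;
    if (c->len + len < sizeof c->text) {
        memcpy(c->text + c->len, data, len);
        c->len += len;
        c->text[c->len] = '\0';
    }
    c->calls++;
}
```

With this sketch, calling `emit_raw_start_tag` on the closing '>' of `<leaf b="&lt;"/>` hands `<leaf b="&lt;">` to the handler in a single call, with the entity left exactly as written in the input. The second (end tag) event that the compat layer emits for a self-closing element is outside the scope of the sketch.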
A nice side effect is that this PR reduces the amount of code in the compatibility layer.