Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fn:parse-xml: DTDs, external resources #1860

Closed
ChristianGruen opened this issue Mar 5, 2025 · 10 comments · Fixed by #1879
Closed

fn:parse-xml: DTDs, external resources #1860

ChristianGruen opened this issue Mar 5, 2025 · 10 comments · Fixed by #1879
Labels
Enhancement A change or improvement to an existing feature PR Pending A PR has been raised to resolve this issue XQFO An issue related to Functions and Operators

Comments

@ChristianGruen
Copy link
Contributor

The text doesn’t say much about what DTD validation means. Is my assumption correct that it boils down to a SAXParserFactory.setValidating call in Java?

What about DTDs in general? Given the following snippets (using the default false for DTD validation)…

<!-- xml.dtd -->
<!ENTITY arrow "→">

parse-xml(`
  <!DOCTYPE xml SYSTEM 'xml.dtd'>
  <xml>&arrow;</xml>`
)

…should the result be <xml/>, <xml>→</xml>, or an error? In other words, should the (potentially external) xml.dtd resource be resolved and interpreted?

Maybe we should introduce an additional DTD option (or options?) to control the loading of external DTDs and the handling of entities, for example:

http://apache.org/xml/features/nonvalidating/load-external-dtd
http://xml.org/sax/features/external-general-entities
http://xml.org/sax/features/external-parameter-entities

Thoughts are welcome.

@ChristianGruen ChristianGruen added Enhancement A change or improvement to an existing feature XQFO An issue related to Functions and Operators labels Mar 5, 2025
@michaelhkay
Copy link
Contributor

michaelhkay commented Mar 5, 2025

My instinct is to avoid over-complicating it, but yes, in principle, these options could be added. The difficulty is that it might be difficult to make them fully interoperable - the XML parser in C#, for example, offers a rather different range of options. (On some environments it's hard to find an XML parser that does DTD validation at all.) The basic concept of DTD validation is defined in the XML spec and I don't think that gives any problems (yes, it DOES involve expanding external entities): perhaps additional options are best left vendor-defined.

@ndw
Copy link
Contributor

ndw commented Mar 6, 2025

My two pence:

Names beginning "xml" are reserved, so <xml> is a problematic name 😆 . I don't recall if software checks that (except for processing instructions).

I think a reasonable user expectation is that entities declared in the internal subset will be expanded unless some extra effort has been taken to prevent that.

If you don't expand the entity, then that's an error. The XDM has no way to represent an unexpanded entity reference.

@ChristianGruen
Copy link
Contributor Author

ChristianGruen commented Mar 6, 2025

Thanks; your pence are worth many pounds.

I believe that an option could be helpful that allows you to use fn:parse-xml more safely with arbitrary strings as input if DTD-related security risks are prevented (XXE, recursive entity expansion, etc.). At the same time, I fully agree with Michael that it seems difficult to define such an interoperable option.

Our processor comes with a system-wide Boolean dtd parser option that is assigned to the three (very Xerces-specific) SAXParser features that I mentioned in my first commit. If the option is disabled (and if no custom entity resolver is used), the standard behavior is that unknown entities are simply ignored (instead of <xml>→</xml>, <xml/> is returned).

Raising an error, returning U+FFFD or allowing a fallback option may be a better choice for fn:parse-xml.

@ndw
Copy link
Contributor

ndw commented Mar 6, 2025

Returning U+FFFD or ignoring unknown entities may be useful behaviors, but they're not conformant behaviors for an XML parser. (Returning an unparsed entity information item would be conformant, but the XDM doesn't support them, so that's not really practical.)

@ChristianGruen
Copy link
Contributor Author

ChristianGruen commented Mar 6, 2025

Returning U+FFFD or ignoring unknown entities may be useful behaviors, but they're not conformant behaviors for an XML parser.

Xerces seems to be lenient. The following snippet outputs ! [, ? arrow and ! ] (if skippedEntity is not overwritten, it’s a no-op):

String string = "<!DOCTYPE root SYSTEM 'root.dtd'><root>[&arrow;]</root>";
InputSource is = new InputSource(new StringReader(string));
SAXParserFactory sf = SAXParserFactory.newInstance();
sf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
sf.newSAXParser().parse(is, new DefaultHandler() {
  @Override public void characters(char[] ch, int start, int length) throws SAXException {
    System.out.println("! " + new String(ch, start, length));
  }
  @Override public void skippedEntity(String name) throws SAXException {
    System.out.println("? " + n);
  }
  @Override public void warning(SAXParseException e) throws SAXException { throw e; }
  @Override public void error(SAXParseException e) throws SAXException { throw e; }
  @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
});

@ndw
Copy link
Contributor

ndw commented Mar 7, 2025

I think there are a few things going on here. At the level of the handler, whether or not the parser is conformant is kind of up to you, the handler writer. And at this level of detail, the XML recommendation is actually inadequate. I think you can read the XML recommendation as requiring that a processor report only characters and processing instructions. That's clearly not what was intended, but if you're trying to work out what the spec says not what its authors meant, that's the sort of trouble you get into.

You can write a handler that changes all element names to "SPOON" or discards attributes that have names that don't begin with a Unicode code point that's an odd number, or, well, you're mostly limited by your imagination.

Users will surely appreciate having options that allow them to load documents that have undeclared entities in them, or to load or not load external subsets, to expand or not expand entity references, etc.

Trouble is, which ones of those options an implementation can actually support may depend on the underlying libraries that are being used. Xerces has a lot of options and it's widely available. But it's not the same as .NET's System.Xml, and then there's Node.js where things are really impoverished.

I'm not really sure what the right answer is.

@michaelhkay
Copy link
Contributor

The solution might be to define some options that implementations are not required to support: for example "external-entity-expansion-limit" if supported limits the number of external entity references that will be expanded, with the precise effect being implementation-dependent, and processors may ignore the option if not supported by the underlying parser.

@ChristianGruen
Copy link
Contributor Author

A classical example for an XXE attack that we may want to prevent:

`<!DOCTYPE root [ <!ENTITY xxe SYSTEM "file:///etc/passwd"> ]>
<root>&xxe;</root>`
=> parse-xml()

@michaelhkay
Copy link
Contributor

It seems to me that preventing access to file:///etc/passwd is probably something that ought to happen at a higher level than in the options to a specific call on parse-xml().

@martin-honnen
Copy link

In the context of Saxon (currently looking at both Saxon 12.5 Java and SaxonC Java) I had hoped that the allowedProtocols set to e.g. https,http would prevent the file URI access in the XQuery execution of parse-xml but that doesn't seem to be the case https://saxonica.plan.io/issues/6711.

@michaelhkay michaelhkay added the PR Pending A PR has been raised to resolve this issue label Mar 14, 2025
@ndw ndw closed this as completed in #1879 Mar 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement A change or improvement to an existing feature PR Pending A PR has been raised to resolve this issue XQFO An issue related to Functions and Operators
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants