A case study for XSLT transformation of JSON: the transpiler #1786
The XML emitted by the JavaParser has an interesting structure. For example a conditional
The content model of an element here is determined not by the element name, but by the value of the (Is this what SGML old-timers used to call "architectural forms"?) In consequence, much of the transpiler consists of template rules in the form Because the content model depends on the |
This is what elements-to-maps() does with this fragment, when defaulting all options. (More precisely, this is a fragment of the result of applying elements-to-maps() to the containing XML document:
At first sight this looks quite reasonable, and certainly something that's possible to work with, and something that's not structurally different from what the JavaParser people might have chosen to emit if they had been outputting JSON directly. Looking in more detail, the only real failure is in handling elements with a single child. For example we get:
which should really be a list, because in the general case (As it happens, the XML adopts a convention that list-valued elements always have a name ending in "s" that pluralises the name of the child element, even when this leads to names like |
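The single-child problem can be illustrated outside XSLT. The following is a minimal Python sketch, not the elements-to-maps() algorithm itself: `convert`, the `list_valued` parameter, and the element names `arguments`/`argument` are all hypothetical. It assumes, as in this vocabulary, that leaf information lives in attributes. When the layout is inferred per element instance, a container with one child comes out as a map while the same container with two children comes out as a list; deciding the layout per element name across the whole input removes the inconsistency.

```python
import xml.etree.ElementTree as ET

def convert(elem, list_valued=frozenset()):
    """Hypothetical converter: attributes become properties, children become
    either a list (repeated children, or a name declared list-valued) or
    named properties of a map."""
    children = list(elem)
    tags = {c.tag for c in children}
    repeated = len(children) > 1 and len(tags) == 1
    if repeated or elem.tag in list_valued:
        return [convert(c, list_valued) for c in children]
    result = dict(elem.attrib)
    for c in children:
        result[c.tag] = convert(c, list_valued)
    return result

# Same container element, once with one child and once with two.
one = ET.fromstring('<arguments><argument name="a"/></arguments>')
two = ET.fromstring('<arguments><argument name="a"/><argument name="b"/></arguments>')

per_instance_one = convert(one)   # a map: the shape depends on this instance
per_instance_two = convert(two)   # a list
uniform_one = convert(one, frozenset({"arguments"}))  # a list in both cases
```

Code consuming `per_instance_one` and `per_instance_two` has to handle two shapes for the same property; with the name declared list-valued it only ever sees a list.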
As suggested in #1645 I experimented with adding the layout to the element name for diagnostics. This gives us:
I can certainly see the value of this. (There is one problem though: because the keys in the map are changed, any subsequent processing of the map is likely to fail.)
Note that this XML vocabulary does not use text nodes: all the leaf information goes into attribute nodes.
My next attempt was to use the "uniform" option to analyze the full XML (2144 files). I used this query:
and it threw this error: I think that must be a Saxon bug: |
I've made some experimental changes to the Saxon elements-to-maps implementation, including an option to export the layout options inferred by uniform="yes" and then use them in a subsequent call to
This time I've included some uppercase property names which result from Saxon calling the "symbol solver" in JavaParser to infer type information.
I'm now going to look at how to write XSLT 4.0 code to convert this back to Java. I'm starting with a stylesheet designed to do this with the XML form of the data, and adapting it as needed. Many of the template rules are of the form
and there are two things we need to change here: the match pattern, and the apply-templates call. We can change this one to:
The "." in the match pattern matches anything, but the lookup in the predicate will fail unless it's a map or an array; a failure while evaluating a pattern is treated as "no match" so that's OK (apart perhaps from performance implications in Saxon of throwing and catching errors...). Instead of "." we could write We can't do I'll convert the simple template rules following this pattern and then report on those that are trickier. |
There are places where we want to match by the key rather than by a property. At the root level of the JSON we have:
If we're processing the root, how do we apply templates to the packageDeclaration? I think we have to process a map by applying templates to its entries. This is articulated in the current 4.0 spec for the shallow-copy-all mode of processing - but the implications on match patterns aren't really worked through. We could model the entries as singleton maps, or as key value pairs. Whichever we do we have to ask how the entry would be matched, and the obvious answer is by a pattern of the form So this suggests the principles:
On this basis we could write the top-level processing of our JSON document as:
I'm not really comfortable with this, however. It feels wrong that the thing that we're matching in the match pattern is a map entry, but the thing that we're processing in the body of the template rules is the value part of the map entry; that just feels like it's calling for trouble. If we used key-value-pair records instead of singleton maps to represent map entries, it would become (assuming KVP is available as a named record type):
which is much clunkier syntax, but conceptually cleaner. It almost feels better to invent new constructs:
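To make the two ways of modelling a map entry concrete, here is a small Python analogy (not XSLT). The names `KVP`, `as_singleton_maps`, and `as_kvps`, and the sample property values, are hypothetical; only the `packageDeclaration` and `_nodeType` keys come from the notes above. It shows why the singleton-map form feels awkward: the thing matched is the entry, but the rule body still has to dig the value out by key, whereas the key-value-pair record exposes both parts by name.

```python
from typing import Any, NamedTuple

class KVP(NamedTuple):
    """Hypothetical key-value-pair record for one map entry."""
    key: str
    value: Any

def as_singleton_maps(m):
    # Each entry becomes a one-entry map: {key: value}
    return [{k: v} for k, v in m.items()]

def as_kvps(m):
    # Each entry becomes a record with explicit key and value fields
    return [KVP(k, v) for k, v in m.items()]

# Sample root; the values are invented for illustration.
root = {"_nodeType": "Root", "packageDeclaration": {"name": "com.example"}}

# Singleton-map style: match on key presence, then look the value up again.
singleton_hits = [e["packageDeclaration"] for e in as_singleton_maps(root)
                  if "packageDeclaration" in e]

# Record style: match on .key, process .value.
kvp_hits = [e.value for e in as_kvps(root) if e.key == "packageDeclaration"]
```

Both selections yield the same value; the difference is only in how directly the match condition and the processed item relate to each other.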
An alternative approach is that we apply templates to the values (not the entries), and rely on the fact that because the whole tree is pinned, the values are labelled with the corresponding key. The apply-templates would then have (implicitly or explicitly)
I'm going to focus now on writing this stylesheet using facilities currently defined in XSLT 4.0, in order to discover where the rough edges are, and then we'll think about new features to ease the pain when we've done that. So, given this structure:
how do I apply-templates to all the "children" of "root" except for "_nodeType"? There are two places to do the filtering, in the apply-templates call, and in the template rules. Possible ways of doing it in apply-templates:
In Saxon, the last is probably the most performant, which is a little unfortunate, because the other expressions are shorter. Of course there's always room for new optimizations. Doing it through template rules? Without relying on values being pinned, the only way is
which is a tad unselective. If the values are pinned, then we can do
which feels rather clumsy. We could dream of
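The two places to do the filtering can be sketched in Python as an analogy (this is not XSLT, and none of these function names exist in any spec). `apply_filtered` excludes `_nodeType` at the call site, which corresponds to filtering in the apply-templates call; `apply_all` passes every entry through and relies on a do-nothing rule for `_nodeType`, which corresponds to filtering in the template rules. The sample `root` values are invented.

```python
def handle(key, value):
    # Stand-in for whatever the matching rule would produce for an entry
    return f"{key}={value!r}"

def handle_or_skip(key, value):
    # Rule-side filtering: an "empty rule" that produces nothing for _nodeType
    if key == "_nodeType":
        return None
    return handle(key, value)

def apply_filtered(node):
    # Call-side filtering: _nodeType never reaches the rules
    return [handle(k, v) for k, v in node.items() if k != "_nodeType"]

def apply_all(node):
    # Everything is dispatched; the empty rule drops _nodeType
    results = (handle_or_skip(k, v) for k, v in node.items())
    return [r for r in results if r is not None]

root = {"_nodeType": "Root", "packageDeclaration": "p", "types": []}
filtered_at_call = apply_filtered(root)
filtered_in_rule = apply_all(root)
```

The results are identical; the trade-off is whether the knowledge that `_nodeType` is not a "child" lives with the caller or with the rule set.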
Note that with "predicate patterns" there is no equivalent of the union operator: union, intersect, and except patterns apply only to templates that match nodes.
I've now got the stylesheet to run to completion, though the output isn't yet quite right. Most of the template rules have the general form
So far I've managed to avoid relying on the tree being pinned. The main limitation of that is that I can't match KVPs by their key. Matching by the Converting the stylesheet from one that processed XML to one that processed JSON was reasonably straightforward - mainly a question of fixing one type error at a time. The match patterns such as |
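As a rough picture of rule-based processing that never needs the tree to be pinned, here is a Python analogy (not the stylesheet itself). It assumes that each record carries a `_nodeType` property that selects the rule; the node type names, the property names (`name`, `operator`, `left`, `right`, `expression`), and the helpers `rule` and `apply_templates` are all hypothetical. Each rule only looks downward into its own record, so no upward navigation is required.

```python
RULES = {}

def rule(node_type):
    """Register a handler for records whose _nodeType equals node_type."""
    def register(fn):
        RULES[node_type] = fn
        return fn
    return register

def apply_templates(item):
    # Arrays: process each member in order and concatenate the results
    if isinstance(item, list):
        return "".join(apply_templates(x) for x in item)
    # Records: dispatch on the (assumed) _nodeType property
    if isinstance(item, dict):
        handler = RULES.get(item.get("_nodeType"))
        if handler is None:
            raise ValueError(f"no rule for {item.get('_nodeType')!r}")
        return handler(item)
    # Atomic values: emit their string form
    return str(item)

@rule("NameExpr")
def name_expr(node):
    return node["name"]

@rule("BinaryExpr")
def binary_expr(node):
    left = apply_templates(node["left"])
    right = apply_templates(node["right"])
    return f"{left} {node['operator']} {right}"

@rule("ReturnStmt")
def return_stmt(node):
    return f"return {apply_templates(node['expression'])};"

tree = {
    "_nodeType": "ReturnStmt",
    "expression": {
        "_nodeType": "BinaryExpr",
        "operator": "+",
        "left": {"_nodeType": "NameExpr", "name": "a"},
        "right": {"_nodeType": "NameExpr", "name": "b"},
    },
}
java_text = apply_templates(tree)
```

What this sketch cannot do is select a rule by the key under which a value was found, which is the limitation noted above for unpinned trees.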
I've now got the JSON-to-Java stylesheet working to the extent that it is producing satisfactory results for a couple of reasonably-sized Java modules; it's not yet production-quality, but it's good enough that we can stand back and see what we've learned.
I was able to do it entirely without relying on the structure being "pinned": there was no need for any upward navigation. To a considerable extent that's a fortunate accident of the way this particular vocabulary is designed: the processing of individual records depends on the
The bottom-line conclusion is that it is possible, and not terribly difficult, to write this particular transformation using XSLT 4.0 in its current state.
Observations:
We typically process the string-valued properties using Using Template rules in the original XML-based stylesheet sometimes matched the element name, and sometimes the |
One of the design aims of XSLT 4.0 is that it should be easier to transform JSON. Back in 2016 I published a paper at XML Prague (https://www.saxonica.com/papers/xmlprague-2016mhk.pdf) with the rather disappointing result that for a couple of non-trivial JSON transformation tasks, the easiest solution was to convert the JSON to XML, transform the XML, and then convert it back. In many ways it was that discovery that motivated the whole XSLT 4.0 project. So I want to review to what extent we have solved that problem, and what remains to be done. In particular, I have recently raised a number of open issues related to how we transform JSON-derived trees of maps and arrays using template rules, and I'm not sure we can resolve those issues without testing the proposals against real use cases.
I'm proposing to take as a case study the Java-to-C# transpiler which we described in a 2021 paper at https://www.saxonica.com/papers/markupuk-2021mhk.pdf. This is a real XSLT application in daily use. It invokes the (open source) JavaParser to emit an XML representation of Java source code, it performs various transformations of that XML, and then finally spits out equivalent C# source code. My basic question is: suppose the JavaParser had chosen to emit JSON instead of XML (as it might perfectly reasonably have chosen to do). Would we be able to write the transpiler in XSLT 4.0 to work entirely within the JSON space, avoiding all use of XML?
I chose this case study for several reasons:
I looked at a couple of other candidates, and found they were things that could be readily done in XSLT 3.0 without any enhancements. For example we have production XSLT 3.0 code that takes a JSON data feed from our online shop at saxonica.com and uses it to update our sales database and to generate license keys. The JSON is voluminous but the structure is simple, and the constructs in XSLT 3.0 for handling maps and arrays are entirely up to the job. The transpiler differs in that the JSON has a much more interesting recursive structure, making rule-based transformation a natural fit to the task.
I'm not proposing to actually produce a complete replacement of the current transpiler, only to explore the task of doing so in enough detail to get some useful insights. I propose to use this issue tracker to capture my working notes as the study proceeds, but if there are recommendations affecting the 4.0 specs (as seems likely), then I will extract those into separate issues. Perhaps at the end of the process I will write up the case study as a conference paper.
My rough plan is as follows:
Using this format (a GitHub issue) to record progress carries a risk that there will be comments that take things off at a tangent. Please help by resisting that temptation: if there are interesting issues raised in your mind, please take those up as separate issues.