Rendering of XML is lossy #146
Comments
I hit the same issue when writing property tests to prove that Would it be acceptable if |
So I am looking into building some When thinking about this design I noticed that it would be quite common for Based on the above purpose it would really be ideal if the types themselves ruled out accidentally writing a lossy The other case I found is that |
Off the top of my head, I can't see a way to encode that constraint into the types without making the API significantly more complex. If you have a specific implementation in mind, feel free to submit a pull request.
So I think the two ways I gave, either treating chars as nodes themselves instead of Personally I think the latter makes a lot of sense. Often when you are looking at the children of a node you only really care about the non-text children (particularly when the text nodes are just whitespace), and this makes that easy (but without being lossy). If neither of those is acceptable, then another alternative is simply a smart constructor / newtype for the node list that concatenates subsequent text nodes and deletes empty text nodes.
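The smart constructor / newtype alternative could look roughly like the sketch below. It uses a simplified stand-in for the node type (`String` instead of `Text`, no attributes, comments or instructions), and the names `NodeList` and `mkNodeList` are hypothetical; none of this is the library's actual API.

```haskell
module Main where

-- Simplified stand-in for the library's node type (an assumption for
-- illustration): children are stored in the normalized NodeList wrapper.
data Node
  = NodeElement String NodeList -- element name and normalized children
  | NodeContent String          -- character data
  deriving (Eq, Show)

-- In a real module the NodeList constructor would not be exported,
-- so mkNodeList would be the only way to build a value.
newtype NodeList = NodeList [Node]
  deriving (Eq, Show)

-- Smart constructor: merges adjacent text nodes and drops empty ones.
mkNodeList :: [Node] -> NodeList
mkNodeList = NodeList . foldr step []
  where
    step (NodeContent "") acc = acc
    step (NodeContent a) (NodeContent b : rest) = NodeContent (a ++ b) : rest
    step node acc = node : acc

main :: IO ()
main = do
  let split  = mkNodeList [NodeContent "x", NodeContent "", NodeContent "y"]
      merged = mkNodeList [NodeContent "xy"]
  -- Both inputs collapse to the same normalized value.
  print (split == merged)
```

With this shape, two node lists that would render to the same text compare equal after construction, at the cost of callers no longer being able to pattern match on a plain list.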
If I understand correctly your preferred solution, it would require re-defining at least:
Rather than deviating from Also, should the |
My 2 cents is that this is a non-problem, and any attempt to fix this will have massive API breakage and likely introduce some serious performance impacts. If handled incorrectly, it could even expose a major security flaw with OOM attacks via streaming large amounts of text as a single event.
Since
And yeah this seems like it should be pushed into I don't think
I mean depends on your goals I suppose. For using XML for bidirectional serialization it seems pretty important. But if you are only expecting users to go in one direction and use other libraries/formats for bidirectionality then sure.
I'm unsure why there would be any performance impact, but I could be wrong. But yeah, there will be some API breakage. On the plus side, converting between the new and old format will be fairly trivial, though it is unfortunately still a breaking change.
I'm extremely unsure how any of this would change, it's just a small change from |
The only real problem I'm seeing implied is that it makes it difficult to test roundtripping in general. A single
This happens in the
I'd rather see explanations of why this is important than a claim of it. Most serialization formats do not have a guarantee of bidirectionality in all cases. Perhaps |
I feel like this is the single most important aspect of serialization formats. When you want to serialize something, the main thing you care about is being able to get the exact original value back when you deserialize. The easiest way to ensure the above is to make sure that every transformation toward the final serialized value is injective/bidirectional. For example in Aeson: Anytime any layer in the process deviates from this it tends to cause problems. For example right now |
I can't help but feel that what is asked is something XML doesn't guarantee in the first place. After all if I'm not mistaken, NodeContent (ContentText in unresolved) are the Character Information Items from 2.6 of XML Infoset and per the same applications are free to group (or split: the logical unit in XML text is actually one character) as they want. Specifically, how do you denote in the XML textual form the difference between, say, |
XML doesn't take a stance one way or the other on whether or not any given type/serializer/deserializer combination is lossless. In the most trivial case if your XML type is basically just the Haskell equivalent of:
Then the serializer is trivially injective.
Instead of trying to differentiate those values in XML, which I agree you cannot, you must un-differentiate them in Haskell. I mentioned some ways to head in that direction above.
You're asking in the context of a serialisation in XML though. That means that if you can't differentiate them in XML, being able to differentiate them in Haskell at render time avails you little, because you won't be able to differentiate them at parse time. Also, in the particular case of successive NodeContent/ContentText, if I'm not mistaken and they represent XML's Character Information Items, then as I understand the XML Infoset recommendation, it actually takes the stance that one shouldn't count on a specific split.
I’m saying we should make them less differentiable in Haskell. The same XML should have the exact same Haskell representation. We should make it impossible to represent the same XML with two different Haskell values, and that is entirely achievable.
The problem I see is that while trying for a given parse to always give the same result seems fair enough — and even there there's a wrinkle with what will the streaming parser do if it hits the end of a P.S.: Thinking of it, there's also that not only do two different |
Specifically, `parseText def . renderText def` is not equivalent to `Right`. This means you cannot render a `Document`, send it over the wire, parse it on the other side, and expect to get back what you put in.

For contrast, `Value` from `Aeson` does not suffer from this issue, largely because it benefits from JSON being much simpler than XML.

One specific lossy-ness is from `[NodeContent "x",NodeContent "y"]` being rendered to just `xy` and thus re-parsed as `[NodeContent "xy"]`.

I am not aware of a reasonable and spec-based way to render the two above structures differently, which means that if injectiveness is desired then the above must be unified to the same Haskell value.
One option for achieving that is replacing `[Node]` with something isomorphic to `([(Text, Node)], Text)` where `Node` no longer has a `NodeContent` constructor. Another alternative is to have `NodeContent` contain `Char` instead of `Text`.

I understand if the complexity of the above is not desired, and that losing the injectivity of `Document` is an acceptable cost, but I wanted to bring this issue up for discussion.
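To make the first option concrete, here is a minimal sketch of a children type isomorphic to `([(Text, Node)], Text)`, with conversions to and from the current list-of-nodes shape. It assumes simplified stand-in types (`String` instead of `Text`, elements only, no attributes), and the names `OldNode`, `Element`, `Children`, `fromNodes` and `toNodes` are hypothetical rather than anything the library provides.

```haskell
module Main where

-- Stand-in for the current representation (an assumption for illustration).
data OldNode
  = OldElement String [OldNode]
  | OldContent String
  deriving (Eq, Show)

-- Proposed shape: each pair holds the text preceding an element, and the
-- final String holds the trailing text. There is no separate text node,
-- so split or empty text nodes cannot be represented at all.
data Element = Element String Children
  deriving (Eq, Show)

data Children = Children [(String, Element)] String
  deriving (Eq, Show)

-- Old to new: accumulate consecutive text until the next element.
fromNodes :: [OldNode] -> Children
fromNodes = go ""
  where
    go txt [] = Children [] txt
    go txt (OldContent t : rest) = go (txt ++ t) rest
    go txt (OldElement n cs : rest) =
      let Children pairs trailing = go "" rest
      in Children ((txt, Element n (fromNodes cs)) : pairs) trailing

-- New to old: emit a text node only where the text is non-empty.
toNodes :: Children -> [OldNode]
toNodes (Children pairs trailing) = concatMap one pairs ++ text trailing
  where
    one (t, Element n cs) = text t ++ [OldElement n (toNodes cs)]
    text "" = []
    text t  = [OldContent t]

main :: IO ()
main = do
  -- The two lists from the issue map to the same value.
  print (fromNodes [OldContent "x", OldContent "y"] == fromNodes [OldContent "xy"])
  -- Converting back yields the merged form.
  print (toNodes (fromNodes [OldContent "a", OldElement "b" [], OldContent "", OldContent "c"]))
```

The `Char` alternative would reach the same goal differently: with one character per content node there is only one way to split any run of text.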