-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add 'materialize_refs' function #47
base: master
Are you sure you want to change the base?
Conversation
|
@andreineculau your comment about 'replace' semantics is interesting. I checked the speck and it looks like you are right http://json-schema.org/latest/json-schema-core.html#rfc.section.7
But, at the same time, jesse itself doesn't follow this semantics. It evaluetes rules one-by-one and just perform sub-check when But here I see another problem: when using maps and schema and subschema both have the same key(s), result will be different, compared to other list-based object representations: http://erlang.org/doc/man/maps.html#from_list-1
The same spec states, that circular references are forbidden, so, I'm not sure if we sould add explicit protection.
|
Correct, we do have a bug. I just forgot to open an issue. And about bad - anything that doesn't follow a standard, even a bad standard, is bad :) It's late, and I cannot follow the map problem fully. Care to be more specific? There are 2 diff concepts that you're mixing. One, presented in the excerpt is that you would not be able to validate a JSON schema due to infinite recursion. What I'm talking about is that you would not be able to expand the schema due to infinite recursion. The latter is very much valid e.g. a cube made of cubes made of cubes... |
bump. Have you given any thoughts to the infinite recursion? Maybe you could you add an option to have a limit of expansion levels before giving up and leaving the references in place? |
Sorry for digging up a dead horse but after chatting with @seriyps I'd like to take a stab at finalizing this work. With a very complex JSON schema I can see amazing performance gains from resolving all references and having a pre-built schema in place. Unresolved: Method Loops Min Max Median Average
empty loop 2156545 * 333 = 718129485 0.003us 2.649us 0.003us 0.003us
erl_eval:'-expr/5-fun-3-'() 3 * 50 = 150 90370.080us 91817.680us 90370.080us 91120.733us Resolved: Method Loops Min Max Median Average
empty loop 2221610 * 333 = 739796130 0.003us 1.952us 0.003us 0.003us
erl_eval:'-expr/5-fun-3-'() 71 * 50 = 3550 2595.140us 4263.480us 2775.120us 2817.803us Average drops from 90.37 ms to 2.82 ms :-) I'll try and devote some time to this during the upcoming week. |
Thanks @onno-vos-dev ! IFF you can, would be interesting to see more exactly where we lose performance. I'm thinking that there's a chance we can improve performance regardless of materializing refs or not. |
I'll try and grab some flame graphs later today 👍 From a flamegraph I have (from a much wider code flow which let me down this archaeology trip in the first place) I'll see if I can either flamegraph it separately alternatively if I can reduce the schema a tad to try and reduce the stack and make the problematic codepath a bit more visible. |
Apparently Github does not allow uploading Anyway, sorry for the delay, I've had a few things that took priorities but here are two SVGs, Minor disclaimer, I only recently started digging into One can clearly see in
Considering the above callstack we can eliminate steps 2 -> 5 with a resolved schema. If we dive into this particular part of the callstack to see if we can find anything worth optimizing on it's own, my thinking is the only candidate would be the call to Looking into this particular part we see the following: When digging into two examples, the resolution of the Datetime or UUID References in my schema, we can see it takes roughly 150 us to complete. 1> perftest:test([fun() -> jesse_state:resolve_ref(State, <<"definitions.json#/definitions/Datetime">>) end]).
Method Loops Min Max Median Average
empty loop 2816498 * 333 = 937893834 0.003us 2.039us 0.003us 0.003us
erl_eval:'-expr/5-fun-3-'() 1291 * 50 = 64550 132.860us 335.160us 145.260us 154.481us
ok
2> perftest:test([fun() -> jesse_state:resolve_ref(State, <<"definitions.json#/definitions/UUID">>) end]).
Method Loops Min Max Median Average
empty loop 2863600 * 333 = 953578800 0.003us 2.183us 0.003us 0.003us
erl_eval:'-expr/5-fun-3-'() 1256 * 50 = 62800 135.840us 533.880us 146.700us 158.788us
ok Even though that may not sound like much, validating a single JSON message that I've been testing with, 359 References must be resolved at runtime. Easy math tells us that we're spending roughly 53-56ms just resolving Refs in this schema. There's a lot going on in In my eyes, if you're dealing with a JSON Schema that has 359 References, you're on your own. 😄 Resolving all references at compile-time makes perfect sense unless it's something that does not happen often enough to make it worth the trouble. It seems trivial in my mind at least to set up some compile-step which simply resolves all the references and stores the finalized schema. This way the developer keeps the best of both worlds where when working on the schema, you have nice references that you can refer to while the runtime code no longer has to resolve them constantly. Alternatively do something like @seriyps stated, allow an option to resolve all references prior to storing them in jesse database. In order to take this PR forward, can we try and come up with a few things required for this to become merge-able?
Would that be sufficient? :-) file info:
virustotal.com: scan result |
I think one of the issues I experienced when creating the original PR were that resolving the reference involves ETS lookup. And ETS lookup means memory copying from ETS to process heap.
and will try to validate an array of 1000 items the reference is going to be resolved 1000 times |
bump
|
So, this function just resolves references and replaces them with referenced schema inplace.
would become
Works with remote references as well.
It's just prototype, I'm not expecting it to be merged right now, but let's discuss what else I should add? Readme? Tests?
Also, probably there should be some option to automatically resolve references before storing schema to jesse database.
Why this was implemented? Initialy to help with debugging large schemas with remote refs (we have a lot of schemas referencing common
types.json
schema).And most likely it may improve performance (because, AFAIK, each remote
$ref
occurence will cause jesse_database:lookup. And this may happen in a loop), but this isn't used yet.