feat(gateway): JSON and CBOR response formats (IPIP-328) #9335

Merged
merged 39 commits into from
Dec 5, 2022

Conversation

hacdias
Member

@hacdias hacdias commented Oct 6, 2022

Supersedes #8037
Closes #8823
IPIP: ipfs/specs#328

The List (mostly borrowed from #8823)

@hacdias hacdias self-assigned this Oct 7, 2022
@hacdias hacdias changed the title feat: gateway support for dag-json and dag-cbor responses feat(gateway): JSON and CBOR response formats Oct 10, 2022
@hacdias

This comment was marked as resolved.

@rvagg
Member

rvagg commented Oct 12, 2022

Hey, it might be worth considering just mapping application/json straight to dag-json, and application/cbor to dag-cbor for this. Since they're output codecs then you're going to get the same thing, except for JSON which will give an error on outputting blocks with Links and/or Bytes—in that case it'd probably be more useful to the user to get the {"/"... forms of both since dag-json is valid json. Likewise, dag-cbor is valid cbor, just with strictness and use of tag 42 for CIDs—and in this case, you'd get the same output anyway because a CBOR tag is a tag. As an output format it's got the same strictness because it's working off the data model too.

Unless you want pretty-printing for JSON, which is slightly different, but I also think that's probably not wanted out of the gateway.
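
For readers unfamiliar with the `{"/"...}` forms mentioned above, here is a minimal Go sketch of how DAG-JSON represents a link and a byte string as ordinary JSON objects. The helper names `dagJSONLink` and `dagJSONBytes` and the CID string are placeholders invented for illustration. The unpadded base64 encoding reflects my reading of the DAG-JSON spec and should be treated as an assumption. This is not the encoder used by the gateway.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// dagJSONLink renders a CID string in the DAG-JSON link form {"/": "<cid>"}.
// Hypothetical helper: a real encoder works on parsed CIDs, not strings.
func dagJSONLink(cid string) ([]byte, error) {
	return json.Marshal(map[string]string{"/": cid})
}

// dagJSONBytes renders raw bytes in the DAG-JSON bytes form
// {"/": {"bytes": "<base64>"}}. Unpadded base64 is assumed here.
func dagJSONBytes(b []byte) ([]byte, error) {
	enc := base64.RawStdEncoding.EncodeToString(b)
	return json.Marshal(map[string]map[string]string{"/": {"bytes": enc}})
}

func main() {
	link, err := dagJSONLink("bafyexamplecid") // placeholder, not a real CID
	if err != nil {
		panic(err)
	}
	bytesForm, err := dagJSONBytes([]byte("hi"))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(link))
	fmt.Println(string(bytesForm))
}
```

Both outputs are valid plain JSON, which is the point being made: a client that only understands JSON can still parse them.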

@hacdias
Member Author

hacdias commented Oct 12, 2022

@rvagg the idea here was that, if you request application/json or application/cbor, we first check whether the resolved CID actually points to something encoded in JSON or CBOR. If so, use those decoders. If not, use DAG-JSON or DAG-CBOR. Otherwise, we may end up having issues with some types of data. What about JSON that is not DAG-JSON? Likewise for CBOR that might not be DAG-CBOR?
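
The selection rule described in this comment could be sketched roughly as follows. `pickEncoder` is a hypothetical name, not a function from the PR. The numeric values are the multicodec codes for the four codecs. Only the two media types named in the comment are handled.

```go
package main

import "fmt"

// Multicodec codes for the codecs discussed in this thread.
const (
	codecCBOR    uint64 = 0x51
	codecDagCBOR uint64 = 0x71
	codecDagJSON uint64 = 0x0129
	codecJSON    uint64 = 0x0200
)

// pickEncoder is a hypothetical helper: it keeps the plain codec when the
// resolved CID already uses it, and otherwise falls back to the DAG- variant.
func pickEncoder(mediaType string, cidCodec uint64) (uint64, error) {
	switch mediaType {
	case "application/json":
		if cidCodec == codecJSON {
			return codecJSON, nil
		}
		return codecDagJSON, nil
	case "application/cbor":
		if cidCodec == codecCBOR {
			return codecCBOR, nil
		}
		return codecDagCBOR, nil
	}
	return 0, fmt.Errorf("unsupported media type %q", mediaType)
}

func main() {
	c, _ := pickEncoder("application/json", codecJSON)
	fmt.Printf("json block, json requested: 0x%x\n", c)
	c, _ = pickEncoder("application/json", codecDagCBOR)
	fmt.Printf("dag-cbor block, json requested: 0x%x\n", c)
}
```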

@rvagg
Member

rvagg commented Oct 13, 2022

@hacdias that sounds like a good approach—if something is already in that codec then you don't really need to decode, so presumably you won't even need to touch those codecs since you're just posting back the raw bytes?

One major difference worth noting with the JSON codec in Go is that it pretty-prints the output. I'd like to reconcile that with the JS codec which doesn't and I've been asking for feedback whether that's helpful or not. I don't know if it's relevant here if you're going to default to the DAG-JSON codec if the block isn't already plain JSON, because you'll just get what they encoded it as—unless you're planning on doing a round-trip through the codec again?

What about JSON that is not DAG-JSON?

Shouldn't be a problem for the most part I think, but there may be edges around large integers because iirc JSON doesn't strictly have a limit on int max/min and who knows what people are throwing into IPFS. In Go we limit maximum size at int64 boundaries.

Likewise for CBOR that might not be DAG-CBOR?

This gets more tricky because "CBOR", like "JSON", isn't as strictly defined in terms of an IPLD codec and CBOR can have all sorts of funky things inside of it that don't fit properly in the data model—non-string map keys, tags for dates, regexes, and all sorts of weird things; plus you have the big integer problem too since there are ways to represent them but the Go implementation uses int64—for standard CBOR (including DAG-CBOR) there's this space between int64 and uint64 where we have an overflow issue in go-ipld-prime so you get an error, but in extended CBOR you can have arbitrary sized integers using tags. There is a history of people throwing weird CBOR into DAG-CBOR but we've really tightened our codecs in the last couple of years to make that uncomfortable.

If things work as they should, then you'll just get errors when blocks can't be properly decoded, hopefully those error messages are helpful.
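
To make the "space between int64 and uint64" concrete, the following sketch uses only the Go standard library to show which integer literals fit in which type. It illustrates the numeric ranges only. It does not reproduce how go-ipld-prime or any codec actually decodes integers, and the helper names are made up for this example.

```go
package main

import (
	"fmt"
	"strconv"
)

// fitsInt64 reports whether a decimal integer literal fits in a Go int64.
func fitsInt64(lit string) bool {
	_, err := strconv.ParseInt(lit, 10, 64)
	return err == nil
}

// fitsUint64 reports whether a decimal integer literal fits in a Go uint64.
func fitsUint64(lit string) bool {
	_, err := strconv.ParseUint(lit, 10, 64)
	return err == nil
}

func main() {
	for _, lit := range []string{
		"9223372036854775807",  // max int64
		"9223372036854775808",  // max int64 + 1: the gap described above
		"18446744073709551616", // max uint64 + 1: needs arbitrary precision
	} {
		fmt.Printf("%s int64=%v uint64=%v\n", lit, fitsInt64(lit), fitsUint64(lit))
	}
}
```

A value in the middle row is representable as an unsigned 64-bit integer but overflows a signed one, which is the range where a decoder built on int64 has to return an error.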

@hacdias
Member Author

hacdias commented Oct 13, 2022

@rvagg thanks for the comment. I simplified the code to just output the raw bytes in case the data is already encoded in the output codec, reducing possible CPU intensive operations.

Note: I can probably simplify it further in order to use the same code as the raw block handler, as long as I can set the correct ContentType.

@hacdias
Member Author

hacdias commented Oct 20, 2022

@rvagg @lidel so I noticed something that may influence whether or not we actually spit out the original bytes iff the requested codec is the same as the CID codec (also ref: ipfs/specs#328 (comment)). Let's take the following json document and add it to IPFS:

$ ipfs dag put --input-codec json --store-codec json
{
	"foo": {
		"bar": "hello, world"
	}
}
> bagaaieraev5tujfaql3zxtvubhbuf5tvt33rn32wefgujw6pfa2v3ybiai6a

When you use ipfs dag get to retrieve it, you can actually traverse the paths, which can be quite useful:

$ ipfs dag get --output-codec json /ipfs/bagaaieraev5tujfaql3zxtvubhbuf5tvt33rn32wefgujw6pfa2v3ybiai6a/foo/bar
"hello, world"

This is possible thanks to this little bit of code we have that traverses the node after running api.Dag().Get():

if len(rp.Remainder()) > 0 {
	remainderPath := ipld.ParsePath(rp.Remainder())
	finalNode, err = traversal.Get(finalNode, remainderPath)
	if err != nil {
		return err
	}
}

However, in the gateway, if we use api.Block().Get() and spit out its output if CID codec = response codec, we won't get traversal. Therefore, fetching /ipfs/bagaaieraev5tujfaql3zxtvubhbuf5tvt33rn32wefgujw6pfa2v3ybiai6a/foo/bar and /ipfs/bagaaieraev5tujfaql3zxtvubhbuf5tvt33rn32wefgujw6pfa2v3ybiai6a on the gateway would produce the exact same output.

I think we should do the traversal as it will match the current behaviour of ipfs dag get and it will allow users to traverse JSON and CBOR documents using the gateway. If we don't do it, the UX will be weird in the sense that multiple, valid, paths in the gateway will return the same exact output.
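
The difference between returning the whole block and traversing into it can be illustrated with plain `encoding/json`. This is a simplified stand-in, not the go-ipld-prime traversal quoted above: the hypothetical `traverse` function only descends through JSON objects and knows nothing about links or CIDs.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// traverse decodes doc and walks it along a slash-separated path,
// descending only through JSON objects. An empty path returns the root.
func traverse(doc []byte, path string) ([]byte, error) {
	var node interface{}
	if err := json.Unmarshal(doc, &node); err != nil {
		return nil, err
	}
	for _, seg := range strings.Split(strings.Trim(path, "/"), "/") {
		if seg == "" {
			continue // empty path or doubled slash
		}
		obj, ok := node.(map[string]interface{})
		if !ok {
			return nil, fmt.Errorf("cannot descend into %q: not a map", seg)
		}
		child, ok := obj[seg]
		if !ok {
			return nil, fmt.Errorf("no such field %q", seg)
		}
		node = child
	}
	return json.Marshal(node)
}

func main() {
	doc := []byte(`{"foo":{"bar":"hello, world"}}`)
	for _, p := range []string{"", "foo/bar"} {
		out, err := traverse(doc, p)
		if err != nil {
			panic(err)
		}
		fmt.Printf("path %q -> %s\n", p, out)
	}
}
```

Without the loop, both paths would yield the full document, which is the "same exact output" problem described above.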

[Whatever decision comes out of this must be added to the IPIP for clarity]

@rvagg
Member

rvagg commented Oct 20, 2022

If we want this to work as if it's the DAG API then yeah, we should probably go the whole way on supporting pathing, and therefore re-encoding sub-blocks. Is that the intention here (sorry, not being as familiar with the gateway code myself)? Does it make sense to use the DAG API for all of it, or only when you path?

Also note that there's subtle differences in pathing between /ipfs/ (or bare I think?) and /ipld/ prefixes for the DAG API so we'll have to also address that—for historical reasons, pathing will drop to named link pathing for dag-pb blocks unless you /ipld prefix. So when you have a dag-pb block (let's say Qmaaaa) that looks like {Links:[{Name:"foo",Hash:Qmbbbb...}]} and you dag get Qmaaaa/foo then it'll resolve through to whatever Qmbbbb is. But if you dag get /ipld/Qmaaaa/Links/Name then it'll do proper IPLD pathing through the block to return "foo".

I'm pretty sure UnixFS pathing is the intended behaviour of the gateway, so this would be the default for dag-pb input blocks. Does that make sense even when you change output codec? I guess so 🤷 but we should probably be clear about it since it's one of those weird edges that history provides us with.

@hacdias
Member Author

hacdias commented Oct 20, 2022

Is that the intention here (sorry, not being as familiar with the gateway code myself)? Does it make sense to use the DAG API for all of it, or only when you path?

When you request something normally through the gateway, it uses the UnixFS API. This PR adds (DAG-)JSON and (DAG-)CBOR support. We're always using the DAG API unless the codec in the CID is the same as the requested format. In that case, we just print the block bytes.

However, that will make it not support pathing. We could, of course, add more ifs to make the logic flow through the DAG API if there are paths, but I feel that would make this implementation unnecessarily complicated.

I'm pretty sure UnixFS pathing is the intended behaviour of the gateway, so this would be the default for dag-pb input blocks.

Yes, that's how it works.

Does that make sense even when you change output codec? I guess so 🤷 but we should probably be clear about it since it's one of those weird edges that history provides us with.

I think it makes sense. We should, however, note that there's also an IPLD gateway spec going on in ipfs/specs#293. I wonder how much overlap this would have.

I think we should keep the same behaviour as ipfs dag get under the /ipfs namespace:

  • If original codec is dag-pb: do UnixFS pathing.
  • If original codec is dag-json, dag-cbor, json, cbor: do regular pathing (or following links for DAG-).

But this will involve not printing the raw bytes if CID codec = requested codec (basically removing these lines).

@lidel
Member

lidel commented Oct 20, 2022

  • If original codec is dag-pb: do UnixFS pathing.
  • If original codec is dag-json, dag-cbor, json, cbor: do regular pathing (or following links for DAG-).

Would it be safer if we supported pathing only for traversable (dag-) codecs?
Can we keep json and cbor boring, just return them as-is (if there is a path, then error 404), and do pathing only for DAG- variants?
This would force people to use DAG- variants for advanced things and keep the specs, fixtures, and API surface smaller.

I don't think Gateway should be responsible for modifying plain JSON docs that have nothing to do with DAGs, nor have strictness of DAG-JSON. Feels safer: if I am wrong, we can add it in the future, but it's easier to add it later than remove something we are not happy with.

Q: why can't we return raw bytes if CID codec == requested DAG- codec and there is no sub-path?
Sounds like a sane optimization for the cost of additional IF added before passing things to expensive DAG API?

@hacdias
Member Author

hacdias commented Oct 21, 2022

Can we keep json and cbor boring, just return them as-is (if there is a path, then error 404), and do pathing only for DAG- variants?

We can. I'll add that to the TODOs.

Why can't we return raw bytes if CID codec == requested DAG- codec and there is no sub-path?

We can. But what if the CID codec is a DAG- codec, but the raw data inside is invalid? E.g., added via ipfs block put. I was under the impression that we would error on invalid data. But this logic goes to all codecs.

@lidel
Member

lidel commented Oct 21, 2022

But what if the CID codec is a DAG- codec, but the raw data inside is invalid? I was under the impression that we would error on invalid data. But this logic goes to all codecs.

You are right (I still have brain fog sometimes).
We should follow spec and validate DAG-, always. 👍

(If someone wants better performance, they can skip validation by requesting it as raw – as we do in ipfs-geoip, which does dag-json parsing in userland via @ipld/dag-cbor)
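
Read together, the last few comments suggest a decision along the following lines. This is only a sketch of the rules as stated in this thread, with hypothetical names (`decide`, `action`) and codec names as plain strings. It is not the handler code from the PR, and it leaves the HTTP status for rejected paths open.

```go
package main

import "fmt"

// action describes what the gateway would do with a request.
type action string

const (
	serveRawBytes   action = "serve block bytes as-is"
	decodeAndEncode action = "decode via DAG API (validates), then encode"
	rejectPath      action = "reject: pathing not supported for this codec"
)

// decide applies the rules discussed above:
//   - plain json/cbor blocks are kept "boring": no pathing, returned as-is
//     when the requested format matches the CID codec;
//   - everything else, including DAG- codecs requested in their own format,
//     goes through decoding so that invalid data produces an error.
func decide(cidCodec, requested string, hasSubPath bool) action {
	plain := cidCodec == "json" || cidCodec == "cbor"
	if plain {
		if hasSubPath {
			return rejectPath
		}
		if cidCodec == requested {
			return serveRawBytes
		}
	}
	return decodeAndEncode
}

func main() {
	fmt.Println(decide("json", "json", false))
	fmt.Println(decide("json", "json", true))
	fmt.Println(decide("dag-json", "dag-json", false))
}
```

The third case is the one settled in the comment above: even when the CID codec equals the requested DAG- codec, the block is still decoded, and clients that want to skip validation can request the raw block instead.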

@rvagg
Member

rvagg commented Oct 23, 2022

Can we keep json and cbor boring, just return them as-is (if there is a path, then error 404), and do pathing only for DAG- variants?

Is this for output codecs? So the dividing line on pathing would be if the codec has dag-* in front of it then it'll path? Or is it the division that we support pathing for IPLD codecs—ones that materialise links. The others, which we've called "serializations", like json and cbor, don't allow pathing.

Justifying this is a bit tricky though because it seems a bit arbitrary that we do for some codecs and not others. Maybe it's something like: "we support pathing for codecs that materialise links (CIDs) because the most practical application of pathing is to navigate across block boundaries." But note that we could even take that further and only allow pathing where it ends in a link. Still in this arbitrary place of why we support one thing and not another though.

@lidel
Member

lidel commented Oct 23, 2022

"we support pathing for codecs that materialise links (CIDs) because the most practical application of pathing is to navigate across block boundaries."

+1 – my thinking was that the safe way to specify things for now is DAG traversal over CID links (CBOR tag 42), and nothing more.

Rationale is that it makes implementer's work easier: can be done with plain CBOR libs if needed, traversing links between blocks is easy to reason about, and does not require any advanced IPLD knowledge.

On Gateways we have a bunch of headers and behaviors that are based on the CID of the last path segment, and if we start extracting parts of CBOR or JSON documents and return them via pathing, then we need to specify what headers such as Etag look like for such synthetic responses, which balloons complexity.

My take is that it is easier to add things in the future, than remove, so better to start small.
Not saying advanced pathing is bad idea: we will have opportunity to re-evaluate when we add support for IPLD Patch via HTTP POST.

.dag- variants are not necessary – DAG-JSON should have 'Save As'
dialog default to file named .json

see "File extension" section of specs below:

https://www.iana.org/assignments/media-types/application/vnd.ipld.dag-json
https://www.iana.org/assignments/media-types/application/vnd.ipld.dag-cbor
Quality of life improvement, allows user agents to render JSON.
Passing ?format=json to unixfs directory will show JSON in browser.
Includes human-readable error which encourages people to use CBOR Tag 42
for building DAGs with Links.
This ensures user knows why they got this HTML response (due to specific
codec in the specific CID they can inspect), and that they
have clear understanding of options – either preview as JSON, or
download as one of supported formats.
This is a quick, static placeholder that follows the style of dir-index-html.
Future work will refine this, perhaps by reusing parts of https://explore.ipld.io/
@lidel
Member

lidel commented Nov 24, 2022

Quick update: still in review, I am mostly adding HTTP response tests to safeguard newly added behaviors (headers, caching). I should be done by tomorrow/Monday. 🤞

This unifies the way we set headers for DAG- responses, adds missing
ones for HTML response, and adds a bunch of regression tests to ensure
desired behaviors are not changed during future refactors.
Member

@lidel lidel left a comment

@hacdias took this for a spin and looks good. This will have huge impact long term, thank you! 🙏

Pushed some changes around HTTP headers and wrote additional tests:

  • JSON response is now inlined, which allows browsers to render preview instead of forcing the "Save As" dialog.
  • Unsupported pathing returns more meaningful error + error code 501 Not Implemented
  • Unified the way headers are set + added Cache-Control and Etag code
  • Added regression tests for some edge cases and common behaviors
  • Tweaked HTML response to be more presentable and "good enough" for MVP, Kubo 0.18.0-rc1 and production use on public gateways.

Some additional info comments below.
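
As a rough illustration of the header changes listed above, the sketch below builds the response headers for an inlined JSON response. `codecHeaders` is a hypothetical helper. The Etag format and the Cache-Control value are assumptions chosen for the example and may not match what the PR emits.

```go
package main

import (
	"fmt"
	"net/http"
)

// codecHeaders is a hypothetical helper that collects the headers for a
// JSON/CBOR response addressed by an immutable CID.
func codecHeaders(cid, mediaType string) http.Header {
	h := http.Header{}
	h.Set("Content-Type", mediaType)
	// "inline" lets browsers render the body instead of opening "Save As".
	h.Set("Content-Disposition", "inline")
	// Assumed Etag format: the quoted CID. The real format may differ.
	h.Set("Etag", `"`+cid+`"`)
	// Assumed value for immutable content; not taken from the PR.
	h.Set("Cache-Control", "public, max-age=29030400, immutable")
	return h
}

func main() {
	h := codecHeaders("bafyexamplecid", "application/json") // placeholder CID
	for _, k := range []string{"Content-Type", "Content-Disposition", "Etag", "Cache-Control"} {
		fmt.Printf("%s: %s\n", k, h.Get(k))
	}
}
```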

@@ -142,6 +142,11 @@ test_expect_success "Add CARs for path traversal and DAG-PB representation tests
test_should_contain $DAG_PB_CID import_output
'

test_expect_success "GET DAG-JSON with Accept: text/html returns HTML" '
curl -sD - -H "Accept: text/html" "http://127.0.0.1:$GWAY_PORT/ipfs/$DAG_JSON_TRAVERSAL_CID" > curl_output 2>&1 &&
test_should_contain "Content-Type: text/html" curl_output
Member

ℹ️ We were missing Cache-Control and Etag header tests – added them in 7e84856

  • unified the way we set headers for DAG- responses, and added missing ones for HTML + bunch of tests
  • we needed separate tests for HTML responses because it is a different code path AND caching needs to be different: generated HTML may change in future Kubo, so it is not an immutable response, even when it comes from /ipfs
    • the logic for Cache-Control/Etag for HTML response is similar to ones from dir-index-html responses produced by gateway_handler_unixfs_dir.go (see t0116-gateway-cache.sh for prior examples).

Future work:

  • potentially cool project would be to create a single view that handles both unixfs dir listings and non-unixfs dags, and shows more useful info about the DAG (I'm thinking "https://explore.ipld.io 2.0" :)) – not urgent, but something we could start designing in cooperation with gui team.
  • if we ever change how dir-index-html responses behave, the analogous changes should be applied to dag-index-html.

Comment on lines 119 to 121
func (i *gatewayHandler) serveCodecHTML(ctx context.Context, w http.ResponseWriter, r *http.Request, resolvedPath ipath.Resolved, contentPath ipath.Path) {
codecName := mc.Code(resolvedPath.Cid().Prefix().Codec).String()
body := fmt.Sprintf(`<!DOCTYPE html>
Member

ℹ️ Pushed some changes in 0531e93 around the HTML generated here:

It looks like this:

[Screenshot of the generated HTML page, taken 2022-11-24]

We can improve it further in follow-up PRs, but it is good enough for 0.18.0-rc1.

Member

@lidel lidel Nov 25, 2022

cc @juliaxbow @SgtPooki – FYI this PR adds a stub/placeholder for HTML returned when a web browser asks gateways for things other than files and directories.

A potential GUI project for 2023 would be to design a unified experience that can render both directory listings (example) and non-unixfs things like DAG-CBOR / DAG-JSON, and provide more information about the DAG.
Something similar to https://explore.ipld.io (example), but less intimidating.

What would be the best place for tracking that? Should I create issue in https://github.com/ipfs/ipfs-gui?

Member Author

Definitely looks much better than the page I had 😃 Thanks for this!

Member

@lidel lidel Dec 5, 2022

Removed "This CID is not UnixFS" (f084f09), as suggested by @b5 during IPFS Implementers sync.

lidel added a commit to ipfs/specs that referenced this pull request Nov 25, 2022
@lidel lidel changed the title feat(gateway): JSON and CBOR response formats, IPIP 328 feat(gateway): JSON and CBOR response formats (IPIP-328) Nov 25, 2022
hacdias and others added 6 commits November 28, 2022 15:57
Applied feedback from IPFS Implementers Sync,
HTML should be aimed at non-technical people,
and UnixFS in bold is too intimidating.
@lidel
Member

lidel commented Dec 5, 2022

I believe this is ready to be included in 0.18.0-rc1 (#9417) – will merge after CI passes.
Thank you @hacdias!

@lidel lidel enabled auto-merge (squash) December 5, 2022 17:07
@lidel lidel merged commit fdd1965 into master Dec 5, 2022
@lidel lidel deleted the feat/8823 branch December 5, 2022 20:34
hacdias added a commit to ipfs/boxo that referenced this pull request Jan 27, 2023
Development

Successfully merging this pull request may close these issues.

Gateway support for dag-json and dag-cbor responses
3 participants