New work item: crate r2c2_iri #5

Open
wants to merge 1 commit into
base: main

Conversation

pchampin
Collaborator

@pchampin pchampin commented Mar 17, 2025

This is a proposal for a new work item. Please provide feedback (as PR reviews, comments or 👍/👎) by 2025-03-31.

This "utility" crate is meant to provide:

  • lightweight types wrapping str with the guarantee that it is a valid IRI / IRI reference
  • MAYBE a type for resolving multiple relative IRI references against a fixed base IRI (although it may be better to leave this to specific implementations)

I have published an analysis of 4 similar crates (including my own, sophia_iri), and came up with a number of lessons learned, in order to inform the design choices of r2c2_iri. Full disclosure, the conclusions lean toward the design choices of sophia_iri -- in fact, I had performed a preliminary version of this analysis while working on Sophia, so this is not a coincidence.

edited: make the second bullet optional, following @Tpt's comment

@pchampin added the new-work-item label (must label PRs proposing a new work item for the CG) on Mar 17, 2025
@@ -0,0 +1,14 @@
[package]
name = "r2c2_iri"
Collaborator

nit: it seems most crates are using - instead of _ nowadays, hence r2c2-iri

Collaborator Author

looking at crates.io, it does not look like one is largely dominating the other...
I personally prefer the underscore (_) to keep the name of the crate consistent with how it is named in code.

But I won't die on that hill.

Collaborator

Indeed, this seems to be unclear. My bad. I tend to prefer - for consistency with the keys in Cargo.toml. I won't die on this hill either; let's choose something and stick with it.

@Tpt
Collaborator

Tpt commented Mar 18, 2025

Thank you for this!

I am not sure this is the first work item I would like to tackle. Indeed, IRI resolution and validation seem to me mostly an implementation concern that is not much exposed in public APIs. Hence, I am not sure it will be a key enabler for interoperability. However, it has the advantage of being fairly self-contained.

On IRI resolution, it's sadly a bit of an unspecified minefield, caught between the RDF approach ("do not change IRIs") and the web browser approach ("normalize everything"). For example, should resolving foo/../bar against http://example.com/./a/baz return http://example.com/a/bar, http://example.com/./a/foo/../bar, or http://example.com/./a/bar? Choosing one approach or the other may be application-specific.
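To make the example above concrete, here is a minimal sketch of the two path-level steps that RFC 3986 defines for resolution: merging the reference path with the base path (section 5.2.3) and removing dot segments (section 5.2.4). The function names are illustrative and not taken from any existing crate, and `merge` assumes a base with a non-empty path, so the "authority with empty path" case of the RFC is left out.

```rust
/// Merge a relative reference path with a base path (RFC 3986, section 5.2.3).
/// Assumption: the base path is non-empty; the RFC's special case for a base
/// with an authority and an empty path is not handled in this sketch.
fn merge(base_path: &str, reference_path: &str) -> String {
    match base_path.rfind('/') {
        // Keep everything up to and including the last '/' of the base.
        Some(i) => format!("{}{}", &base_path[..=i], reference_path),
        None => reference_path.to_string(),
    }
}

/// Remove "." and ".." segments from a path (RFC 3986, section 5.2.4).
fn remove_dot_segments(path: &str) -> String {
    let mut input = path;
    let mut output = String::new();
    while !input.is_empty() {
        if let Some(rest) = input.strip_prefix("../") {
            // Step A: drop a leading "../"
            input = rest;
        } else if let Some(rest) = input.strip_prefix("./") {
            // Step A: drop a leading "./"
            input = rest;
        } else if input.starts_with("/./") {
            // Step B: replace a leading "/./" with "/"
            input = &input[2..];
        } else if input == "/." {
            input = "/";
        } else if input.starts_with("/../") {
            // Step C: replace a leading "/../" with "/" and drop the last output segment
            input = &input[3..];
            pop_last_segment(&mut output);
        } else if input == "/.." {
            input = "/";
            pop_last_segment(&mut output);
        } else if input == "." || input == ".." {
            // Step D: a lone "." or ".." disappears
            input = "";
        } else {
            // Step E: move the first segment (with its leading '/', if any) to the output
            let start = if input.starts_with('/') { 1 } else { 0 };
            let end = input[start..].find('/').map_or(input.len(), |i| i + start);
            output.push_str(&input[..end]);
            input = &input[end..];
        }
    }
    output
}

/// Remove the last segment of the output buffer, including its leading '/'.
fn pop_last_segment(output: &mut String) {
    match output.rfind('/') {
        Some(i) => output.truncate(i),
        None => output.clear(),
    }
}
```

With these helpers, `merge("/./a/baz", "foo/../bar")` yields `/./a/foo/../bar` (the "do not touch" candidate), and passing that through `remove_dot_segments` yields `/a/bar`, which is the path RFC 3986 resolution produces for this example.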

I would love to avoid the dependency on regex because it's quite heavyweight. If we move this forward, maybe we can start with regex and then work on a handwritten implementation that might be even faster because it would skip the regex parsing and validator construction cost.

@pchampin force-pushed the new-work-item-iri branch from 5b8e665 to 27c2ec1 on March 18, 2025 12:06
@pchampin
Collaborator Author

@Tpt thanks for your comments.

I agree that IRI resolution may be left to implementations.

On the other hand, the common API will need to return "something that is a valid IRI". We could either use a trait for that (something like IsIri: Borrow<str>), or a very simple type (something like struct Iri(str) or struct Iri<T>(T) where T: Borrow<str>). I was leaning towards the 2nd solution, and am therefore proposing this work item to define such a type.
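For comparison, the first (trait-based) option could look roughly like the sketch below. Only the name `IsIri` and its `Borrow<str>` bound come from the comment above; `as_iri_str`, `MyIri` and `describe` are hypothetical names added for illustration.

```rust
use std::borrow::Borrow;

/// Hypothetical marker trait: implementors promise that their text is a valid IRI.
/// The compiler cannot enforce that promise; it rests on each implementation.
pub trait IsIri: Borrow<str> {
    /// Convenience accessor (an assumption of this sketch, not part of the proposal).
    fn as_iri_str(&self) -> &str {
        Borrow::<str>::borrow(self)
    }
}

/// A stand-in for some implementation's own IRI type.
pub struct MyIri(String);

impl Borrow<str> for MyIri {
    fn borrow(&self) -> &str {
        &self.0
    }
}

// The implementation opts in and takes responsibility for validity.
impl IsIri for MyIri {}

/// A consumer written against the trait works with any implementor.
pub fn describe<I: IsIri>(iri: &I) -> String {
    format!("<{}>", iri.as_iri_str())
}
```

In this shape every implementation has to provide its own type and trait impl, and consumers have to be generic over `IsIri`.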

Seems to me that the 2nd option makes it a bit easier on users (i.e. developers consuming RDF crates via the R2C2 API), and should be relatively straightforward and uncontroversial (but I may be overly optimistic here).
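A minimal sketch of the second option, the generic wrapper `Iri<T>` with `T: Borrow<str>`, might look as follows. The validator `is_plausible_iri` is a deliberately crude placeholder (it only checks for a scheme-like prefix and rejects a few forbidden characters); it is not RFC 3987 validation, and `InvalidIri`, `as_str` and `into_inner` are names assumed for this example.

```rust
use std::borrow::Borrow;

/// Error carrying the rejected text (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub struct InvalidIri(pub String);

/// Wrapper whose inner text has passed the check in `Iri::new`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Iri<T: Borrow<str>>(T);

impl<T: Borrow<str>> Iri<T> {
    /// Wrap `text` if it passes the (placeholder) validity check.
    pub fn new(text: T) -> Result<Self, InvalidIri> {
        if is_plausible_iri(Borrow::<str>::borrow(&text)) {
            Ok(Iri(text))
        } else {
            Err(InvalidIri(Borrow::<str>::borrow(&text).to_string()))
        }
    }

    /// Borrow the underlying text.
    pub fn as_str(&self) -> &str {
        Borrow::<str>::borrow(&self.0)
    }

    /// Give back the wrapped value.
    pub fn into_inner(self) -> T {
        self.0
    }
}

/// PLACEHOLDER check, NOT a real IRI validator: requires a scheme-like prefix
/// followed by ':' and rejects whitespace, control and a few forbidden characters.
fn is_plausible_iri(text: &str) -> bool {
    let Some((scheme, rest)) = text.split_once(':') else {
        return false;
    };
    let mut chars = scheme.chars();
    let scheme_ok = matches!(chars.next(), Some(c) if c.is_ascii_alphabetic())
        && chars.all(|c| c.is_ascii_alphanumeric() || matches!(c, '+' | '-' | '.'));
    let rest_ok = !rest.chars().any(|c| {
        c.is_whitespace()
            || c.is_control()
            || matches!(c, '<' | '>' | '"' | '{' | '}' | '|' | '\\' | '^' | '`')
    });
    scheme_ok && rest_ok
}
```

Because `T` is only required to implement `Borrow<str>`, the same type covers borrowed text (`Iri<&str>`) and owned text (`Iri<String>`), and users deal with one concrete type rather than a trait bound.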

On IRI resolution, it's sadly a bit of an unspecified minefield

Is it really unspecified, or is it just that WHATWG and browsers are trying to push an algorithm that is subtly different from what RFC3986 says? (honest question, I don't have the answer)

[about the regex dependency]

No problem with your proposed 2-staged approach.

@Tpt
Collaborator

Tpt commented Mar 18, 2025

the common API will need to return "something that is a valid IRI".

This is an interesting topic. If we enforce in the API that IRIs must be valid, then code will have to do IRI validation in a lot of places or rely on an Iri::new_unchecked-like method. Another option might be to do as the RDF/JS API does: not enforce that the "NamedNode" IRI is valid, and use a plain &str there. It would also make it easier to keep compatibility with slightly invalid datasets.

Is it really unspecified, or is it just that WHATWG and browsers are trying to push an algorithm that is subtly different from what RFC3986 says? (honest question, I don't have the answer)

There is indeed the WHATWG URL standard, which differs from RFC3986 by mandating normalisation of escape sequences and allowing some IRI processing that is invalid according to RFC3987 (resolving relative file:// URLs with Windows paths' \...).

I was thinking more about small issues in RDF and its syntaxes such as Turtle, which mandate not touching IRIs at all except during relative IRI reference resolution, which also affects absolute IRIs. See this issue for example. Sorry, I did not remember RFC3986 correctly and forgot that it always mandates dot-segment removal when parsing relative IRIs. Anyway, I tend to think our library can follow the RFC3986 resolution algorithm closely and leave this issue up to parser implementations.

@pchampin
Collaborator Author

This is an interesting topic. If we enforce in the API that IRI must be valid then it means that code will have to do IRI validation in a lot of places or rely on an Iri::new_unchecked-like method.

Yes. Either the text is known to be a valid IRI, and Iri::new_unchecked should be used, or it is not known to be valid, and then it should be validated.
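The two entry points described above could be sketched as follows. This is only an illustration: `looks_like_iri` is a hypothetical placeholder rather than a real IRI validator, and running the check under `debug_assert!` in the unchecked constructor is one possible convention, not something decided in this thread.

```rust
use std::borrow::Borrow;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Iri<T: Borrow<str>>(T);

impl<T: Borrow<str>> Iri<T> {
    /// For text of unknown validity: check it and reject it on failure.
    pub fn new(text: T) -> Option<Self> {
        if looks_like_iri(Borrow::<str>::borrow(&text)) {
            Some(Iri(text))
        } else {
            None
        }
    }

    /// For text already known to be valid (e.g. produced by a parser that
    /// validated it). The caller takes responsibility; the check only runs
    /// in debug builds and is compiled out in release builds.
    pub fn new_unchecked(text: T) -> Self {
        debug_assert!(looks_like_iri(Borrow::<str>::borrow(&text)));
        Iri(text)
    }

    pub fn as_str(&self) -> &str {
        Borrow::<str>::borrow(&self.0)
    }
}

/// PLACEHOLDER check, NOT real IRI validation: a non-empty alphanumeric
/// prefix before the first ':' and no whitespace anywhere.
fn looks_like_iri(text: &str) -> bool {
    match text.split_once(':') {
        Some((scheme, _)) => {
            !scheme.is_empty()
                && scheme.chars().all(|c| c.is_ascii_alphanumeric())
                && !text.contains(char::is_whitespace)
        }
        None => false,
    }
}
```

The cost of validation is then paid once, where untrusted text enters the system, and code that already holds validated text can skip it.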

Another option might be to do as the RDF/JS API does: not enforce that the "NamedNode" IRI is valid, and use a plain &str there. It would also make it easier to keep compatibility with slightly invalid datasets.

I can sympathize with a specific implementation going down that path (that's what mine does, to some extent). But for a common API aiming at interoperability, I'd rather strictly follow the standard...

@Tpt
Collaborator

Tpt commented Mar 19, 2025

I can sympathize with a specific implementation going down that path (that's what mine does, to some extent). But for a common API aiming at interoperability, I'd rather strictly follow the standard...

Makes sense! Agreed!

@pchampin
Collaborator Author

Thinking a little more about this...

If this crate does not include IRI resolution, and is limited to a wrapper type guaranteeing valid IRI syntax, then it may not make sense to have a separate crate for this: it will be a very small crate, and the common API will need other similar wrappers (e.g. for language tags), so it might just make sense to bundle the IRI wrapper in a bigger crate for terms...

@Tpt
Collaborator

Tpt commented Mar 19, 2025

If this crate does not include IRI resolution, and is limited to a wrapper type guaranteeing valid IRI syntax, then it may not make sense to have a separate crate for this

This is an interesting topic. If we merge it into the bigger crate for terms, then the possible IRI resolution library will need to depend on the term library. But I am not sure it's a big deal; outside of RDF usage, I guess people would be much better off with the url crate.

2 participants