
Comments on URL-interop.md #10

Open
SimonSapin opened this issue Feb 8, 2017 · 5 comments

Comments

@SimonSapin

I’m reading it at commit 8636551.

86: must have the scheme present

TWUS: Describes in the 4.2 URL parsing section how a parser should
accept URLs without a scheme.

IIRC the TWUS parser only accepts input without a scheme when there’s a base URL. The input is relative, in these cases.

86 has this grammar, which seems equivalent?

URI-reference = URI / relative-ref
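To illustrate the equivalence (using Python's `urllib.parse`, which follows RFC 3986 rather than either spec's exact algorithm, so edge cases may differ): a scheme-less reference only resolves to something usable once a base URL supplies the missing parts.

```python
from urllib.parse import urljoin

# A scheme-relative reference: the base contributes only the scheme.
assert urljoin("http://example.com/a/b", "//host/p") == "http://host/p"

# A path-relative reference: the base contributes scheme, host, and path.
assert urljoin("http://example.com/a/b", "c") == "http://example.com/a/c"

# Without a base there is nothing to resolve against, so the input
# stays a relative reference rather than becoming an absolute URL.
```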

There it also divides parsers into "Non-web-browser implementations"
without specifying how to make that distinction.

In this specific instance, I think "Non-web-browser" means anything that doesn’t also implement https://w3c.github.io/FileAPI/ since the difference between "basic URL parser" and "URL parser" is all about blob: URLs.

TWUS: says a parser must accept one to an infinite amount of slashes

I think this is really not a big deal. It could just as well be capped at 5, but 5 is arbitrary and less theoretically pleasing than the Zero-One-Infinity rule: http://www.catb.org/jargon/html/Z/Zero-One-Infinity-Rule.html
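For special schemes, accepting any run of slashes amounts to collapsing them down to two. A minimal sketch (a regex stand-in, not the TWUS state machine):

```python
import re

def normalize_slashes(url: str) -> str:
    """Collapse any run of one or more slashes right after "scheme:"
    down to exactly two, roughly what the TWUS parser does for
    special schemes like http. Illustrative only."""
    return re.sub(r"^([A-Za-z][A-Za-z0-9+.-]*):/+", r"\1://", url)

assert normalize_slashes("http:///example.com/") == "http://example.com/"
assert normalize_slashes("http://////example.com/") == "http://example.com/"
```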

Real world: 32-bit numbers occur, and are automagically supported by
typical OS-level name resolver functions

When I looked into it, it seemed hard to choose to not support it in such functions. (The most a program could do is recognize such "exotic" IPv4 syntax and reject them with a parse error, if it doesn’t want to resolve the IP address.)
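The "exotic" form in question is a plain 32-bit decimal host, which `inet_addr()`-style functions interpret as an IPv4 address. A sketch of that mapping:

```python
import socket
import struct

def ipv4_from_decimal(n: int) -> str:
    """Convert a 32-bit decimal host like 2130706433 to dotted-quad,
    mirroring what inet_addr()-style resolver functions accept."""
    return socket.inet_ntoa(struct.pack("!I", n))

# http://2130706433/ is the same host as http://127.0.0.1/
assert ipv4_from_decimal(2130706433) == "127.0.0.1"
```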

TWUS: Doesn't specify IDNA 2003 nor 2008, but somehow that's still clear

It specified Unicode TR46, which fully defines algorithms independently of IDNA 2003 or 2008. (Though it is based on the Punycode RFC.)

Real world: at least curl and wget2 ignore "rubbish" entered after the number all the way to the next component divider

Personal opinion: it sounds problematic to silently ignore part of the input?

A TWUS URL thus needs other magic to know where a URL ends.

For example in <a href="…">, HTML syntax defines exactly where the value of the href attribute ends, so there is no need for magic.

If URLs need to be found in the middle of a free-form paragraph of text without any markup, a lot more magic (and heuristics) is required than splitting on spaces. I think defining this does not belong in a URL spec.
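To show why this is heuristic rather than parsing, here is a deliberately naive extractor; every choice in it (which schemes, where to stop, which trailing punctuation belongs to the sentence) is a guess, which is exactly why it doesn't belong in a URL spec:

```python
import re

# Naive heuristic: grab scheme://non-space runs, then trim trailing
# punctuation that usually belongs to the sentence, not the URL.
URL_RE = re.compile(r"https?://\S+")

def find_urls(text: str) -> list[str]:
    return [m.group().rstrip(".,;:!?)") for m in URL_RE.finditer(text)]

assert find_urls("See https://example.com/a, then stop.") == ["https://example.com/a"]
```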

TWUS has a test suite (that only runs in JavaScript-enabled browsers).

Part (arguably the most important part) of this test suite has its test cases in a JSON file that can be used without JavaScript (and is in rust-url).
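The entries in that JSON file are, roughly, plain strings used as comments interleaved with objects carrying an input, an optional base, and either the expected components or a failure flag. The inline sample below is made up for illustration (field names modeled on that format, not copied from the real file):

```python
import json

# Shape of the test data (inline sample, not the real file):
sample = json.loads("""
[
  "Comments are plain strings interleaved with the test objects",
  {"input": "http:///x/", "base": null,
   "href": "http://x/", "protocol": "http:", "host": "x"},
  {"input": "http://[oops/", "base": null, "failure": true}
]
""")

cases = [t for t in sample if isinstance(t, dict)]
should_parse = [t for t in cases if not t.get("failure")]
assert len(cases) == 2 and len(should_parse) == 1
```

A non-browser implementation can iterate such objects, feed each `input` (plus `base`) to its own parser, and compare the result against the expected fields, no JavaScript required.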

@bagder

bagder commented Feb 10, 2017

IIRC the TWUS parser only accepts input without a scheme when there’s a base URL

Right, clearly wrong of me. That's virtually the same as 86. I've removed that mistake.

In this specific instance, I think "Non-web-browser" means anything that doesn’t also implement ...

I suppose that's so too. It got removed as well now when I cleaned up the scheme flaws.

When I looked into it, it seemed hard to choose to not support it in such functions.

Yes, as long as you mean 32-bit numbers and you use the stock name resolver functions. The trickier part is the dotted numerical version that doesn't have exactly four numerical fields. But still, that's not part of 86.

Personal opinion: it sounds problematic to silently ignore part of the input?

Both yes and no. When it comes to curl, the original approach was to only interfere where it had to and pass everything else through as far as it could. So you could send in illegal things in the URL and they would still be used in the end, and that could help users torture their servers with crap other clients wouldn't send.

Over time that has turned out to be harder and a bit error-prone, so we've had to make the parser stricter, but it still has a fairly lenient approach and the focus is that if you pass it a legal URL it should parse it and work with it. Illegal URLs are not always rejected (sort of garbage in, garbage out), but over time I think we're slowly rejecting more and more of them.

For example in <a href="…"> HTML syntax defines exactly where the value href attribute ends, so there is no need for magic.

Right, but when you accept whitespace as part of a URL, you need something else, some other character, to mark where it ends. In HTTP headers that other character is typically a newline. If it is within an <a> tag, I suppose the HTML parser would pass on the length.

I should avoid the use of "magic" there and say another method or another character.

Part (arguably the most important part) of this test suite has its test cases in a JSON file

I drowned in all the other things there when I looked previously, but I agree that it looks fine. I've pushed a change now that links directly to the source JSON file.

Thanks for all the feedback. I've done several commits now to clean up.

@bagder

bagder commented Feb 10, 2017

It specified Unicode TR46, which fully defines algorithms independently of IDNA 2003 or 2008. (Though it is based on the Punycode RFC.)

I'm useless when it comes to anything non-ASCII, so I suppose that's why I'm extra confused by all these IDNA things.

Are you saying that the TR46 document makes it clear to you how to encode IDN host names when doing name resolves and then works with everything, including German ß's?

@SimonSapin

SimonSapin commented Feb 15, 2017

Are you saying that the TR46 document makes it clear to you how to encode IDN host names when doing name resolves

Yes.

and then works with everything, including German ß's?

If you mean "Is implementing that spec sufficient for achieving interoperability with every domain, TLD, and registrar in the world", I don’t know. I assume Anne chose TR46 over the alternatives because he thought it would provide better, if not perfect, interop.

I just googled for what he wrote about this and found:

https://annevankesteren.nl/2014/06/url-unicode

The reasoning is that it provides an interface compatible with IDNA2003, including almost identical processing, but is based on the IDNA2008 dataset.
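The German ß mentioned earlier is exactly where that IDNA2003-compatible processing shows. Python's built-in `idna` codec implements IDNA2003, so it can demonstrate the mapping (the IDNA2008 half is noted only in a comment, since it needs a third-party package):

```python
# IDNA2003's nameprep case-folds German ß to "ss" before encoding:
assert "faß.de".encode("idna") == b"fass.de"

# Under IDNA2008 (and UTS 46 nontransitional processing), ß is instead
# a valid label character and Punycode-encodes as "xn--fa-hia.de".
# That branch needs the third-party `idna` package, so it is only
# noted here as a comment.
```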

@bagder

bagder commented Mar 6, 2017

If you mean "Is implementing that spec sufficient for achieving interoperability with every domain, TLD, and registrar in the world", I don’t know

Then I would say that it isn't that clear to you either. A clear spec would specify the single algorithm that should be used. (And "should" doesn't mean that everyone adheres to that spec, any more than with any other spec in the world.)

@SimonSapin

You’re either misunderstanding or misrepresenting what I wrote. TWUS does specify a single algorithm that, in the opinion of its editors, should be used.

My “I don’t know” was a response to your “works with everything”, in the sense that “everything” is an unbounded set of things and so that question can never be answered. No single person knows all the corner cases of every piece of software that exists in the world.

However if and when we do find out that some aspect of TWUS doesn’t work with something, we can try and tweak TWUS to fix that problem.
