Non-ascii urls fail to validate #57

Open
mazux opened this issue Oct 6, 2017 · 18 comments

mazux commented Oct 6, 2017

On multi-language websites, links may be in French, Romanian, or other languages, so they won't be registered because of this filtering process.
FILTER_VALIDATE_URL, as documented in the PHP manual, will:

only find ASCII URLs to be valid; internationalized domain names (containing non-ASCII characters) will fail.

So, the sitemap will only contain the English links.
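The filter's ASCII-only behaviour can be illustrated with a quick sketch (Python here purely for illustration, since the project itself is PHP; the URL is a made-up example):

```python
# FILTER_VALIDATE_URL only accepts ASCII, so an IRI like this fails validation.
# "http://example.com/café" is a hypothetical example URL.
url = "http://example.com/café"
is_ascii = url.isascii()
print(is_ascii)  # False -> an ASCII-only validator rejects it
```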


vezaynk commented Oct 6, 2017

A temporary fix could be to remove that check: if the URL is not valid, the cURL request should fail and the URL would not be added to the sitemap anyway.

@vezaynk changed the title from "Non-ascii urls want registaed or scanned" to "Non-ascii urls fail to validate" on Oct 7, 2017
@francisek

A better way would be to urlencode the path parts and convert the domain to punycode. I can PR something like this.
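A minimal sketch of that idea (in Python rather than the project's PHP, purely to illustrate; the host and path are example values):

```python
from urllib.parse import quote

# Example values only: punycode the host via the stdlib "idna" codec,
# and percent-encode the path as UTF-8.
host = "müsic.com".encode("idna").decode("ascii")
path = quote("/motörhead")
print(f"http://{host}{path}")
```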

francisek added a commit to francisek/Sitemap-Generator-Crawler that referenced this issue Oct 7, 2017

vezaynk commented Oct 7, 2017

I am having trouble identifying a source telling me that sitemaps can contain non-ascii characters.

The official specs say that it needs to be a valid URL without defining the RFC.

@SignpostMarv

The official specs say that it needs to be a valid URL without defining the RFC.

https://www.sitemaps.org/protocol.html#escaping

Please check to make sure that your URLs follow the RFC-3986 standard for URIs, the RFC-3987 standard for IRIs, and the XML standard.

Not 100% sure those are the relevant RFCs for your use-case?


vezaynk commented Oct 7, 2017

I can't read properly, I guess. Yes, those are relevant.

How do we know how to encode? It says to encode characters with whatever the server wants. Do we somehow retrieve the server's charset, or is there something more universal?


ghost commented Oct 7, 2017

Isn't the web server announcing the encoding in the HTTP headers?


vezaynk commented Oct 7, 2017

It probably does somewhere in the headers. My question was about something more universal which would work everywhere.

Otherwise it will be a game of trying to cover every encoding scheme in active use.


ghost commented Oct 7, 2017

You can't predict or limit the set of encoding schemes. They can be freely set to whatever and passed in the HTTP headers by specifying it in .htaccess:

AddCharset UTF-8 .html
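For what it's worth, the advertised charset can be pulled out of the Content-Type header once a response arrives. A minimal sketch (Python for illustration; the header value below is a made-up example, not a real server response):

```python
from email.message import Message

# Parse the charset parameter out of a Content-Type header value.
# "text/html; charset=UTF-8" is a made-up example header.
msg = Message()
msg["Content-Type"] = "text/html; charset=UTF-8"
charset = msg.get_content_charset()  # stdlib coerces it to lower case
print(charset)
```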


vezaynk commented Oct 7, 2017

That charset is for document encoding, not URLs. They won't necessarily be the same.


ghost commented Oct 7, 2017

IMHO URLs are part of the document, so they are encoded with the same charset unless the page/site is completely broken.


vezaynk commented Oct 7, 2017

They obviously are. I am not talking about the links in the document, I am talking about the links you are requesting from the server.

If you request http://example.com/â as utf-8 from a server that wants iso-xxxx, it will misread your request.

I may also be misunderstanding it myself... I'm just a first-year CS student.

@francisek

IMO, since the spec does not explicitly say whether the URL can be raw UTF-8 or anything else, it would be safest to fully encode URLs so they end up as plain ASCII.
The problem is not only in the path or query string; domains may also contain extended characters, e.g. http://müsic.com/motörhead would be encoded as http://xn--msic-0ra.com/mot%C3%B6rhead


ghost commented Oct 7, 2017

You should use the same encoding as the target page/website is using. By default, the URL should be encoded with the same encoding as the originating page; this is how browsers work, and I see no reason why the spider should be any different. UTF-8 is the fallback if the encoding is missing.

Check this out: https://en.wikipedia.org/wiki/Internationalized_Resource_Identifier

Personally I think that allowing anything else but plain ASCII in URLs is absolute madness...


vezaynk commented Oct 7, 2017

You should use the same encoding as the target page/website is using.

That would be really ideal. However, there are many, many encoding schemes, each of which would have to be supported if that is the way forward.

IRI and IDN were a mistake and break everything.

Since UTF-8 is the fallback and the de facto standard for encoding everything these days when ASCII is not enough, we could just use it for everything. Servers are smart enough to figure it out, right?


ghost commented Oct 7, 2017

I am pretty sure that if I configure my web server to accept and send ASCII only, enforcing UTF-8 will break the spider once I put some non-ASCII characters into URLs. This can easily be tested if you want, but as you said, UTF-8 with %-encoding for everything is the only universally sane option.


vezaynk commented Oct 7, 2017

If you put unicode into a server which only accepts ascii, it breaks regardless of what you do.

For what needs to be done, in summary

  • encode IDNs with the xn-- (punycode) representation
  • continue percent-encoding normal ASCII URLs
  • encode IRIs as UTF-8 with %-escapes where possible

Did I get it right?


ghost commented Oct 7, 2017

I would simply do UTF-8 with %-encoding for everything except the protocol and domain name, and encode IDNs with the xn-- representation. If you mix multiple encodings in URL handling, you will not be able to detect duplicates etc. properly.
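Putting that recipe together, a rough sketch (Python for illustration; `normalize_url` is a hypothetical helper, and userinfo/fragment handling is glossed over):

```python
from urllib.parse import urlsplit, urlunsplit, quote

def normalize_url(url):
    """Sketch: punycode the host, percent-encode path and query as UTF-8."""
    parts = urlsplit(url)
    # IDN host -> xn-- representation via the stdlib "idna" codec
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # '%' is left in the safe set so already-encoded URLs are not double-encoded
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, netloc, path, query, parts.fragment))

print(normalize_url("http://müsic.com/motörhead"))
```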


vezaynk commented Oct 7, 2017

In practice it comes down to the same thing. "a" will be "a" in both ASCII and UTF-8; international characters are the only thing that will need special handling via UTF-8.

4 participants