leave html entities untouched, it breaks pages with weird encodings #112

Closed
loreb opened this issue Jan 23, 2020 · 5 comments
Labels
bug Bugs and defects (faults of monolith, not target websites)

Comments

@loreb

loreb commented Jan 23, 2020

Example: http://www.the-spoiler.com/RPG/New.World.Computing/might.and.magic6.3/mm6.htm
Look at the copyright notice: it's "&copy;" in the original, and monolith translates that to the UTF-8 copyright character. That is wrong because the page declares charset=iso-8859-1, so it renders as a bogus character followed by the copyright sign (tested on Windows, Linux, and OpenBSD).

I have run into other pages like that; this is just the one where I noticed it.

Testcase:

<html>
<head>
	<!-- comment out the charset to make it work !-->
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
copyright &copy; someone
</body>
</html>
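
The symptom described above can be reproduced outside a browser. The following is a minimal sketch, assuming the saved file contains the UTF-8 bytes for the copyright sign while the page still declares iso-8859-1; the variable names are illustrative only and nothing here is taken from monolith itself.

```python
# Sketch: reproduce the "bogus character + copyright sign" rendering.
# Assumption: the entity was decoded to U+00A9 and written out as UTF-8,
# while the meta tag still tells the browser to read the bytes as iso-8859-1.
original = "copyright \u00a9 someone"        # what "&copy;" decodes to
saved_bytes = original.encode("utf-8")       # U+00A9 becomes the two bytes C2 A9
misread = saved_bytes.decode("iso-8859-1")   # each byte is read as one character

print(misread)  # copyright Â© someone
```

Each of the two UTF-8 bytes is shown as a separate Latin-1 character, which matches the extra character in front of the copyright sign.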
@snshn
Member

snshn commented Jan 23, 2020

Hi Lorenzo,

Thank you very much for reporting this. It's likely the result of html5ever (the HTML parsing library we're using) treating every document as utf-8 by default and not automatically parsing that charset meta tag.

More info on the likely cause:
servo/html5ever#18

If there's no flag available to preserve HTML entities, it'll likely be worth substituting them back in manually for non-UTF-8 documents upon save.
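
One way to picture the manual substitution mentioned above is sketched here in Python with only the standard library. It is an illustration, not monolith's code: `escape_non_ascii` is a hypothetical helper that turns every non-ASCII character back into an entity so the output stays valid under any ASCII-compatible charset.

```python
from html.entities import codepoint2name


def escape_non_ascii(text: str) -> str:
    """Replace non-ASCII characters with HTML entities (hypothetical helper)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                          # plain ASCII passes through
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")   # named entity, e.g. &copy;
        else:
            out.append(f"&#{cp};")                  # numeric reference fallback
    return "".join(out)


escaped = escape_non_ascii("copyright \u00a9 someone")
print(escaped)  # copyright &copy; someone
```

Because the result is pure ASCII, it reads the same whether the page is declared as iso-8859-1 or UTF-8. The built-in `"xmlcharrefreplace"` error handler of `str.encode` gives a similar effect using numeric references only.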

@loreb
Author

loreb commented Jan 23, 2020

Ouch, a five-year-old bug... well, thank you for the info and the quick reply!

@snshn snshn added the bug Bugs and defects (faults of monolith, not target websites) label May 25, 2020
@moriakijp

Is there any progress on this issue?

@snshn
Member

snshn commented Jan 29, 2021

That's the next big patch I'm currently working on. What encoding is not working in your case, @moriakijp?

@snshn
Member

snshn commented Feb 24, 2021

The fix is now in master. From what I know so far, html5ever (the HTML parser module) parses everything as Unicode, and when monolith re-assembles the document, the original encoding information is lost: the document becomes UTF-8 no matter what charset it originally had. The current solution is to just set all meta charset tags to indicate it's a UTF-8 document. Not ideal, but it seems to work fine for archiving documents, even though they lose their original encoding.
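
The approach described above (emit UTF-8 and rewrite the charset declaration to match) can be sketched as follows. This is a simplified Python illustration that uses a regular expression on the markup. It is not monolith's implementation, and `force_utf8_meta` and `to_utf8_document` are hypothetical names.

```python
import re

# Matches the charset value inside a <meta ...> tag, both in the
# <meta charset="..."> form and in the http-equiv/content form.
_META_CHARSET = re.compile(
    r"""(<meta\b[^>]*\bcharset\s*=\s*["']?)([\w.:-]+)""",
    re.IGNORECASE,
)


def force_utf8_meta(html: str) -> str:
    """Point every meta charset declaration at UTF-8 (hypothetical helper)."""
    return _META_CHARSET.sub(r"\g<1>utf-8", html)


def to_utf8_document(raw: bytes, source_charset: str) -> bytes:
    """Decode with the original charset, fix the meta tags, re-encode as UTF-8."""
    text = raw.decode(source_charset)
    return force_utf8_meta(text).encode("utf-8")


page = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
print(force_utf8_meta(page))
# <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
```

With the declaration and the bytes in agreement, the copyright sign from the test case renders correctly, at the cost of the document no longer being in its original encoding.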

@snshn snshn closed this as completed Feb 24, 2021