leave html entities untouched, it breaks pages with weird encodings #112

loreb · 2020-01-23T15:10:46Z

Example: http://www.the-spoiler.com/RPG/New.World.Computing/might.and.magic6.3/mm6.htm
Look at the copyright notice: it's "&copy;" in the original. monolith translates that to the UTF-8 copyright character, which is wrong because the page declares charset=iso-8859-1, so it renders as a bogus character followed by the copyright sign (tested on Windows/Linux/OpenBSD).

I have run into other pages like that; this is just the one where I noticed it.

Testcase:

<html>
<head>
	<!-- comment out the charset to make it work !-->
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
copyright &copy; someone
</body>
</html>
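
For readers who want to see the byte-level cause of the symptom, here is a minimal Python sketch. It is an illustration only and not part of monolith; the variable names are made up for this example. It assumes the saved file is written as UTF-8 while the browser still follows the page's iso-8859-1 declaration.

```python
# Illustration only: reproduces the garbled copyright sign outside of monolith.
original_entity = "&copy;"   # what the source page contains (pure ASCII)
converted_char = "\u00a9"    # the literal copyright sign that replaces the entity

# Written out as UTF-8, the copyright sign takes two bytes.
saved_bytes = converted_char.encode("utf-8")

# A browser that honors charset=iso-8859-1 decodes each byte as one character.
rendered = saved_bytes.decode("iso-8859-1")

# The untouched entity is plain ASCII, so it survives the same round trip.
entity_round_trip = original_entity.encode("utf-8").decode("iso-8859-1")

print(saved_bytes)        # b'\xc2\xa9'
print(rendered)           # Â©
print(entity_round_trip)  # &copy;
```

The extra "Â" is the first UTF-8 byte (0xC2) shown as its own Latin-1 character, which matches the "bogus character + the copyright sign" described above.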

snshn · 2020-01-23T15:25:47Z

Hi Lorenzo,

Thank you very much for reporting this. It's likely the result of html5ever (the HTML parsing library we're using) treating every document as utf-8 by default and not automatically parsing that charset meta tag.

More info on the likely cause:
servo/html5ever#18

If there's no flag available to preserve html entities, it'll likely be worth substituting those manually for non-utf8 documents upon save.
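
A rough sketch of what such a manual substitution could look like, written in Python for brevity rather than in monolith's own Rust. The helper name `escape_non_ascii` is hypothetical, and the approach assumes the target encoding is ASCII-compatible (as iso-8859-1 is). It emits numeric character references, so it would not restore the original named entity such as `&copy;`.

```python
def escape_non_ascii(html_text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference.

    Hypothetical helper for illustration; not monolith's implementation.
    """
    # "xmlcharrefreplace" turns each unencodable character into &#NNN;
    return html_text.encode("ascii", "xmlcharrefreplace").decode("ascii")


saved = escape_non_ascii("copyright \u00a9 someone")
print(saved)  # copyright &#169; someone
```

Because the output is pure ASCII, it decodes identically whether the page declares UTF-8 or iso-8859-1.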

loreb · 2020-01-23T15:32:48Z

Ouch, a five-year-old bug... well, thank you for the info and the quick reply!

moriakijp · 2021-01-28T19:32:54Z

Is there any progress on this issue?

snshn · 2021-01-29T04:05:08Z

That's the next big patch I'm currently working on. What encoding is not working in your case, @moriakijp?

snshn · 2021-02-24T09:52:27Z

The fix is now in master. From what I know so far, html5ever (the HTML parser module) parses everything as Unicode, and when monolith re-assembles the document, the original encoding information is lost: the document becomes UTF-8 no matter what charset it originally had. The current solution is to set all meta charset tags to indicate that it's a UTF-8 document. It's not ideal, but it seems to work fine for archiving documents, even though they lose their original encoding.
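
The idea of forcing the declared charset can be sketched in a few lines of Python. This is not the code from the actual patch: it uses a regular expression on the serialized HTML rather than a parsed DOM, and the names `META_CHARSET` and `force_utf8_charset` are invented for this example.

```python
import re

# Matches charset=<value> inside a <meta ...> tag, with or without quotes.
# Covers both <meta charset="..."> and the http-equiv Content-Type form.
META_CHARSET = re.compile(
    r"""(<meta\b[^>]*?\bcharset\s*=\s*["']?)[^\s"';>]+""",
    re.IGNORECASE,
)


def force_utf8_charset(html_text: str) -> str:
    """Rewrite every meta charset declaration to say utf-8 (illustrative sketch)."""
    return META_CHARSET.sub(lambda m: m.group(1) + "utf-8", html_text)


legacy = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
print(force_utf8_charset(legacy))
# <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
```

For the declaration to be truthful, the rewritten document would then also have to be written to disk as UTF-8 bytes.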

snshn added the bug label (Bugs and defects: faults of monolith, not target websites) on May 25, 2020

snshn mentioned this issue Feb 24, 2021

Forcefully set document's charset to UTF-8 #245

Merged

snshn closed this as completed Feb 24, 2021
