Skip to content
This repository has been archived by the owner on Apr 3, 2024. It is now read-only.

HTML-wrapped code blocks contents are parsed #363

Open
mgol opened this issue May 5, 2017 · 4 comments · May be fixed by #367
Open

HTML-wrapped code blocks contents are parsed #363

mgol opened this issue May 5, 2017 · 4 comments · May be fixed by #367

Comments

@mgol
Copy link

mgol commented May 5, 2017

GitHub's Markdown parser doesn't seem to parse backticks-contained text even when put inside of HTML; the following Markdown:

Some content

<div>`<style>`</div>

More content - invisible.

will render just fine while marky-markdown cuts everything off after the <style> part. This broke readme of Webpack's style-loader; see https://www.npmjs.com/package/style-loader & webpack-contrib/style-loader#227.

@revin
Copy link
Collaborator

revin commented May 5, 2017

Interesting case! Thanks for reporting 👍

What's happening here is, everything in the HTML block is being left alone, and since the block is basically malformed HTML, the second backtick gets interpreted as being style content inside the <style> tag, and then the HTML santizer is stripping out <style> elements, so you only end up with <div>`</div> HTML output.

I verified this in the live marky tester by changing <style> to <br>, <hr>, etc… Those tags are allowed by the sanitizer, so everything renders as HTML. If I turn the sanitize option off, the <style> makes it through OK, but since the < and > aren't escaped, the browser renders the tag as being invisible, like a normal <style> tag. Whew!

GitHub's behavior here is kind of fascinating. At least in a gist, <br> and <hr> inside backticks render the same way marky does—the tags just get interpreted as normal HTML. <style> renders as you saw above, but <body> just disappears, <head> disappears, <blargh> disappears…like they just render as consecutive backticks. In fact, we can even eliminate the backticks, and GitHub renders<style> as though it was escaped properly. Looks like the backticks are a red herring? 😳

At first blush, it almost looks like <style> is some kind of special case. I'm going to have to do a more exhaustive test to see which tags behave similarly, and we'll have to match behavior.

@revin
Copy link
Collaborator

revin commented May 5, 2017

I created a demonstration gist showing the rendering results of embedding different HTML tags. Looks like we need to special-case the following tags:

  • <iframe>
  • <script>
  • <style>
  • <textarea>
  • <title>

@mgol
Copy link
Author

mgol commented May 6, 2017

Great analysis! You're right backticks were a red herring, I stripped them all from your gist and results are the same.

@revin
Copy link
Collaborator

revin commented May 15, 2017

Looks like most of these are fairly straightforward to implement, but there are a couple things to note:

  • Our tests explicitly check that <script> is allowed when we're executed with sanitize: false. Nothing magic about <script> in particular; it's just an example of something that would be normally stripped out by the sanitizer. So I'm thinking maybe since turning the sanitizer off is, in a way, opting out of 100% strict GH compat, what about skipping this HTML escaping process in the case of sanitize: false?
  • The sanitizer is configured to strip iframes unless they're pointing to youtube URLs. IIRC we still need that capability because the npm docs have embedded YT vids. Is that still the case? It looks like GH compat is to always escape <iframe> tags in HTML blocks no matter what the src points to. Should we implement the GH version, and allow for the {YT-only, unescaped} version via some combination of options?

@ashleygwilliams any thoughts?

@revin revin added the question label May 15, 2017
@revin revin linked a pull request May 16, 2017 that will close this issue
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants