Skip to content
11 changes: 7 additions & 4 deletions packages/block-library/src/cover/index.php
Original file line number Diff line number Diff line change
Expand Up @@ -60,10 +60,13 @@ class="wp-block-cover__image-background"
esc_attr( $object_position )
);

$content = str_replace(
'</span><div',
'</span>' . $image . '<div',
$content
/* Inserts the featured image between the (1st) cover 'background' `span` and 'inner_container' `div`,
Copy link
Contributor

@azaozz azaozz May 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the featured image have to be inserted there? Who decided that and for what purpose? Could you please comment to make sure we're on the same page (I'm probably missing some considerations there).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably you didn't read the first line of comment to this PR.
It says:

For the block layout see: https://github.com/WordPress/gutenberg/blob/trunk/packages/block-library/src/cover/edit/index.js#L327-L367

Fell free to take a look at the history of that file: the who and their purpose cannot be really hidden. My 2¢? A henchman of his. 😄

Copy link
Contributor

@draganescu draganescu May 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@azaozz I decided that for not having a better way - as Gutenberg is missing a token system - to make the cover block get the featured image at render time.

If you look at the history of PRs in which I tried to solve the featured image in cover block problem you'll see there are a few constraints that arise from our lack of a token system which results in this kind of render string operations on the block markup.

It is not the only block where these operations occur and I think it also happens (e.g. gutenberg_restore_group_inner_container, render_block_core_image, render_block_core_post_featured_image, render_block_core_template_part, render_block_core_site_logo).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, my bad, should have explained this better. Was thinking that the way this works may have been simpler if the featured image was appended to the end of the block's HTML. The difference in the featured image's node position would have probably been pretty easy to account for with a little of CSS. Even maybe that could have been preferred in some cases as would have made it possible to determine when an image was inserted there.

and removes eventual withespace characters between the two (typically introduced at template level) */
$content = preg_replace(
'/(<span\b[^>]*wp-block-cover__background[\s|"][^>]*>.*<\/span>)\s*(<div\b[^>]*wp-block-cover__inner-container[\s|"][^>]*>)/U',
Copy link
Contributor

@azaozz azaozz May 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Processing" HTML with regex sucks no matter how you look at it. The more complex the regex, the more it sucks :)

This looks very fragile. Is there any chance at all to do it in a more compatible and stable manner? One thing that comes to mind is to prepend or append the <img> tag, not attempt to insert it in the middle of some HTML. Trying to "splice" it as the current code does would most likely keep introducing edge cases, will likely break if the block is ever changed by a plugin, and will require ongoing maintenance.

If prepending/appending is impossible, seems there are other options:

  1. Insert an HTML element as a placeholder in the editor. That way the specific location of the featured image can be determined much better, and the replacement code on the PHP side will be a bit more "bulletproof". (The placeholder can be removed when rendering the block and there's no featured image.)
  2. Reconstruct the block's HTML in PHP, i.e. make it really a dynamic block.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One note on this regex: there's a major jump in required complexity between matching an HTML tag and parsing HTML. Parsing HTML tags is simpler and possible with traditional RegExp patterns and knowledge, but what we've done by attempting to capture the <span> and the <div> together is parsing full HTML. A trivial example input shows how this can break down:

<span class="wp-block-cover__background"><span class="wp-block-cover__background">Something else</span><div class="wp-block-cover__inner-container">Something</span><div class="wp-block-cover__inner-container">

The .* really is problematic, and you probably don't want to have to maintain an appropriate RegExp pattern to do this properly. There's no good way to match that specific </span> with a RegExp, which is why in my example I matched the <div> with its class name, because that's very doable. That is, if we are trying to identify a specific tag based on some attribute, we can only do that by matching the opening tag (or a void tag if there are no children). If we want to match the end of a tag we only have a good option to match "any closing tag of this kind."

Also curious, why do we want to drop the match on the class attribute?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dmsnell

  1. Abut your curisity, take a look:

    /**
    * For themes without theme.json file, make sure
    * to restore the inner div for the group block
    * to avoid breaking styles relying on that div.
    *
    * @param string $block_content Rendered block content.
    * @param array $block Block object.
    * @return string Filtered block content.
    */
    function gutenberg_restore_group_inner_container( $block_content, $block ) {
    $tag_name = isset( $block['attrs']['tagName'] ) ? $block['attrs']['tagName'] : 'div';
    $group_with_inner_container_regex = sprintf(
    '/(^\s*<%1$s\b[^>]*wp-block-group(\s|")[^>]*>)(\s*<div\b[^>]*wp-block-group__inner-container(\s|")[^>]*>)((.|\S|\s)*)/U',
    preg_quote( $tag_name, '/' )
    );

    As you can see, no class= to match the div with class="wp-block-group", and the div withclass="wp-block-group__inner-container"

    And again:

    $replace_regex = sprintf(
    '/(^\s*<%1$s\b[^>]*wp-block-group[^>]*>)(.*)(<\/%1$s>\s*$)/ms',
    preg_quote( $tag_name, '/' )
    );

    No class=. And the matching of a closing tag. And .* between the two.

    As in our case, they had to put something between the two groups.

  2. With your HTML malformed every day usecase, putting a div inside a span:

<span><span>...</span><div>...</span><div>...

my regex would insert the featured image before the last div. While the regex you suggested before the first div.
So what?

It would seem you are maintaining at the same time:

  • "It's a bad practice trying to cotrol what comes before the opening div. That's why I was suggesting no control at all"
  • "I can't accept we are so permissive about what comes (within the span) before the div.We must absolutly control it"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. One thing that comes to mind is to prepend or append the tag, not attempt to insert it in the middle of some HTML.

I could be wrong, but I bet it could accomplished rewriting less than 10% of the entire Gutenberg!

  1. Insert an HTML element as a placeholder in the editor.

Easy: rewite the Cover Block (and probably the Image Block)!

  1. Reconstruct the block's HTML in PHP

More or less this is what has been done by @draganescu. Who has added such a cool feature to WP. I was so impacient that installed Gutenberg for the first time to try it. And I had the misfortune to discover some kind of bug. It's not a real bug. And it wasn't me to discover the nature of this pseudo 'bug'. We have to remove some spaces and line breaks. That's it. If yu have some code, let's do it!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@navigatrum first of all I want to say thank for working hard on this. please know that when I look at a proposed change my purpose and goal is to try and find the places where it could break down and so if I seem pessimistic it's because that's how I'm looking at the changes (and all changes for any PR).

Concerning the class question would it be right to understand that you are saying you removed it because there was other code in the project that also doesn't? If so, I wouldn't suggest we need to feel bound to match that. I'm not sure why it was left out there either, and it's not a critical thing, but it just seems surprising that we'd want to intentionally decouple the class name we're looking for from the class attribute we expect to find it in. 🤷‍♂️

my regex would insert the featured image before the last div. While the regex you suggested before the first div.
So what?

I might not have presented a great example because it looks malformed. In this example it should be more apparent that the .* will continue consuming input as long as it can while still matching the entire thing. Even a small change there could prevent a match from continuing past any unexpected tags (for example, if we use [^<]* instead of .* then we at least shouldn't break past unexpected tags that might sit between the <span> and the <div>).

We're probably somewhat shielded here because we're only being supplied the HTML from within the block itself (and also because we added the match on the class name), but if this were applied to the post (or if we were only looking at the tag names) we could easily get into trouble.

It would seem you are maintaining at the same time:

I think there's a misunderstand here. What I'm saying is that we should be aware that whitespace isn't significant in HTML in the way we were treating it and we should look for an approach that doesn't depend on a specific whitespace formatting. We have very few guarantees about how it will reach our code.

When moving to the more robust RegExp there was simply another point I wanted to raise about the risk .* introduces and how it can cause our matcher to accidentally consume significantly more than we expect it to (and perform much slower while doing it!)

It's unfortunate we have more cases where we've already introduced these risks but I believe that's a good enough reason to be on alert to help spot these before they get merged. RegExp patterns can be hard enough, let alone with the complexity of trying to parse HTML…


In summary I think there's several ways to move this forward without throwing much out. Replacing the .* addresses the biggest issue. The point about associating a start tag with its end tag isn't a demand; it's an acknowledgement of how this stuff works unless we build far more complicated patterns (purely due to the recursive nature of HTML).

Copy link
Contributor

@azaozz azaozz May 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the regex again, do you think the span tag matching can be removed? Doesn't seem needed as long as the opening div is matched. Even if there is a possibility to have the div without a span before it, thinking it would be good to insert the featured image before that div. This also removes the need to use .* and match white space between the nodes.

So the regex would be something like: '/<div\b[^>]+wp-block-cover__inner-container(\s|")[^>]*>/U'.
(Note I also changed * to + after the first [^>] as it has to match at least one char.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me the logic of the block is:
the image goes between a span with class "wp-block-cover__background" followed by a div with class "wp-block-cover__inner-container". If you remove one or the other, no image. What if I want to put something inside the span which in the react code is empty? Good luck!

To you and @dmsnell since regexes are problematic for HTML parsing, and for a better maintainability, the php porting of the react logic should be reduced to: the featured image goes before a div with class "wp-block-cover__inner-container". Do what you want before that div and good luck!

Or , dsmell variant: if we want to require both the span and the div, at that point we must require an empty span.

Now, just in case for @draganescu and @dmsnell '/<div\b[^>]+wp-block-cover__inner-container[\s|"][^>]*>/U' by @azaozz is acceptable, and given that

$inner_container_start = // above regex;

if ( 1 !== preg_match( $inner_container_start, $content, $matches, PREG_OFFSET_CAPTURE ) {
    // handle this, we can't find our div, maybe another plugin changed it
    return $content;
}

list( $fullmatch, $offset ) = $matches[0];
$content = substr( $content, 0, $offset ) . $image_tag . substr( $content, $offset );

shuold be acceptable by @dmsnell
I'm sending the 4th PR to trim the coverblock (in a few hours)
(Sorry if a wasted your time)

Copy link
Contributor

@draganescu draganescu May 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@navigatrum no wasted time at all, quite the reverse it will be a fruitful conversation :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this all seems good, though I should rephrase, as I never wanted to suggest we try and trap the <span>. here's the difference:

  • finding the start of the <div> with a RegExp is normal and doable and not problematic from a parsing standpoint. this is because it isn't parsing HTML at that point. it's only parsing HTML tokens, or lexing the HTML.
  • trying to identify the end of a specific <span> before the <div> is parsing HTML and that's the problematic part with the RegExp and that's the kind of thing we want to avoid (and it's almost always the case to be suspicious when we find ourselves adding .*)

seems like a nuance or a small thing, but there's a profound impact. my comments were more like, "if you are going to match the <span> too here's what you should be aware of).

I still don't understand why we don't want to match on class, but that's not a big issue to me since wp-block-cover__inner-container is unlikely to appear outside of where we expect it. note though, this does match a few quirky possibilities:

  • <div title="Demonstrating how to use the wp-block-cover__inner-container styles">
  • <div class="x-disregard-wp-block-cover__inner-container">
Feel free to ignore matching these, but if you are curious…
"~<div [^>]*\bclass=(\"|')((?!\1).)*\b{$CLASS}\b((?!\1).)*\1( |>)~"

This is almost the same as the one you gave above in your last comment. The main differences are that this:

  • specifically looks for the class attribute
  • makes sure we're only examining the content inside the class attribute by matching on the leading quote and trailing quote, ensuring they are the same, and ensuring we don't scan past the matching one with ((?!\1).)* though I think we could replace that with the simpler [^'"]* since I think quotes aren't allowed as class names.
  • requires that the class name we're searching for is not a part of a bigger string through the surrounding \b boundary markers (that's wrong, should be [\"' ] since the hyphen counts as a word boundary)
  • while we're not matching a closing HTML tag I'm still using ~ instead of / as the PCRE pattern delimiters so we wouldn't have to escape / as \/

with the simplification that would be this instead

"~<div [^>]*\bclass=([\"'])[^\"']*[\"' ]wp-block-cover__inner-container[\"' ][^\"']*\1~"

this is simpler than general HTML attribute parsing because of the rules of CSS class names. hopefully that fulfills your curiosity. feel free to use this or ask questions or ignore it.

"$1$image$2",
$content,
1
);

}
Expand Down