Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing files in remark-cli macOS pattern resolution #52

Closed
4 tasks done
karlhorky opened this issue Sep 28, 2024 · 23 comments
Closed
4 tasks done

Missing files in remark-cli macOS pattern resolution #52

karlhorky opened this issue Sep 28, 2024 · 23 comments
Labels
🙋 no/question This does not need any changes 👎 phase/no Post cannot or will not be acted on

Comments

@karlhorky
Copy link
Contributor

karlhorky commented Sep 28, 2024

Initial checklist

Affected packages and versions

[email protected], [email protected], [email protected]

Link to runnable example

https://github.com/karlhorky/repro-remark-lint-no-dead-urls-folder-patterns

Steps to reproduce

  1. Check out the reproduction at https://github.com/karlhorky/repro-remark-lint-no-dead-urls-folder-patterns
  2. Check out the failing test run on macos-latest (GitHub Actions workflow file) and observe the missed files in patterns remark . ..., remark ./src/pages/ ..., remark src/pages/**/*.mdx ... etc on macOS (only paths to singular files seem to be working on macOS)

This doesn't happen on ubuntu-latest

This didn't happen in versions before [email protected]

Runtime: Node.js latest LTS (20.17.0)

Expected behavior

Patterns are resolved to multiple files

Actual behavior

Patterns are resolved to a single file or zero files

Alternatives Considered

Maybe this is a general problem in remark-cli upstream? (eg. maybe this issue should be transferred to remarkjs/remark issues instead?)

Runtime

Other (please specify in steps to reproduce)

Package manager

pnpm

OS

macOS

Build and bundle tools

No response

@github-actions github-actions bot added 👋 phase/new Post is being triaged automatically 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels Sep 28, 2024
@ChristianMurphy
Copy link
Member

ChristianMurphy commented Sep 28, 2024

Hey @karlhorky! 👋
remark-cli by default only matches markdown extensions.
https://github.com/remarkjs/remark/blob/97e6f149976e486b5008df073d27f0507d0e58a4/packages/remark-cli/cli.js#L32
the list of default extensions is https://github.com/sindresorhus/markdown-extensions/blob/797fc3dda4ecc9aab668bfdd745f570593f1581d/index.js#L4-L11

If you want to include MDX add the --ext mdx option to your matcher commands.
Hope this helps!


This doesn't happen on ubuntu-latest

I suspect this is globstar related.
Try setting

shopt -s globstar

On your mac shell.


This didn't happen in versions before [email protected]

This shouldn't be possible.
This plugin isn't path aware.
I strongly suspect you updated remark-cli at the same time.


In addition to address potential follow ups:

@ChristianMurphy ChristianMurphy closed this as not planned Won't fix, can't repro, duplicate, stale Sep 28, 2024
@ChristianMurphy ChristianMurphy added the 🙋 no/question This does not need any changes label Sep 28, 2024

This comment has been minimized.

@github-actions github-actions bot added 👎 phase/no Post cannot or will not be acted on and removed 🤞 phase/open Post is being triaged manually labels Sep 28, 2024
@karlhorky
Copy link
Contributor Author

karlhorky commented Sep 28, 2024

This didn't happen in versions before [email protected]

This shouldn't be possible. This plugin isn't path aware. I strongly suspect you updated remark-cli at the same time.

Nope, the PR only updated from [email protected] to [email protected]

Screenshot 2024-09-28 at 16 58 37

Screenshot 2024-09-28 at 16 57 44

Screenshot 2024-09-28 at 17 01 06

Been working for literally years before the bump to [email protected]

Can create a PR to my reproduction repo if you need hard proof.

@karlhorky
Copy link
Contributor Author

karlhorky commented Sep 28, 2024

If you want to include MDX add the --ext mdx option to your matcher commands.

Looks like --ext mdx is what is missing, thanks!

karlhorky/repro-remark-lint-no-dead-urls-folder-patterns#1

So it looks like, on macOS (without any further configuration or installing anything):

  1. ., ./src/pages/ and "src/pages/**/*.mdx" resolve all files correctly (incl. nested files) ✅
  2. src/pages/**/*.mdx (without quotes) does not resolve files correctly (misses nested files) ❌

So on macOS, the quotes are important for remark-cli

@ChristianMurphy
Copy link
Member

Nope, the PR only updated from [email protected] to [email protected]

@karlhorky your screenshot shows 8 files being updated, it's doing a lot more than just updating a dependency.

So on macOS, the quotes are important for remark-cli

When the glob isn't quoted it is handled by your shell.
You are using a bash shell without globstar enabled, if you want to use unquoted globs.
Enabled globstar

shopt -s globstar

@karlhorky
Copy link
Contributor Author

your screenshot shows 8 files being updated, it's doing a lot more than just updating a dependency.

Those are other files that I changed myself, also not related to updating any dependencies. It was me trying different options to fix the problem of files being missed. But I tested it first without those other unrelated changes.

@ChristianMurphy
Copy link
Member

I still strongly suspect that remark-cli/unified-engine/glob were updated through that PR.
Given that the search you did shows remark-cli is in the diff somewhere off screen from what you did a screenshot of, it is still the most likely case.

@karlhorky
Copy link
Contributor Author

karlhorky commented Sep 28, 2024

It was in the previous package.json screenshot, that was the 1 occurrence (the line was unchanged)

But maybe something else caused a different remark-lint version to be used, I can't be 100% certain of that.

Not in the diff of the PR though.

@karlhorky
Copy link
Contributor Author

karlhorky commented Sep 28, 2024

I also tried out options.extensions to see if I can avoid having --ext mdx in the CLI command, but it looks like that's not possible in a Remark config (maybe I have the wrong types for the Remark config):

remark-lint-no-dead-urls.mjs

/** @type {Omit<import('unified-engine').Options, 'processor'>} */
const config = {
  extensions: ['mdx'],
  plugins: [
    'remark-mdx',
    [

@ChristianMurphy
Copy link
Member

Looks like --ext mdx is what is missing, thanks!

I'm glad it works.
I'm a bit surprised it ever worked without that option.

I also tried out options.extensions to see if I can avoid having --ext mdx in the CLI command, but it looks like that's not possible in a Remark config (maybe I have the wrong types):

options.extensions is part of creating a CLI https://github.com/remarkjs/remark/blob/97e6f149976e486b5008df073d27f0507d0e58a4/packages/remark-cli/cli.js#L30-L43
It is not part of a configuration file.

The type for a config is Preset https://github.com/unifiedjs/unified-engine/blob/main/readme.md#preset

@karlhorky
Copy link
Contributor Author

Looks like --ext mdx is what is missing, thanks!

I'm glad it works. I'm a bit surprised it ever worked without that option.

Yeah, maybe what the old version of remark-cli did was read the pattern for .<extension> and add that to the extensions option...

@karlhorky
Copy link
Contributor Author

karlhorky commented Sep 28, 2024

The type for a config is Preset unifiedjs/unified-engine@main/readme.md#preset

Thanks! I'll take a look at using this type instead. Does it also have the frail boolean property? It's not documented at that section... But maybe it's somewhere else, eg. under import('unified').Settings

Edit: hm... import('unified').Settings is a very permissive type:

  /**
   * Configuration passed to a Plugin or Processor
   */
  interface Settings {
    [key: string]: unknown
  }

Would be great to have some type checking for typos like fraiil: true 🤔 Also, seems that it's not happy with top-level frail property

/** @type {import('unified-engine').Preset} */
const config = {
  settings: {
    frail: true,
    fraiil: true, // 💥 No error on typo
  },

  frail: true, // 💥 Object literal may only specify known properties, and 'frail' does not exist in type 'PresetSupportingSpecifiers'.

Edit 2: I think that's why I went with the original type of Omit<import('unified-engine').Options, 'processor'> - because it has the frail property and other things too:

https://github.com/unifiedjs/unified-engine/blob/0266c1abe3c54b736bae3f73b9f235083b466736/lib/index.js#L72-L74

So I think I'll stick with Omit<import('unified-engine').Options, 'processor'> for now, after I confirm that frail: true at top level is working...

Edit 3:

It seems that frail: true is not a thing for a Remark config file, doesn't trigger the same behavior as the --frail config flag. So I'll switch back to the import('unified-engine').Preset type.

@ChristianMurphy
Copy link
Member

hm... import('unified').Settings is a very permissive type:

Settings are per-pipeline, and depend a lot on the parser and compiler used.
There also can be multiple distinct pipelines in a project, so attaching the type to a globally shared setting interface (similar to how mdast and hast can be extended with new types), would be less viable than the node type scenario.
I suspect trying to infer these will run into similar issues as unifiedjs/unified#221

@karlhorky
Copy link
Contributor Author

karlhorky commented Sep 28, 2024

Weird, so in my reproduction repo (ubuntu-latest), it's finding the files in all cases except src/pages/**/*.mdx without quotes

But then in our private monorepo on GitHub Actions (many more links), it succeeds silently without any output, although I know that there are broken links contained in some of the files:

$ remark "./src/pages/**/*.mdx" --ext mdx --frail --rc-path remark-lint-no-dead-urls.mjs
Done in 57.25s.

The broken links show up on my local macOS development environment though 🤔

$ remark "./src/pages/**/*.mdx" --ext mdx --frail --rc-path remark-lint-no-dead-urls.mjs
src/pages/x/n/d/thank-u-next/index.mdx
486:3-486:116 warning Link to https://chrome.google.com/webstore/detail/wildfire/djhgeeodemlfdpmcccdekfalbhllcoim/ is dead                    no-dead-urls remark-lint

⚠ 24 warnings
error Command failed with exit code 1.

I've tried patterns with ./src/pages/, ./src/pages, all with the --ext mdx ... seems like there's few differences left that I can think of (except argument order in the command, which I'm assuming doesn't matter).

I also thought it could be a path length issue, but that didn't cause the repro to stop working:

Would be cool to be able to show the full list of URLs tested, even if they are passing - I tried remark --verbose for this, but it didn't change anything about the output.

Maybe that's a feature request for remark-lint-no-dead-urls though...

@ChristianMurphy
Copy link
Member

it's finding the files in all cases except src/pages/**/*.mdx without quotes

Even on ubuntu, this is controlled by your shell.
The distro being used may not have globstar enabled.
run:

shopt -s globstar

But then in our private monorepo on GitHub Actions (many more links), it succeeds silently without any output, although I know that there are broken links contained in some of the files:

Check how you are running the script, and what the current wording directory cwd is.
It sounds like you have the script pointed to the wrong folder.

Would be cool to be able to show the full list of URLs tested

I'm pretty sure the issue is folder related, rather than URL.
Adding this feature wouldn't really help you debug the issue you are having.

@karlhorky
Copy link
Contributor Author

karlhorky commented Sep 28, 2024

Interesting:

  1. with a single file path (eg. src/a/b/c/d/index.mdx), it works in the monorepo, showing the error message (although it also puts out the file content to stdout, maybe this is a standard behavior of remark-cli with a single file?)

Run time: 5m 20s

$ remark src/pages/tech-fundamentals-foundations-immersive/modules/thank-u-next/index.mdx --ext mdx --frail --rc-path remark-lint-no-dead-urls.mjs
***

title: "Thank U, Next"
sortOrder: 11
curriculumModuleCategory: '{"id":2,"title":"Lectures","symbol":"🧑‍🏫"}'
------------------------------------------------------------------------

# Thank U, Next

... more file content here ...

...

486:3-486:116 warning Unexpected dead URL `[https://chrome.google.com/webstore/detail/wildfire/djhgeeodemlfdpmcccdekfalbhllcoim/`](https://chrome.google.com/webstore/detail/wildfire/djhgeeodemlfdpmcccdekfalbhllcoim/%60), expected live URL                                                                                                                                             no-dead-urls remark-lint
  [cause]:
              error   Unexpected not ok response `404` (`Not Found`) on `[https://chromewebstore.google.com/detail/wildfire/djhgeeodemlfdpmcccdekfalbhllcoim/`](https://chromewebstore.google.com/detail/wildfire/djhgeeodemlfdpmcccdekfalbhllcoim/%60)                                                                                                                                   dead         dead-or-alive

⚠ 13 warnings
error Command failed with exit code 1.
  1. it starts failing as soon as I remove the index.mdx (eg. src/a/b/c/d/), although I think it's still linting the files, because it still takes a long time, just doesn't show the output and doesn't exit with code 1

Run time: 10m 27s

$ remark src/pages/tech-fundamentals-foundations-immersive/modules/thank-u-next/ --ext mdx --frail --rc-path remark-lint-no-dead-urls.mjs
Done in 627.86s.

Not sure why it would take twice as long though, since there is only the index.mdx file in that directory...

@ChristianMurphy
Copy link
Member

ChristianMurphy commented Sep 28, 2024

(although it also puts out the file content to stdout, maybe this is a standard behavior of remark-cli with a single file?)

It is, if you want to turn that off for single files use --no-stdout
https://github.com/unifiedjs/unified-args?tab=readme-ov-file#--stdout

Not sure why it would take twice as long though, since there is only the index.mdx file in that directory...

Usually when it takes significantly longer when using globs/patterns.
There is a hidden/ignored folder like node_modules, .git, or another tool that has a large number of files that would take time to scan.
Taking a long time does not mean it is looking at the folder you are expecting.

@karlhorky
Copy link
Contributor Author

karlhorky commented Sep 28, 2024

Update: I have it working with the folder path now (the one like src/a/b/c/d/ in 2 above)

I did this by removing a large amount of the content in the (long) MDX file.

So it appears that it's related to MDX content in the file - maybe something invalid or some link that is causing remark-cli / remark-lint / remark-lint-no-dead-urls to fail silently when there's more than 1 file specified (but does not fail when only 1 is specified)

@ChristianMurphy
Copy link
Member

ChristianMurphy commented Sep 28, 2024

So it appears that it's related to MDX content in the file

What indicates it is the MDX content specifically, over the length of the file?

Another worth while question, how many URLs are in that document?
And do any of them have rate limiting or DDoS protection applied?

The functional attempts to access all urls in the document simultaneously.

await Promise.all(
urls.map(async function (url) {

With a sufficiently large document you could be triggering your own rate limiting/DDoS protections.

when there's more than 1 file specified

Again, this plugin can literally only see a single file at a time

async function rule(tree, file, options) {

@karlhorky
Copy link
Contributor Author

karlhorky commented Sep 28, 2024

What indicates it is the MDX content specifically, over the length of the file?

I guess it could also be file length (485 lines), but I was not expecting that. To reiterate, the entire file content is not a problem when specifying a path directly to the file eg. remark src/a/b/c/d/index.mdx --ext mdx ...

It appears that the problematic bits are code blocks in the file.

Another worth while question, how many URLs are in that document? And do any of them have rate limiting or DDoS protection applied?

There are 60 URLs in the file, and blocking is not a problem when I use a file path directly to the file.

Again, this plugin can literally only see a single file at a time

Oh maybe I was misunderstanding earlier - are you saying that the directory path pattern I'm using above remark ./src/a/b/c/d/ --ext mdx ... won't work?

Again, this plugin can literally only see a single file at a time

Oh maybe I was misunderstanding earlier - are you saying that the directory path pattern I'm using above remark ./src/a/b/c/d/ --ext mdx ... won't work?

@ChristianMurphy
Copy link
Member

It appears that the problematic bits are code blocks in the file.

Do the code blocks contain a url attribute?

if ('url' in node && typeof node.url === 'string') {

Could you share an anonymized/non-proprietary version of the file?

Oh maybe I was misunderstanding earlier - are you saying that the directory path pattern I'm using above remark ./src/a/b/c/d/ --ext mdx ... won't work?

I'm saying that unified-engine handles passing files one at a time to the plugins.
The number of files shouldn't have an impact on how this plugin works.

To reiterate, the entire file content is not a problem when specifying a path directly to the file

@wooorm could that change the file meta data?
There is one block that depends on the file meta data in this project.

@karlhorky are you testing any relative links which would depend on this?

@karlhorky
Copy link
Contributor Author

karlhorky commented Sep 28, 2024

Interesting, it seems the behavior is "flaky" - when running the command below, sometimes remark-lint-no-dead-urls exits with no output and a 0 exit code, and then other times, it shows the broken links output.

remark src/pages/tech-fundamentals-foundations-immersive/modules/thank-u-next/index.mdx --ext mdx --frail --rc-path remark-lint-no-dead-urls.mjs

The one Chrome extension link https://chrome.google.com/webstore/detail/wildfire/djhgeeodemlfdpmcccdekfalbhllcoim/ is definitely broken always though, so not sure why it would ever report no output with a 0 exit code.

Here is an example of some of the HTML that we have in the code blocks (unusual HTML 3.2 code, teaching about legacy HTML):

16. Right click on the new `legacy` folder and create a new file called `index.html` with the following content:

    ```html
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
    <html>
    <head>
      <title>TodoMVC</title>
    </head>
    <body>
    <script>
    function rerender() {
        frames[1].location.reload();
    }

    function click(id) {
        for (var i = 0; i < state.numTodos; i++) {
            if (state.todos[i].id == id) {
                state.todos[i].complete = !state.todos[i].complete;
                rerender();
                return;
            }
        }
    }

    function addTodo(text) {
        var nextId = state.nextId;
        state.nextId++;
        state.todos[state.numTodos] = makeTodo(nextId, text, false);
        state.numTodos++;
        rerender();
    }

    function clearCompleted() {
        var newTodos = new Object();
        var newNum = 0;
        for (var i = 0; i < state.numTodos; i++) {
            if (!state.todos[i].complete) {
                newTodos[newNum] = state.todos[i];
                newNum++;
            }
        }
        state.todos = newTodos;
        state.numTodos = newNum;
        rerender();
    }

    function filter(newFilter) {
        state.filter = newFilter;
        rerender();
    }

    function makeTodo(id, label, complete) {
        var todo = new Object();
        todo.id = id;
        todo.label = label;
        todo.complete = complete;
        return todo;
    }

    var state = new Object();
    state.filter = 'all';
    state.todos = new Object();
    state.numTodos = 0;
    state.nextId = 1;
    </script>
    <FRAMESET rows="75,*" border="0" frameborder="no" framespacing="0">
        <FRAME src="header.html" scrolling="no">
        <FRAME name=content src="todos.html" scrolling="no">
    </FRAMESET></body>
    </html>
    ```

Parts of the code are also highlighted red in the HTML highlighting in VS Code, because of the invalid / legacy code:

Screenshot 2024-09-29 at 00 42 08

Maybe I should try running it again 5 or 10 times with the direct path to the file, with unchanged file content - maybe [email protected] just has some generally flaky behavior, and it has nothing to do with the path containing the /index.mdx on the end or just a / ... 🤔

Edit: Since I saw that there were undici paths in the warning / error messages from remark-lint-no-dead-urls, I checked into that, and I found that I had a slightly outdated version in my tree ([email protected] instead of [email protected]). I re-generated my lockfile and am retesting now... first test run is showing failures now. So if this leads somewhere, it opens a new possibility - maybe it is a bug that's already fixed in a transitive dependency...

Edit 2: No, upgrading to [email protected] didn't seem to resolve the problem. [email protected] is still intermittently, silently failing with no stdout (also across different test runs when making no content changes or path changes). I think I'll hold off on upgrading for now, not sure there's much else I can do to debug without diving into the code of [email protected] vs [email protected]

@karlhorky
Copy link
Contributor Author

Ok, that was actually easier than I thought to reproduce - just run the lint command multiple times in the GitHub Actions workflow and it immediately shows the intermittent false negatives / flaky behavior:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🙋 no/question This does not need any changes 👎 phase/no Post cannot or will not be acted on
Development

No branches or pull requests

2 participants