Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance: larger MDX files are unmanagably slow to parse #113

Open
robsimmons opened this issue Apr 17, 2024 · 6 comments
Open

Performance: larger MDX files are unmanagably slow to parse #113

robsimmons opened this issue Apr 17, 2024 · 6 comments

Comments

@robsimmons
Copy link

robsimmons commented Apr 17, 2024

Bad news @wooorm - micromark/micromark#169, or something like it, is an issue here as well - I haven't yet grokked your edit maps, but either they don't solve the quadratic-complexity parsing problems or there's a separate performance bug in markdown-rs.

The constant factors are better, but the asymptotic complexity means that we're back to 60-second parse times on files that are just an order of magnitude bigger than the ones that caused 60-second parses in micromark-js.

For comparison, the "JS" lines on these graphs show micromark's performance using subtokenize 2.0.1, which picks up the fix in micromark/micromark#171.

parse time (ms) vs  size(3)

log-log parse time (ms) vs  size(3)

Data collection sources

Run with node main.mjs and cargo run --release

package.json

{
  "dependencies": {
    "mdast-util-from-markdown": "^2.0.0",
    "micromark-extension-mdxjs": "^3.0.0"
  }
}

main.mjs

import { fromMarkdown } from "mdast-util-from-markdown";
import { mdxjs } from "micromark-extension-mdxjs";

/* Generate test data */
const NUM_LINES_IN_CODE_SEG = 10000;
function generateTestData(size) {
  let file = [];
  for (let i = 0; i < size; i++) {
    file.push("", "<DummyComponent code={`");
    for (let j = 0; j < NUM_LINES_IN_CODE_SEG; j++) {
      file.push(crypto.randomUUID());
    }
    file.push("`} />", "");
  }
  return file.join("\n");
}

/* Return the number of milliseconds need to parse a test case */
function testPerformance(size) {
  const file = generateTestData(size);
  const start = performance.now();
  fromMarkdown(file, { extensions: [mdxjs()] });
  const end = performance.now();
  return end - start;
}

for (let reps = 0; reps < 3; reps++) {
  for (let size = 4; size < 40; size = Math.floor(size * 1.35)) {
    console.log(`${size}, ${testPerformance(size)}`);
  }
}

Cargo.toml

[package]
name = "tmp"
version = "0.1.0"
edition = "2021"

[dependencies]
markdown = "1.0.0-alpha.16"

[dependencies.uuid]
version = "1.8.0"
features = [
    "v4",                # Lets you generate random UUIDs
    "fast-rng",          # Use a faster (but still sufficiently random) RNG
]

src/main.rs

const NUM_LINES_IN_CODE_SEG: i32 = 10000;

fn main() -> Result<(), String> {
    for _ in 0..3 {
        let mut reps = 3;
        loop {
            reps = ((reps as f64) * 1.35) as i32;
            if reps > 40 {
                break;
            }
            let mut result = String::new();
            for _ in 1..reps {
                result.push_str("\n<DummyComponent code={`\n");
                for _ in 0..NUM_LINES_IN_CODE_SEG {
                    result.push_str(&uuid::Uuid::new_v4().to_string());
                    result.push('\n');
                }
                result.push_str("`}/>\n\n");
            }
            let start = std::time::Instant::now();
            let mdast = markdown::to_mdast(&result, &markdown::ParseOptions::mdx())?;
            let end = std::time::Instant::now();
            let duration = (end - start).as_millis();
            println!("{},{},{:?}", reps, duration, mdast.position());
        }
    }
    Ok(())
}
@robsimmons
Copy link
Author

Figured out how to get Rust profiling working, and 99% of the time for that benchmark is spent in the add_impl function of src/util/edit_map.rs - it seems like this loop is creating the same quadratic behavior as the repeated slice operations in the JavaScript implementation. Maybe the edit map concept can be modified to avoid this quadratic repeated-splices, but right now it seems like it's merely delaying the quadratic repeated-splices.

@wooorm
Copy link
Owner

wooorm commented Apr 17, 2024

Definitely possible that there are performance improvements to be made in rust and edit maps too! Cool that you’re investigating! Note that here you are specifically also generating strings that relate to JSX and SWC. Might be that there are things happening there.

@robsimmons
Copy link
Author

robsimmons commented Apr 17, 2024

That was worth investigating. What's SWC?

If I comment out

result.push_str("\n<DummyComponent code={`\n");

and

result.push_str("`}/>\n\n");

and change markdown::ParseOptions::mdx() to markdown::ParseOptions::default()), then I'm literally just parsing a file of line-separated UUIDs. The performance is actually worse, though only by a constant factor:

parse time (ms) vs  size(5)
log-log parse time (ms) vs  size(5)

@ChristianMurphy
Copy link
Collaborator

What's SWC?

Speedy Web Compiler https://github.com/swc-project/swc
It handles much of the JS/JSX parsing inside MDX

@robsimmons
Copy link
Author

Thanks for the Speedy Web Response @ChristianMurphy. Yeah, this and the profiling result suggests that it's the edit maps generating the same repeated-splice behavior the JS implementation was dealing with.

@wooorm
Copy link
Owner

wooorm commented Apr 17, 2024

The performance is actually worse, though only by a constant factor:

That makes sense. Because then it has to parse paragraphs that are 10k lines long. There could be many things, links, emphasis, escapes, in there.
A JSX element itself is much simpler.
The expression you pass is somewhat complex too, as it’s JavaScript, and thus means 2 parsers have to work together.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants
@robsimmons @wooorm @ChristianMurphy and others