Feat/mini parse 2 alpha#62

Draft
stefnotch wants to merge 7 commits into feat/mini-parse-rework from feat/mini-parse-2-alpha
@stefnotch stefnotch commented Feb 3, 2025

tl;dr: I finally got to try out all my mini-parse ideas. 🎉 I am now wondering which of these are worth keeping, and which ones are not.

I tried writing a mini-parse library which

  • keeps track of the input stream type
    • This gives us a typed variation of token(kind: Kind, value: string)
    • In winnow-land, this design also lets them generically operate on strings or binary streams. This isn't useful for us.
    • I am using this for the span combinator, but I'm not convinced that that design is good.
    • tl;dr: It's nifty, but I don't care too much about this.
  • keeps track of whether a parser can backtrack
    • This is used to almost always guarantee "no backtracking". E.g. seq2(tryToken("keyword", "import"), token("symbol", ";")) has the semantics "only the first parser in the sequence can backtrack", so only that part gets an if (result == null) return null; check.
    • or parsers assert that their children must be capable of backtracking. Otherwise they'd be useless children.
    • This made me realize that the imports grammar, as written, actually requires a two-token lookahead.
    • However, it makes writing parsers a lot more verbose. See ImportGrammar.ts.
    • It should, in theory, make debugging parsing failures easier.
    • tl;dr: I think it's super neat that this is possible. I am, however, not convinced that it's what we want for our implementation.
  • has a public _run method called parseNext. The intent is that users can write parsers in an alternative style, see Parser2.test.ts. This lets us hand-write parsers and parsing logic for hot paths of the code.
  • has way less overhead
    • Yeah, that is useful. I wonder why its overhead is so much lower.
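
To make the backtracking bullet concrete, here is a minimal sketch of how "can this parser backtrack" could live in the types. It is not the code from this PR: the Lexer, TryParser and CommittedParser shapes are hypothetical, and only the names tryToken, token, seq2, or and parseNext are borrowed from the description above. It assumes a committed parser throws on a mismatch instead of returning null.

```typescript
// Hypothetical minimal types, for illustration only (not the PR's implementation).
type Token = { kind: string; value: string };

class Lexer {
  pos = 0;
  constructor(readonly tokens: Token[]) {}
  next(): Token | undefined { return this.tokens[this.pos++]; }
  checkpoint(): number { return this.pos; }
  reset(cp: number): void { this.pos = cp; }
}

// A backtracking parser may return null, and leaves the lexer untouched when it does.
interface TryParser<T> { canBacktrack: true; parseNext(lexer: Lexer): T | null; }
// A committed parser never returns null; in this sketch it throws on a mismatch.
interface CommittedParser<T> { canBacktrack: false; parseNext(lexer: Lexer): T; }

function tryToken(kind: string, value: string): TryParser<Token> {
  return {
    canBacktrack: true,
    parseNext(lexer) {
      const before = lexer.checkpoint();
      const t = lexer.next();
      if (t !== undefined && t.kind === kind && t.value === value) return t;
      lexer.reset(before); // undo the consumed token so the caller can try something else
      return null;
    },
  };
}

function token(kind: string, value: string): CommittedParser<Token> {
  return {
    canBacktrack: false,
    parseNext(lexer) {
      const t = lexer.next();
      if (t === undefined || t.kind !== kind || t.value !== value) {
        throw new Error(`expected ${kind} "${value}"`);
      }
      return t;
    },
  };
}

// Only the first child can backtrack, so only it gets a null check.
function seq2<A, B>(first: TryParser<A>, second: CommittedParser<B>): TryParser<[A, B]> {
  return {
    canBacktrack: true,
    parseNext(lexer) {
      const a = first.parseNext(lexer);
      if (a === null) return null;
      return [a, second.parseNext(lexer)];
    },
  };
}

// `or` only accepts children that can backtrack; or(token(...), ...) is a type error.
function or<A, B>(a: TryParser<A>, b: TryParser<B>): TryParser<A | B> {
  return {
    canBacktrack: true,
    parseNext(lexer) {
      return a.parseNext(lexer) ?? b.parseNext(lexer);
    },
  };
}

const importStmt = seq2(tryToken("keyword", "import"), token("symbol", ";"));
const matched = importStmt.parseNext(
  new Lexer([{ kind: "keyword", value: "import" }, { kind: "symbol", value: ";" }]),
);
const missLexer = new Lexer([{ kind: "keyword", value: "export" }]);
const missed = importStmt.parseNext(missLexer); // null, and missLexer.pos is back at 0
```

The verbosity cost mentioned above shows up here too: every combinator needs a variant (or overload) per combination of backtracking and committed children.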

To try it out, I then wrote a parser combinator which calls the new implementation. And then I rewrote the imports grammar to use the new implementation.
Benchmarks are on Discord, but the rough results are that the perf could go from wgsl-linker LOC/sec: 33.229 to wgsl-linker LOC/sec: 123.075.

The unit tests are failing, and that's fine.

stefnotch commented Feb 3, 2025

For tracking spans, I can think of a few different options:

  • Add previousToken to checkpoints => track first token with a custom lexer, last token is what checkpoint says
  • Add the rule that parseNext always leaves you at the end of a token. => track first token with a custom lexer, end is checkpoint
  • Add the rule that you are always at the beginning of a token, and add previousToken to checkpoints => [checkpoint().token.span[1], checkpoint()]. Could also allow for another optimisation
  • Or not having a span() combinator, and manually building it up from the info that is already present in the tokens.

What does not work:

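  • Only using checkpoints, because of whitespace: a checkpoint can sit in front of ignored whitespace rather than at the start of the next token.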
  • Storing a "previous token", because peek + reset would invalidate that
  • Splitting parseNext into "skipIgnored and parseNext". parseNext would always leave you at the end of a token, skipIgnored would always bring you to the start of a token. => checkpoints are spans. However, I wouldn't know whether a child parser already called skipIgnored, so the checkpoint might not be reliable.
  • Just adding "prevTokenEnd" and "nextTokenStart" to the API, because .reset() would invalidate the next token and force us to inefficiently recompute it. There are more efficient variations above.

stefnotch commented Feb 4, 2025

I'm picking the option where parseNext always leaves you at the end of a token, and the span combinator does a peek() (peeking is done as const before = lexer.checkpoint(); const s = lexer.peek().span[0]; lexer.reset(before);).
Then I'll make sure that peeking is optimized (it'll get used a lot) and, more importantly, it's a zero-cost abstraction!
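
A rough sketch of what that could look like, under a few assumptions: the Lexer below is a hypothetical stand-in (regex-based, whitespace is the only ignored input), parsers are plain functions, and the peek is done with next() between checkpoint() and reset() rather than a dedicated peek(). The span start comes from peeking at the next token, the span end is simply the checkpoint after the inner parser has run.

```typescript
type Span = [start: number, end: number];
type Token = { kind: string; text: string; span: Span };

// Hypothetical lexer: words and single-character symbols, whitespace is ignored.
class Lexer {
  private pos = 0;
  constructor(private readonly src: string) {}
  checkpoint(): number { return this.pos; }
  reset(cp: number): void { this.pos = cp; }
  /** Skips whitespace, consumes one token, and leaves the position at that token's end. */
  next(): Token | null {
    const re = /\s*(\w+|\S)/y; // sticky: must match exactly at lastIndex
    re.lastIndex = this.pos;
    const m = re.exec(this.src);
    if (m === null) return null; // end of input (or only whitespace left)
    const end = re.lastIndex;
    const start = end - m[1].length;
    this.pos = end;
    return { kind: /^\w/.test(m[1]) ? "word" : "symbol", text: m[1], span: [start, end] };
  }
}

// Start offset of the next token, without consuming it.
function peekStart(lexer: Lexer): number {
  const before = lexer.checkpoint();
  const t = lexer.next();
  lexer.reset(before);
  return t === null ? before : t.span[0];
}

type Parser<T> = (lexer: Lexer) => T;

// Relies on the rule that a parser always leaves the lexer at the end of a token,
// so the checkpoint after running `inner` is the end of the span.
function span<T>(inner: Parser<T>): Parser<{ value: T; span: Span }> {
  return (lexer) => {
    const start = peekStart(lexer);
    const value = inner(lexer);
    return { value, span: [start, lexer.checkpoint()] };
  };
}

const lexer = new Lexer("import   foo ;");
lexer.next(); // consume "import"; the checkpoint is now 6, in front of the whitespace
const twoTokens: Parser<string[]> = (l) => [l.next()!.text, l.next()!.text];
const result = span(twoTokens)(lexer);
// result.span is [9, 14]: it starts at "foo", not at the checkpoint 6
```

This also shows why checkpoints alone are not enough for the start of a span: the checkpoint before "foo" is 6, while the token itself starts at 9.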

If it still ends up being slow, I can try this option:

  • Add the rule that you are always at the beginning of a token, and add previousToken to checkpoints => [checkpoint().token.span[1], checkpoint()]. Could also allow for another optimisation
