Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance of (c/plus (c/alt ...)) combinator vs (c/regexp #"[...]+") #207

Open
theronic opened this issue Oct 2, 2020 · 2 comments
Open

Comments

@theronic
Copy link

theronic commented Oct 2, 2020

I'm seeing an 80x-120x performance slowdown using the (c/plus (c/alt ...)) combinator to parse a string of special characters as compared to regex, (c/regexp #"[\a\b\c...]+").

In comparison, matching on the same string using Clojure Spec2 Alpha with (s/+ #{\a \b \c ...}) is only 20x slower than the regexp.

The combinators are so much more readable, composable and less error-prone than writing string-based regular expressions, so I'd love to use them, but the performance difference is significant.

Is there a way to speed up a set match, or write a manual matcher? It would be great if I could pass Instaparse any function and it eats whatever the function returns, unless it's a special :instaparse/invalid value.

Here are my Criterium benchmarks (done on i9 MBP16):

(let [test-input        "~|'[@:];|;{{}=:;<];~[^|{?"
       charset           #{\: \; \< \= \> \? \@,
                            \[ \] \\ \^ \_ \',
                            \{ \| \} \~}
      test-spec         (s/+ charset)                     ;; (aside: Spec2 will moan about fully-qualified symbol #bug)
      combinator-parser (insta/parser
                            {:S (c/plus (->> charset
                                          (map (comp c/unicode-char int))
                                          (apply c/alt)))}
                            :start :S)
      regex-parser      (insta/parser
                            {:S (regexp #"[\:\;\<=\>\?\@\[\]\\\^\_\'\{\|\}\~]+")} :start :S)]
    (crit/quick-bench                                       ;; uncomment ones we don't care about
      (s/conform test-spec (seq test-input)) ;; mean: 117.3us
      (insta/parse combinator-parser test-input) ;; mean: 614us
      (insta/parse regex-parser test-input))) ;; mean: 5.43us

(I also tried using the defparser macro, with similiar results.)

@Engelberg
Copy link
Owner

Generally anything that can be expressed as a regular expression will be way faster as a regular expression.

The regular expression has a huge advantage in that the correct interpretation can be determined locally by what immediately follows and nothing beyond.

Instaparse is aiming to be truly general across all types of grammars, so that means the correct alternative right now could potentially be determined by some character at the very end of the string. So instaparse has to keep open all the other possibilities so it can backtrack to this point later if need be.

Consider an alternative "a" or "aa" in a context of "aaabc". The correct match depends entirely on what kinds of rules make the rest of the string make sense. If there's a rule afterwards to match "abc", that's different than if there is a rule that matches "aabc".

Regexes just make a decision (and you have some control over whether it makes greedy or non-greedy choices with certain annotations), and can do some backtracking within its own local context, but it definitely doesn't have to worry about the context of the other rules in your grammar.

So, short answer is: use a regexp whenever that's appropriate, if you want best performance. Usually regexps are the best choice for the "tokens" in your grammar.

There's a deeper concept to be explored that maybe it would be interesting to allow the user to delimit some sort of scope for backtracking, so once you match up to a certain point, instaparse can throw away all the accumulated backtracking spots generated from unexplored alternatives. This would improve performance considerably, allowing instaparse's alternatives to compete more favorably with regexps, but it's complicated to figure out what the interface for that would look like, and complicated to implement. An interesting research project, but not one I have time to pursue in the near future.

@stereobooster
Copy link

stereobooster commented Jun 30, 2024

I have feature request related to this - provide some kind of syntax ({} maybe) to collapse sub-tree to string. So it would be similar to regex behavior:

S = {'a'+}

would produce the same parse tree as

S = #'a+'

There's a deeper concept to be explored that maybe it would be interesting to allow the user to delimit some sort of scope for backtracking, so once you match up to a certain point, instaparse can throw away all the accumulated backtracking spots generated from unexplored alternatives.

Related to this idea of Cuts #131 (comment)


Usually regexps are the best choice for the "tokens" in your grammar.

In Spoofax it is called Lexical syntax

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants