
Performance problem #131

Open
russellw opened this issue Apr 12, 2016 · 9 comments

Comments

@russellw

There seems to be a performance problem, though I'm not sure whether it's intrinsic or something I'm doing wrong:

http://stackoverflow.com/questions/36572997/any-way-to-speed-up-instaparse

@Engelberg
Owner

There are a bunch of performance notes here: https://github.com/Engelberg/instaparse/blob/master/docs/Performance.md

Usually an instaparse parser performs linearly with respect to the size of the input up to some point at which it exhausts memory and starts thrashing the garbage collector. Usually this point where it breaks down is when the input gets somewhere between 20k-200k, depending on the complexity of the grammar. (If the parser performance isn't linear on small inputs, then the culprit is usually ambiguity in the grammar.)
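The ambiguity check mentioned above can be done directly with `insta/parses`, which returns every parse tree rather than just the first. A minimal sketch (the grammar here is a deliberately ambiguous toy example, not from the thread):

```clojure
(require '[instaparse.core :as insta])

;; Deliberately ambiguous: an "aaa" input can be split between the two
;; A's in more than one way, so more than one parse tree exists.
(def ambiguous
  (insta/parser
    "S = A A
     A = 'a' | 'a' 'a'"))

;; insta/parses returns a lazy seq of *all* parse trees; a count above 1
;; means the grammar is ambiguous for this input, which is a common cause
;; of non-linear parse times.
(count (insta/parses ambiguous "aaa"))
```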

So a 700k file is really pushing the limits of what instaparse is good for. The reason is that instaparse has really robust backtracking, so it has to consider the possibility that the very last character of the file may force it to reinterpret the way that the entire file is matched against the grammar, conceivably backtracking all the way back to the beginning if necessary. This means maintaining a history of every decision point made for the entire 700k of parsing, which is quite a bit to track.

If the 700k file is comprised of a bunch of individual records, and it is easy to identify the boundaries, the best thing to do is to chop them up into individual records and pass these smaller strings to your instaparse parser.

You can also try to use the :optimize :memory flag (see the Performance doc for more info), which attempts to automate the above strategy, throwing away backtracking history each time a chunk is successfully parsed. That strategy doesn't work on all grammars, but I think with your grammar, it probably would. It's certainly worth a try.

If you get a chance to try the :optimize :memory flag, I'd be interested to know whether it helped. When I first released instaparse, I was mainly thinking in terms of parsing small-ish strings and source-code-sized files (which tend to be under 20k). So it is useful for me to get good test samples of large files in the wild, so I can identify and implement strategies which allow instaparse to perform well for those sizes. :optimize :memory was my first foray in that direction.
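The `:optimize :memory` flag described above is passed as an option to `insta/parse`. A minimal sketch, assuming a hypothetical grammar whose top-level rule is a repetition of records (the shape this optimization exploits):

```clojure
(require '[instaparse.core :as insta])

;; Hypothetical grammar: a file is a sequence of newline-terminated
;; records. This repeating top-level shape is what lets instaparse
;; discard backtracking history after each successfully parsed chunk.
(def records
  (insta/parser
    "file = record+
     record = #'[^\\n]*\\n'"))

;; Same call as a normal parse, plus the :optimize :memory option.
(insta/parse records "first\nsecond\nthird\n" :optimize :memory)
```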

@Engelberg
Owner

BTW, here's a simple example illustrating my point about backtracking. Imagine the following grammar:

S = Option1 | Option2
Option1 = 'a'+
Option2 = 'a'+ 'b'

If you imagine matching this grammar against a 700k string of all a's up until the final 'b', instaparse would first try matching using Option1, and it would hum along merrily until it hit the final 'b', at which point it has to backtrack all the way to the beginning and try Option2.
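The grammar above can be written verbatim as an instaparse parser to see the behavior (a short input is used here; the thread's point is that the backtracking cost grows with input length):

```clojure
(require '[instaparse.core :as insta])

;; The grammar from the comment above, verbatim.
(def p
  (insta/parser
    "S = Option1 | Option2
     Option1 = 'a'+
     Option2 = 'a'+ 'b'"))

;; A run of a's ending in 'b': instaparse tries Option1 first, consumes
;; every 'a', fails at the final 'b', and must backtrack to the start
;; to retry with Option2.
(insta/parse p (str (apply str (repeat 10 "a")) "b"))
```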

This is why most parsing engines require you to write a LL(k) grammar which can determine with certainty what rule to use based on looking ahead at most k characters.

Beyond :optimize :memory, one of my ideals would be to allow the programmer to say "I know this is an LL(1) grammar" or "I know this can be parsed correctly with a packrat strategy", and use an even more efficient underlying strategy. This would allow someone to use the same convenient grammar description and toggle instantly between a bunch of different engines to see what works, with the current engine representing the most flexible parsing strategy.

@russellw
Author

Ah! Okay, I'll switch to handwritten parsers then - I have some files up to hundreds of megabytes to deal with. But can you put a note maybe near the start of the readme that instaparse is only intended for small files?


@chrismurrph

chrismurrph commented Aug 28, 2016

I'm pretty sure I've hit the limit. One more rule and running the parser thrashes my machine, possibly forever, never returning. Using :optimize :memory fixed it. However, because this option is only available for insta/parse (not insta/parses), I now lose the ability to detect multiple parse possibilities as soon as they happen. I think the solution I'll end up with is what you recommend: doing a high-level root parse at the beginning and handing off various bits to more specific parsers. Some of these more specific parsers will be used multiple times.
I've embarked on that root-level parser strategy a bit. The problem I'm finding is that it is difficult for me to parse through 'everything except "END_ROUTINE"'. Is it advisable to break up the original source file with hand-written code that just finds the locations of particular text markers, say using index-of and subs?

I used index-of and subs and was able to break it up. Smaller parsers will be easier to maintain.
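The index-of/subs approach mentioned above can be sketched as follows (the BEGIN_BLOCK/END_BLOCK markers come from the thread; the sample input and the `blocks` helper are hypothetical):

```clojure
(require '[clojure.string :as str])

;; Slice out every BEGIN_BLOCK...END_BLOCK region by scanning for the
;; markers with index-of and cutting with subs; each slice can then be
;; fed to a smaller, more specific parser.
(defn blocks [s]
  (loop [from 0, acc []]
    (let [b (str/index-of s "BEGIN_BLOCK" from)
          e (when b (str/index-of s "END_BLOCK" b))]
      (if (and b e)
        (let [end (+ e (count "END_BLOCK"))]
          (recur end (conj acc (subs s b end))))
        acc))))

(blocks "BEGIN_BLOCK one END_BLOCK junk BEGIN_BLOCK two END_BLOCK")
;; => ["BEGIN_BLOCK one END_BLOCK" "BEGIN_BLOCK two END_BLOCK"]
```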

@Engelberg
Owner

My instinct would be to use re-seq with an appropriate regex to break it apart. In any case, this strategy of making a high level parse and then handing chunks off to more specific parsers is something I'd like to make easier to do in instaparse. If you have any tips that you've learned from going through it, let me know.
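The re-seq approach can be sketched in a couple of lines (the sample string is made up; a non-greedy regex keeps each match from swallowing the next block):

```clojure
(def source-text
  "BEGIN_BLOCK one END_BLOCK junk BEGIN_BLOCK two END_BLOCK")

;; re-seq returns every non-overlapping match as its own string, each
;; ready to hand off to a smaller, block-specific parser.
(re-seq #"BEGIN_BLOCK[\s\S]*?END_BLOCK" source-text)
;; => ("BEGIN_BLOCK one END_BLOCK" "BEGIN_BLOCK two END_BLOCK")
```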

@chrismurrph

I looked at all the regex functions and none of them tell you the index in the String. I need indexes because I simply want to find 'BEGIN_BLOCK' and 'END_BLOCK' and then take the whole of the middle in between. Perhaps the real problem is that I'm not good enough at regular expressions to match ['BEGIN_BLOCK' - anything but 'END_BLOCK' - 'END_BLOCK'].

@Engelberg
Owner

#"BEGIN_BLOCK[\s\S]*?END_BLOCK"
would do the trick.

The key is the *? combination which is non-greedy, meaning it stops at
the first END_BLOCK it gets to, rather than the last.

Also, if you want indexes, you can call Clojure's re-matcher which will
return an instance of the Matcher class and you can use re-find to
advance the regex and then call the methods .start and .end on it to get
the indexes.

=> (def m (re-matcher #"begin[\s\S]*?end" "begin this is my text end begin and more text end"))
#'interleave.core/m
=> (re-find m)
"begin this is my text end"
=> (.start m)
0
=> (.end m)
25
=> (re-find m)
"begin and more text end"
=> (.start m)
26
=> (.end m)
49


@chrismurrph

One thing you might be interested in (in case you don't already know) is the concept of Cuts from http://www.lihaoyi.com/fastparse/. It seems to me to be like :optimize :memory, but applied not right from the outset - rather from points where the grammar author decides to draw a line under the parsing done so far.

@stereobooster

concept of Cuts from http://www.lihaoyi.com/fastparse/

The old link is broken; here is the new one: https://com-lihaoyi.github.io/fastparse/#Cuts
