
Performance problem #131

Open
russellw opened this issue Apr 12, 2016 · 9 comments

Comments

@russellw

There seems to be a performance problem, though I'm not sure whether it's intrinsic or something I'm doing wrong:

http://stackoverflow.com/questions/36572997/any-way-to-speed-up-instaparse

@Engelberg
Owner

There are a bunch of performance notes here: https://github.com/Engelberg/instaparse/blob/master/docs/Performance.md

Usually an instaparse parser performs linearly with respect to the size of the input up to some point at which it exhausts memory and starts thrashing the garbage collector. Usually this point where it breaks down is when the input gets somewhere between 20k-200k, depending on the complexity of the grammar. (If the parser performance isn't linear on small inputs, then the culprit is usually ambiguity in the grammar.)
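The ambiguity check mentioned above can be done directly with `insta/parses`, which returns every parse tree rather than just the first. A minimal sketch (the grammar here is a deliberately ambiguous toy example, not from the thread):

```clojure
(require '[instaparse.core :as insta])

;; Deliberately ambiguous: an "aaa" input can be split between the two
;; A's in more than one way, so more than one parse tree exists.
(def ambiguous
  (insta/parser
    "S = A A
     A = 'a' | 'a' 'a'"))

;; insta/parses returns a lazy seq of *all* parse trees; a count above 1
;; means the grammar is ambiguous for this input, which is a common cause
;; of non-linear parse times.
(count (insta/parses ambiguous "aaa"))
```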

So a 700k file is really pushing the limits of what instaparse is good for. The reason is that instaparse has really robust backtracking, so it has to consider the possibility that the very last character of the file may force it to reinterpret the way that the entire file is matched against the grammar, conceivably backtracking all the way back to the beginning if necessary. This means maintaining a history of every decision point made for the entire 700k of parsing, which is quite a bit to track.

If the 700k file is comprised of a bunch of individual records, and it is easy to identify the boundaries, the best thing to do is to chop them up into individual records and pass these smaller strings to your instaparse parser.

You can also try to use the :optimize :memory flag (see the Performance doc for more info), which attempts to automate the above strategy, throwing away backtracking history each time a chunk is successfully parsed. That strategy doesn't work on all grammars, but I think with your grammar, it probably would. It's certainly worth a try.

If you get a chance to try the :optimize :memory flag, I'd be interested to know whether it helped. When I first released instaparse, I was mainly thinking in terms of parsing small-ish strings and source-code-sized files (which tend to be under 20k). So it is useful for me to get good test samples of large files in the wild, so I can identify and implement strategies which allow instaparse to perform well for those sizes. :optimize :memory was my first foray in that direction.
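The `:optimize :memory` flag described above is passed as an option to `insta/parse`. A minimal sketch, assuming a hypothetical grammar whose top-level rule is a repetition of records (the shape this optimization exploits):

```clojure
(require '[instaparse.core :as insta])

;; Hypothetical grammar: a file is a sequence of newline-terminated
;; records. This repeating top-level shape is what lets instaparse
;; discard backtracking history after each successfully parsed chunk.
(def records
  (insta/parser
    "file = record+
     record = #'[^\\n]*\\n'"))

;; Same call as a normal parse, plus the :optimize :memory option.
(insta/parse records "first\nsecond\nthird\n" :optimize :memory)
```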

@Engelberg
Owner

BTW, here's a simple example illustrating my point about backtracking. Imagine the following grammar:

S = Option1 | Option2
Option1 = 'a'+
Option2 = 'a'+ 'b'

If you imagine matching this grammar against a 700k string of all a's up until the final 'b', instaparse would first try matching using Option1, and it would hum along merrily until it hit the final 'b', at which point it has to backtrack all the way to the beginning and try Option2.
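The grammar above can be written verbatim as an instaparse parser to see the behavior (a short input is used here; the thread's point is that the backtracking cost grows with input length):

```clojure
(require '[instaparse.core :as insta])

;; The grammar from the comment above, verbatim.
(def p
  (insta/parser
    "S = Option1 | Option2
     Option1 = 'a'+
     Option2 = 'a'+ 'b'"))

;; A run of a's ending in 'b': instaparse tries Option1 first, consumes
;; every 'a', fails at the final 'b', and must backtrack to the start
;; to retry with Option2.
(insta/parse p (str (apply str (repeat 10 "a")) "b"))
```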

This is why most parsing engines require you to write a LL(k) grammar which can determine with certainty what rule to use based on looking ahead at most k characters.

Beyond :optimize :memory, one of my ideals would be to allow the programmer to say "I know this is an LL(1) grammar" or "I know this can be parsed correctly with a packrat strategy", and use an even more efficient underlying strategy. This would allow someone to use the same convenient grammar description and toggle instantly between a bunch of different engines to see what works, with the current engine representing the most flexible parsing strategy.

@russellw
Author

Ah! Okay, I'll switch to handwritten parsers then - I have some files up to hundreds of megabytes to deal with. But can you put a note maybe near the start of the readme that instaparse is only intended for small files?


@chrismurrph

chrismurrph commented Aug 28, 2016

I'm pretty sure I've hit the limit. One more rule and running the parser thrashes my machine, possibly forever, never returning. Using :optimize :memory fixed it. However, because this option is only available for insta/parse (not insta/parses), I now lose the ability to detect multiple parse possibilities as soon as they happen. I think the solution I'll end up with is what you recommend: doing a high-level root parse at the beginning and handing off various bits to more specific parsers. Some of these more specific parsers will be used multiple times.
I've embarked on that root-level parser strategy a bit. The problem I'm finding is that it is difficult for me to parse through 'everything except "END_ROUTINE"'. Is it advisable to break up the original source file with hand-written code that just finds the locations of particular text markers, say using index-of and subs?

I used index-of and subs and was able to break it up. Smaller parsers will be easier to maintain.
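The index-of/subs approach mentioned above can be sketched as follows (the BEGIN_BLOCK/END_BLOCK markers come from the thread; the sample input and the `blocks` helper are hypothetical):

```clojure
(require '[clojure.string :as str])

;; Slice out every BEGIN_BLOCK...END_BLOCK region by scanning for the
;; markers with index-of and cutting with subs; each slice can then be
;; fed to a smaller, more specific parser.
(defn blocks [s]
  (loop [from 0, acc []]
    (let [b (str/index-of s "BEGIN_BLOCK" from)
          e (when b (str/index-of s "END_BLOCK" b))]
      (if (and b e)
        (let [end (+ e (count "END_BLOCK"))]
          (recur end (conj acc (subs s b end))))
        acc))))

(blocks "BEGIN_BLOCK one END_BLOCK junk BEGIN_BLOCK two END_BLOCK")
;; => ["BEGIN_BLOCK one END_BLOCK" "BEGIN_BLOCK two END_BLOCK"]
```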

@Engelberg
Owner

My instinct would be to use re-seq with an appropriate regex to break it apart. In any case, this strategy of making a high level parse and then handing chunks off to more specific parsers is something I'd like to make easier to do in instaparse. If you have any tips that you've learned from going through it, let me know.
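The re-seq approach can be sketched in a couple of lines (the sample string is made up; a non-greedy regex keeps each match from swallowing the next block):

```clojure
(def source-text
  "BEGIN_BLOCK one END_BLOCK junk BEGIN_BLOCK two END_BLOCK")

;; re-seq returns every non-overlapping match as its own string, each
;; ready to hand off to a smaller, block-specific parser.
(re-seq #"BEGIN_BLOCK[\s\S]*?END_BLOCK" source-text)
;; => ("BEGIN_BLOCK one END_BLOCK" "BEGIN_BLOCK two END_BLOCK")
```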

@chrismurrph

I looked at all the regex functions and none of them tell you the index in the String. I need indexes because I simply want to find 'BEGIN_BLOCK' and 'END_BLOCK' and then take the whole of the middle in between. Perhaps the real problem is that I'm not good enough at regular expressions to match ['BEGIN_BLOCK' - anything but 'END_BLOCK' - 'END_BLOCK'].

@Engelberg
Owner

#"BEGIN_BLOCK[\s\S]*?END_BLOCK"
would do the trick.

The key is the *? combination which is non-greedy, meaning it stops at
the first END_BLOCK it gets to, rather than the last.

Also, if you want indexes, you can call Clojure's re-matcher which will
return an instance of the Matcher class and you can use re-find to
advance the regex and then call the methods .start and .end on it to get
the indexes.

=> (def m (re-matcher #"begin[\s\S]*?end" "begin this is my text end begin and more text end"))
#'interleave.core/m
=> (re-find m)
"begin this is my text end"
=> (.start m)
0
=> (.end m)
25
=> (re-find m)
"begin and more text end"
=> (.start m)
26
=> (.end m)
49


@chrismurrph

One thing you might be interested in (in case you don't already know) is the concept of Cuts from http://www.lihaoyi.com/fastparse/. It seems to me to be like :optimize :memory, but applied not right from the outset - rather from points where the grammar author decides to draw a line under the parsing done so far.

@stereobooster

concept of Cuts from http://www.lihaoyi.com/fastparse/

The old link is broken; here is the new one: https://com-lihaoyi.github.io/fastparse/#Cuts
