
global implicit `tokens`

Sam Atman edited this page May 22, 2013 · 1 revision

This is a discussion of a feature that doesn't exist.

@Engelberg has indicated that this could be added but is concerned that it will complicate understanding of instaparse. That's true, and there's no way around it: a new class of ability requires a deeper level of understanding.

So this is a documentation-first foray into whether the tradeoff is worth it. Parentheticals are reality; bare statements are mock documentation.

## Global Implicit Tokens

instaparse, unlike many similar programs, does not have a separate lexing pass. This is part of the charm of the program: it gives great flexibility in realizing grammars. It is, however, a conscious tradeoff.

A primary reason lexers exist is to remove tokens that aren't part of the grammar rules proper before the parser gets to them. A lexer requires you to decide what a token is ahead of time, which instaparse does not, but it provides one great convenience in particular: you can declare whitespace once and have it removed from the token stream.
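To make that convenience concrete, here is a minimal sketch in Python of a conventional lexing pass. It is not how instaparse works, and the token classes (numbers, identifiers, a few operators, C-style block comments) are assumptions chosen only for illustration. Whitespace and comments are declared once and never reach the parser.

```python
import re

# Hypothetical token definition: whitespace and /* ... */ comments are matched
# but not captured; only group 1 holds tokens the parser would see.
TOKEN = re.compile(r"\s+|/\*.*?\*/|(\d+|[A-Za-z_]\w*|[-+*/()=])", re.DOTALL)

def lex(text):
    """Split text into tokens, silently discarding whitespace and comments."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        if m.group(1) is not None:  # skip the uncaptured whitespace/comment matches
            tokens.append(m.group(1))
        pos = m.end()
    return tokens

tokens = lex("x = 1 /* note */ + 2")
```

A parser consuming `tokens` never has to mention whitespace, which is the convenience a grammar without a lexing pass gives up.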

(instaparse, at present, requires you to define a WS rule, which should probably include your block comments, and then place it as WS* in every single possible place you might find whitespace. This can give a grammar where 30% or more of the tokens are WS* for common use cases, since nearly all programming languages and many data formats are premised on a tokenizing pass with whitespace removal.)

instaparse therefore provides a special back-tick quotation form for literals and regular expressions. If you are parsing an improbable language where foo may show up anywhere at any time, `` foo-rule : `foo` `` will check for 'foo' only after checking for every other possible literal or regular-expression match.

To picture what's happening, it helps to think of instaparse as having a lazy tokenizer. I didn't write it, but I gather this is how it works, because in order to consume a string, you simply must match the pieces of it in literal or regular terms. There is no other way to move forward.

If the tokenizer runs out of options, instaparse throws an error. Global implicit rules are the last rules the tokenizer tries before giving up. So if you specify `another-foo-rule : "foo"` and place it into your grammar, `another-foo-rule` will always be reached before the (global, implicit) `` `foo` ``.
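The ordering described above can be modelled with a toy scanner. This is a sketch in Python under loud assumptions: it is not instaparse's implementation, `next_token` and `scan` are hypothetical names, and the flat `explicit` list merely stands in for whatever literals and regular expressions the grammar would allow at the current position. The point is only the priority: explicit matchers first, global implicit matchers last, then an error.

```python
import re

def next_token(text, pos, explicit, implicit):
    """Try explicit matchers first, then global implicit ones, at position pos."""
    for kind, matchers in (("explicit", explicit), ("implicit", implicit)):
        for name, pattern in matchers:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos:  # ignore empty matches so the scan always advances
                return kind, name, m.group(0)
    raise SyntaxError(f"no rule matches at position {pos}")

def scan(text, explicit, implicit):
    """Consume the whole string, recording which kind of rule matched each piece."""
    pos, out = 0, []
    while pos < len(text):
        kind, name, lexeme = next_token(text, pos, explicit, implicit)
        out.append((kind, name, lexeme))
        pos += len(lexeme)
    return out

# Hypothetical rule tables: "foo" appears both as an explicit rule and as a
# global implicit token, so the explicit rule should win.
explicit = [("another-foo-rule", "foo"), ("number", r"\d+")]
implicit = [("foo", "foo"), ("ws", r"\s+")]
result = scan("foo 42", explicit, implicit)
```

Here `result` records `foo` as matched by the explicit `another-foo-rule`, the space as matched by the implicit `ws`, and `42` by the explicit `number`. Dropping `another-foo-rule` from `explicit` lets the implicit `foo` pick up the same input as a last resort.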

(provide code example, including an equivalent grammar defined literally w/o backticks)
