Store keywords in enum ~2x perf. improvement #193

Dandandan · 2020-06-08T06:59:02Z

Proposal:

Store keywords in an enum.
Store the enum values in a sorted array so that keywords can be looked up.

This provides some benefits:

Performance is likely better thanks to far fewer string comparisons, not storing strings in the AST, etc., plus slightly lower memory use. Consuming the AST will also be faster (for hashing, traversal, etc.).
An enum is easier to use than strings. This can also result in fewer typos and better auto-completion.
We can see whether there are unused keywords.
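A minimal sketch of the proposal, assuming a heavily shortened keyword list. The names `KEYWORD_STRINGS`, `KEYWORDS`, and `lookup_keyword` are hypothetical and not necessarily what the PR uses; the point is only that a word is resolved to an enum value once, through a binary search over a sorted array.

```rust
// Hypothetical, heavily shortened keyword list for illustration only.
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Keyword {
    AND,
    FROM,
    SELECT,
    WHERE,
}

// Keyword spellings, kept sorted so that binary search is valid.
const KEYWORD_STRINGS: [&str; 4] = ["AND", "FROM", "SELECT", "WHERE"];

// Enum values in the same order as KEYWORD_STRINGS.
const KEYWORDS: [Keyword; 4] = [
    Keyword::AND,
    Keyword::FROM,
    Keyword::SELECT,
    Keyword::WHERE,
];

/// Resolve a word to a keyword once, so later code compares enum values
/// instead of strings.
pub fn lookup_keyword(word: &str) -> Option<Keyword> {
    let upper = word.to_uppercase();
    KEYWORD_STRINGS
        .binary_search(&upper.as_str())
        .ok()
        .map(|idx| KEYWORDS[idx])
}
```

After this lookup, every later check is an integer comparison on `Keyword` rather than a string comparison.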

Any ideas on this?

Dandandan · 2020-06-08T15:41:43Z

Performance-wise it's about 2x as fast for the two example queries.

coveralls · 2020-06-08T16:10:20Z

Pull Request Test Coverage Report for Build 132507543

274 of 311 (88.1%) changed or added relevant lines in 4 files are covered.
2 unchanged lines in 1 file lost coverage.
Overall coverage decreased (-0.5%) to 91.421%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
src/parser.rs	267	304	87.83%

Files with Coverage Reduction	New Missed Lines	%
src/parser.rs	2	88.68%

Totals
Change from base Build 130895395:	-0.5%
Covered Lines:	4039
Relevant Lines:	4418

💛 - Coveralls

nickolay · 2020-06-09T22:07:14Z

Sorry about the delay, this took some thinking.

I like this direction, but not because of the reasons you list:

a large part of the perf improvement can be obtained by changing assert!s to be debug_assert!s (do you have a real performance problem anyway?)
I don't think we store the keywords in the AST, and
if we ever want to live up to the "extensible" promise, we'll have to ditch the central list of keywords; I consider the enum just a convenient way to assign numeric IDs to the keywords for now.

Rather, I like that this will allow matching on regular Tokens and keyword tokens in a single match, without the nested match (match tok { Token::Word(w) => match w.keyword) and the associated verbosity and duplication, which is not possible with String keywords.

macro_rules! patT { // for use in pattern position (vs T![] in expression position)
    ($kw:ident) => {
        Token::Word(Word { keyword: Some(kw![$kw]), .. })
    }
    // other token types can be handled here, e.g. `patT![+]` -- inspired by https://github.com/rust-analyzer/rust-analyzer/blob/3a7c218fd40c77246c94d28b36b1c567492e5bcb/crates/ra_parser/src/grammar.rs#L96
}
match tok {
    patT![TRUE] | patT![FALSE] | patT![NULL] => { /* ... */ }
    // ...
}

I think we should wait for the current PRs (except for my WIP one) to be finished before merging such major changes. And I'll need to think more about the specifics of this PR.
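The single-level match described in this comment can be sketched as a self-contained example. The `Token`, `Word`, and `Keyword` types below are simplified, hypothetical stand-ins for the crate's real ones, and the macro refers to `Keyword::$kw` directly instead of going through a `kw!` helper.

```rust
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Keyword {
    TRUE,
    FALSE,
    NULL,
    SELECT,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Word {
    pub value: String,
    pub keyword: Option<Keyword>,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Word(Word),
    Plus,
    EOF,
}

// Pattern-position macro: expands to a pattern matching one keyword token.
macro_rules! patT {
    ($kw:ident) => {
        Token::Word(Word { keyword: Some(Keyword::$kw), .. })
    };
}

/// Keyword tokens and regular tokens are handled in one flat match,
/// with no nested `match w.keyword`.
pub fn classify(tok: &Token) -> &'static str {
    match tok {
        patT![TRUE] | patT![FALSE] | patT![NULL] => "literal keyword",
        Token::Word(_) => "other word",
        Token::Plus => "operator",
        Token::EOF => "end of input",
    }
}
```

This only works because the keyword is an enum value: a `String` field cannot be matched against a literal inside a struct pattern.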

nickolay · 2020-06-10T08:48:34Z

For this PR I'd like to:

Separate the mechanical changes (replacing literals like "KEYWORD" with a constant) from the other changes, with the mechanical changes isolated to a commit of its own, perhaps by using a macro instead of AllKeyWords::KEYWORD
Add a not-a-keyword variant to the enum itself, instead of using Option everywhere. I'll post a PR (update: Use Token::EOF instead of Option<Token> #195) removing the Option<> wrapper for Tokens (where None represents EOF), which demonstrates that this leads to simpler code with simpler matches.
Rename AllKeyWords -> Keyword

Are you willing to work on that?

Next we'll be able to do the match simplification mentioned in my previous comment and, I hope, figure out a way to remove specialized parser methods for keywords, merging consume_token and parse_keyword, for example, into a single method.

The clone kludge in `_ => self.expected("date/time field", Token::Word(w.clone())))` will become unnecessary once we stop using a separate match for the keywords, as suggested in https://github.com/andygrove/sqlparser-rs/pull/193#issuecomment-641607194

Dandandan · 2020-06-10T09:23:25Z

Sounds like a good way forward! This will provide both the ergonomic benefits and the performance benefit (though I agree the latter is less relevant for many users and for further development).

nickolay · 2020-06-10T09:50:38Z

Great! How do you want to proceed? Would you like me to hold off merging #195 (I can rebase it once this is landed, if you prefer)?

Also note that I didn't mean to dismiss the performance requirements of certain users, I just want to understand what those are - so that we can prioritize appropriately.

Dandandan · 2020-06-10T10:24:28Z

I am fine with merging #195 first!
We can then implement the changes one by one, as you suggested, building on the ideas behind this PR.

As for performance, the biggest change comes from reducing the number of string comparisons by converting them as early as possible.

For your assert! suggestion: are you aware of any expensive asserts?

Dandandan · 2020-06-10T10:27:13Z

Ah, I think you mean:

assert!(keywords::ALL_KEYWORDS.contains(&expected));

Which was there before this PR.
Yeah, this is probably a large factor in the perf. change.


nickolay · 2020-06-10T13:29:39Z

Ah, I think you mean: assert!(keywords::ALL_KEYWORDS.contains(&expected));

That's the one. There's a similar one in parse_one_of_keywords. I measured the effect of simply removing them and it was around 70% I think. Comparing strings is not as good as comparing integers, but it's quite efficient (I don't remember where I first read about it, but I found this description now)
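To illustrate the point about the assertion cost, here is a minimal sketch. `ALL_KEYWORDS` is a shortened, hypothetical stand-in for the crate's list and `is_keyword` is an invented helper, not the parser's actual method. The check is a linear scan over the keyword list on every call; `debug_assert!` keeps it in debug builds and compiles it out when debug assertions are disabled, as in default release builds.

```rust
// Hypothetical stand-in for the crate's full keyword list.
const ALL_KEYWORDS: &[&str] = &["AND", "FROM", "SELECT", "WHERE"];

/// Returns true if `word` is the expected keyword (case-insensitive).
pub fn is_keyword(word: &str, expected: &'static str) -> bool {
    // Sanity check on the caller: a linear scan with string comparisons.
    // With `assert!` this runs in every build; with `debug_assert!` it is
    // skipped when debug assertions are off.
    debug_assert!(ALL_KEYWORDS.contains(&expected));
    word.eq_ignore_ascii_case(expected)
}
```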

Dandandan · 2020-06-10T14:08:59Z

Yeah, I see, that will mostly have the same impact.

So, the other checks will mostly convert to pointer/length or pointer/length/memcmp, but indeed, those should be cheap compared to the assert!s

Dandandan · 2020-06-10T20:28:04Z

I implemented the changes and merged with master:

AllKeyWords -> Keyword
Using NoKeyword rather than Option<Keyword>
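A rough sketch of what the `NoKeyword` variant buys, using simplified, hypothetical `Word` and `Keyword` types rather than the crate's real definitions: the match no longer has to unwrap an `Option` before looking at the keyword.

```rust
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Keyword {
    // Marks a word that is not a keyword, replacing `Option<Keyword>::None`.
    NoKeyword,
    FROM,
    SELECT,
}

pub struct Word {
    pub value: String,
    pub keyword: Keyword,
}

/// One flat match covers both keywords and plain identifiers.
pub fn describe(word: &Word) -> String {
    match word.keyword {
        Keyword::SELECT | Keyword::FROM => format!("keyword {}", word.value.to_uppercase()),
        Keyword::NoKeyword => format!("identifier {}", word.value),
    }
}
```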

Is there anything you think that is worth splitting into another PR?
I am not sure whether we can really split anything into another commit, except maybe the switch from contains to binary_search for the reserved keywords. The other changes are all related to changing the keyword to this Keyword type.
The assert!s that used to be there could also stay (or be converted to debug_assert!), although they make less sense now!

Also, I am not sure whether we should change anything about RESERVED_FOR_TABLE_ALIAS and RESERVED_FOR_COLUMN_ALIAS, as they are listed just once.

Dandandan · 2020-06-10T20:41:35Z

Another style improvement might be to change

make_word(word: &str, quote_style: Option<char>)

into

make_word_quoted(word: &str, quote_style: char)

and

make_word(word: &str)

or something like that.

nickolay

The other changes are all related to changing the keyword to this Keyword type.

Related - sure they are. There are lots of mechanical changes though, and it's been hard to find the manual changes among the automated ones, and it's those changes that need review -- I had comments about the ones I did find.

If you had used KWD![FOO] instead of Keyword::FOO, you could have made one commit introducing KWD and replacing "KEYWORD" with KWD![KEYWORD], but keeping the string keywords, and another commit switching the keyword storage.
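The two-commit split described here could look roughly like the sketch below. The macro body is an assumption about how such a `KWD!` helper might be written, and `is_select` is an invented call site. In the first, purely mechanical commit the macro still expands to a string literal, so behavior is unchanged.

```rust
// Stage 1: keywords are still strings; the macro only hides the spelling.
macro_rules! KWD {
    ($kw:ident) => {
        stringify!($kw)
    };
}

/// A call site after the mechanical replacement of "SELECT" with KWD![SELECT].
pub fn is_select(word: &str) -> bool {
    word.eq_ignore_ascii_case(KWD![SELECT])
}

// Stage 2 (a later commit) would change the macro body, for example to
// `Keyword::$kw`, and adjust the comparison sites to the new type.
```

With this split, the second commit contains only the manual changes that need review.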

I don't think it's necessary now, but I'd appreciate if you highlighted the spots in parser.rs where you did manual changes in case I missed some.

src/parser.rs

src/dialect/keywords.rs

nickolay · 2020-06-10T21:09:07Z

The assert!s that used to be there could also stay (or be converted to debug_assert!), although they make less sense now!

Yep, no point in keeping them.

make_word_quoted(word: &str, quote_style: char)

No, I don't think it's worth it, given it's a small helper that's used only a few times.


nickolay · 2020-06-11T17:41:33Z

@Dandandan are you still working on this or is it ready to merge?

Dandandan · 2020-06-11T17:56:08Z

Just addressed the comment "I'd like to keep the logical grouping and ordering here, as well as the comments." by reverting the change to use binary search for the reserved keywords. It seems to affect performance only a little (because the lists are relatively small), and it makes sense to keep related keywords together.

So now it is good to go for me!

I also experimented a bit with using a lazy_static HashMap for the keywords map and a HashSet for the reserved keywords, but I think those changes are a bit too much for this PR.
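For reference, a lazily initialized keyword map could be sketched as follows. This is not the experiment mentioned above: it swaps in the standard library's `std::sync::OnceLock` instead of the `lazy_static` crate so the example has no dependencies, and the keyword list and function names are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Keyword {
    FROM,
    SELECT,
    WHERE,
}

/// Builds the map on first use and reuses it afterwards.
fn keyword_map() -> &'static HashMap<&'static str, Keyword> {
    static MAP: OnceLock<HashMap<&'static str, Keyword>> = OnceLock::new();
    MAP.get_or_init(|| {
        HashMap::from([
            ("FROM", Keyword::FROM),
            ("SELECT", Keyword::SELECT),
            ("WHERE", Keyword::WHERE),
        ])
    })
}

/// Hash lookup instead of a binary search over a sorted array.
pub fn lookup(word: &str) -> Option<Keyword> {
    keyword_map().get(word.to_uppercase().as_str()).copied()
}
```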

nickolay · 2020-06-11T19:04:56Z

Awesome, thanks!

Dandandan · 2020-06-11T20:57:53Z

Thanks for the extensive feedback!

Dandandan added 3 commits June 8, 2020 08:52

Store keywords in enum

959e9d0

Finalize conversion

b42e103

Fix RESERVED_FOR_COLUMN_ALIAS

547a6da

Fix last tests

582ca86

Dandandan changed the title ~~[WIP] Store keywords in enum~~ Store keywords in enum Jun 8, 2020

Dandandan added 4 commits June 8, 2020 23:32

Handle other tables in same way

46f88c5

Fix style issue

6afcbeb

Fix style issue

2f85180

Fix format issue

4884156

Dandandan changed the title ~~Store keywords in enum~~ Store keywords in enum ~2x perf. improvement Jun 9, 2020

Dandandan added 4 commits June 10, 2020 19:31

Merge changes with master, add Keyword::NoKeyword

4c793b9

Remove explicit clones

3dd467a

Remove reduntant clones

ca3c10d

Remove reduntant clones

df6d60d

nickolay suggested changes Jun 10, 2020


Address review comments, do not use ::parse but create normal parse function

b16c17b

Group together related keywords

b03f475

Dandandan requested a review from nickolay June 11, 2020 17:58

Dandandan and others added 4 commits June 11, 2020 19:59

Fix Range -> Groups

14da209

Add a test to ensure GROUPS in window specification round-trips

6ac8879

Change the order in RESERVED_FOR_... keyword lists to match

fea2f91

Avoid unnecessary wildcard imports in tests

b551997

nickolay merged commit 34548e8 into apache:master Jun 11, 2020

nickolay mentioned this pull request Jun 12, 2020

Make FileFormat case insensitive #200

Merged
