Skip to content

Store keywords in enum ~2x perf. improvement #193

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Jun 11, 2020

Conversation

Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Jun 8, 2020

Proposal:

  • Store keywords in an enum.

  • Store enum in a sorted array in order to do a lookup

This provides some benefits:

  • Performance is likely better by doing much fewer string comparisons, not storing strings in AST, etc and a bit less memory. Also using the AST as a consumer will be faster (for hashing, traversal, etc).
  • Enum can be used more easily than strings. This can also result in less typo's, better auto completion.
  • We can see whether there are unused keywords

Any ideas on this?

@Dandandan
Copy link
Contributor Author

Performance-wise it's about 2x as fast for the two example queries.

@coveralls
Copy link

coveralls commented Jun 8, 2020

Pull Request Test Coverage Report for Build 132507543

  • 274 of 311 (88.1%) changed or added relevant lines in 4 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.5%) to 91.421%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/parser.rs 267 304 87.83%
Files with Coverage Reduction New Missed Lines %
src/parser.rs 2 88.68%
Totals Coverage Status
Change from base Build 130895395: -0.5%
Covered Lines: 4039
Relevant Lines: 4418

💛 - Coveralls

@Dandandan Dandandan changed the title [WIP] Store keywords in enum Store keywords in enum Jun 8, 2020
@Dandandan Dandandan changed the title Store keywords in enum Store keywords in enum ~2x perf. improvement Jun 9, 2020
@nickolay
Copy link
Contributor

nickolay commented Jun 9, 2020

Sorry about the delay, this took some thinking.

I like this direction, but not because of the reasons you list:

  • a large part of the perf improvement can be obtained by changing assert!s to be debug_assert!s (do you have a real performance problem anyway?)
  • I don't think we store the keywords in the AST, and
  • if we ever want to live up to the "extensible" promise, we'll have to ditch the central list of keywords; I consider the enum is just a convenient way to assign numeric IDs to the keywords for now.

Rather, I like that this will allow matching on regular Tokens and keyword tokens in a single match, without the nested match (match tok { Token::Word(w) => match w.keyword) and the associated verbosity and duplication, which is not possible with String keywords.

macro_rules! patT { // for use in pattern position (vs T![] in expression position)
    ($kw:ident) => {
        Token::Word(Word { keyword: Some(kw![$kw]), .. })
    }
    // other token types can be handled here, e.g. `patT![+]` -- inspired by https://github.com/rust-analyzer/rust-analyzer/blob/3a7c218fd40c77246c94d28b36b1c567492e5bcb/crates/ra_parser/src/grammar.rs#L96
}
match tok {
    patT![TRUE] | patT![FALSE] | patT![NULL] => {

I think we should wait for the current PRs (except for my WIP one) to be finished before merging such major changes. And I'll need to think more about the specifics of this PR.

@nickolay
Copy link
Contributor

nickolay commented Jun 10, 2020

For this PR I'd like to:

  • Separate the mechanical changes (replacing literals like "KEYWORD" with a constant) from the other changes, with the mechanical changes isolated to a commit of its own, perhaps by using a macro instead of AllKeyWords::KEYWORD
  • Add a not-a-keyword variant to the enum itself, instead of using Option everywhere. I'll post a PR (update: Use Token::EOF instead of Option<Token> #195) removing the Option<> wrapper for Tokens (where None represents EOF), which demonstrates that this leads to simpler code with simpler matches.
  • Rename AllKeyWords -> Keyword

Are you willing to work on that?

Next we'll be able to do the match simplification mentioned in my previous comment and, I hope, figure out a way to remove specialized parser methods for keywords, merging consume_token and parse_keyword, for example, into a single method

nickolay referenced this pull request in nickolay/sqlparser-rs Jun 10, 2020
The clone kludge in `_ => self.expected("date/time field",
Token::Word(w.clone())))` will become unnecessary once we stop using
a separate match for the keywords, as suggested in
https://github.com/andygrove/sqlparser-rs/pull/193#issuecomment-641607194
@Dandandan
Copy link
Contributor Author

Sounds like a good way forward! This will provide both of the ergonomical benefits and (though I agree of less relevance for many users and for further development) performance benefit.

@nickolay
Copy link
Contributor

Great! How do you want to proceed? Would you like me to hold off merging #195 (I can rebase it once this is landed, if you prefer)?

Also note that I didn't mean to dismiss the performance requirements of certain users, I just want to understand what those are - so that we can prioritize appropriately.

@Dandandan
Copy link
Contributor Author

I am fine with merging #195 first!
We can implement the changes with some of the suggestions one by one as you suggested based on ideas behind this PR.

As for performance, the biggest change comes from reducing the number of string comparisons by converting them as early as possible.

For your assert! suggestion: are you aware of any expensive asserts?

@Dandandan
Copy link
Contributor Author

Dandandan commented Jun 10, 2020

Ah, I think you mean :

assert!(keywords::ALL_KEYWORDS.contains(&expected));

Which is there before this PR.
Yeah, this probably is a large factor of the perf. change.

nickolay referenced this pull request Jun 10, 2020
This simplifies codes slightly, removing the need deal with the EOF case explicitly.

The clone kludge in `_ => self.expected("date/time field",
Token::Word(w.clone())))` will become unnecessary once we stop using
a separate match for the keywords, as suggested in
https://github.com/andygrove/sqlparser-rs/pull/193#issuecomment-641607194
@nickolay
Copy link
Contributor

Ah, I think you mean : assert!(keywords::ALL_KEYWORDS.contains(&expected));

That's the one. There's a similar one in parse_one_of_keywords. I measured the effect of simply removing them and it was around 70% I think. Comparing strings is not as good as comparing integers, but it's quite efficient (I don't remember where I first read about it, but I found this description now)

@Dandandan
Copy link
Contributor Author

Yeah, I see, that will mostly have the same impact.

So, the other checks will mostly convert to pointer/length or pointer/length/memcmp, but indeed, those should be cheap compared to the assert!s

@Dandandan
Copy link
Contributor Author

I Implemented the changes + merged with master:

  • AllKeyWords -> Keyword
  • Using NoKeyword rather than Option<Keyword>

Is there anything you think that is worth splitting into another PR?
I am not sure whether we can really split something into another commit, except maybe the contains vs to binary_search in the reserved keywords. The other changes are all related to changing the keyword to this Keyword type.
The asserts! that used to be there also could stay (or converted to debug_assert), although they make less sense now!

Also I am not sure whether we should change anything to the RESERVED_FOR_TABLE_ALIAS RESERVED_FOR_COLUMN_ALIAS as they are listed just once.

@Dandandan
Copy link
Contributor Author

Some style improvement might also be to change

make_word(word: &str, quote_style: Option<char>)

into

make_word_quoted(word: &str, quote_style: char)

and

make_word(word: &str)

or something like that.

Copy link
Contributor

@nickolay nickolay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other changes are all related to changing the keyword to this Keyword type.

Related - sure they are. There are lots of mechanical changes though, and it's been hard to find the manual changes among the automated ones, and it's those changes that need review -- I had comments about the ones I did find.

If you had used KWD![FOO] instead ofKeyword::FOO, you could have made one commit introducing KWD and replacing "KEYWORD" with KWD![KEYWORD], but keeping the string keywords, and another committing switching the keyword storage.

I don't think it's necessary now, but I'd appreciate if you highlighted the spots in parser.rs where you did manual changes in case I missed some.

@nickolay
Copy link
Contributor

The asserts! that used to be there also could stay (or converted to debug_assert), although they make less sense now!

Yep, no point in keeping them.

make_word_quoted(word: &str, quote_style: char)

No, I don't think it's worth it, given it's a small helper that's used only a few times.

@nickolay
Copy link
Contributor

@Dandandan are you still working on this or is it ready to merge?

@Dandandan
Copy link
Contributor Author

Just addressed the comment "I'd like to keep the logical grouping and ordering here, as well as the comments." by reverting the change to use binary search for the reserved keywords. It seems to affect performance only little (because they are relatively small in size), but makes sense to have them together.

So now it is good to go for me!

I also experimented a bit with using lazy_static HashMap for the keywords map and HashSet for reserved keywords, but the changes are a bit too much I think.

@Dandandan Dandandan requested a review from nickolay June 11, 2020 17:58
@nickolay nickolay merged commit 34548e8 into apache:master Jun 11, 2020
@nickolay
Copy link
Contributor

Awesome, thanks!

@Dandandan
Copy link
Contributor Author

Thanks for the extensive feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants