Exhaustive MySQL Parser (#157)
## Context
This PR ships an exhaustive MySQL **lexer** and **parser** that produce
a MySQL query AST. This is the first step to significantly improve MySQL
compatibility and expand WordPress plugin support on SQLite. It's a
simpler, more stable, and easier-to-maintain approach than the current
token processing. It will also dramatically improve the WordPress
Playground experience, where database integration is the single largest
source of issues.

This PR is part of the [Advanced MySQL support
project](#162).

See the [MySQL parser
proposal](#106 (comment))
for additional context.

## This PR ships
1. A **MySQL lexer**, adapted from the AI-generated one by @adamziel.
The adapted lexer is over 3x smaller and close to 2x faster.
2. A **MySQL grammar** written in ANTLR v4 format, adapted from the
[MySQL Workbench
grammar](https://github.com/mysql/mysql-workbench/blob/8.0.38/library/parsers/grammars/MySQLParser.g4)
by adding and fixing some cases and reordering some rules.
3. A **script to factor, convert, and compress the grammar** to a PHP
array.
4. A **dynamic recursive parser** implemented by @adamziel.
5. A **script to extract tests** from the MySQL repository.
6. A **test suite of almost 70k queries**.
7. WIP **SQLite driver** by @adamziel, a demo and foundation for the
next phase.

At the moment, all the new files are omitted from the plugin build, so
they have no effect on production whatsoever.

## Running tests
The lexer & parser test suite is not yet integrated into the CI and
existing test commands. To run the tests, use:
```bash
php tests/parser/run-lexer-tests.php
php tests/parser/run-parser-tests.php
```
The first command lexes all the ~70k queries; the second one lexes and
parses them.

## Implementation

### Parser

A simple recursive parser that transforms `(token stream, grammar) =>
parse tree`. In this PR, we use MySQL tokens and the MySQL grammar, but
the same parser could also support XML, IMAP, and many other grammars (as
long as they have some specific properties).

The `parse_recursive()` method is just 100 lines of code (excluding
comments). All of the parsing rules are provided by the grammar.
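
To illustrate the idea of a parser whose rules come entirely from the grammar, here is a minimal sketch. It is not the `parse_recursive()` method from this PR: the toy grammar, the `parse_rule()` function, and the use of plain strings as tokens are all assumptions made for the example, and it simply takes the first branch that matches without any lookahead or ambiguity handling.

```php
<?php
// Hypothetical toy grammar: rule name => list of branches. Each branch is a
// sequence of symbols; a symbol that is not a rule name is a terminal.
// An empty branch [] stands for ε.
$grammar = [
    'query'      => [['SELECT', 'columns', 'FROM', 'ID']],
    'columns'    => [['ID', 'columns_rr']],
    'columns_rr' => [[',', 'ID', 'columns_rr'], []],
];

/**
 * Tries to match $symbol at position $pos of the token list.
 * Returns [parse tree, next position] on success, or null on failure.
 */
function parse_rule(array $grammar, array $tokens, string $symbol, int $pos) {
    if (!isset($grammar[$symbol])) {
        // Terminal: it must equal the current token.
        if ($pos < count($tokens) && $tokens[$pos] === $symbol) {
            return [$symbol, $pos + 1];
        }
        return null;
    }

    // Non-terminal: try each branch in order and keep the first full match.
    foreach ($grammar[$symbol] as $branch) {
        $children = [];
        $cursor   = $pos;
        $matched  = true;
        foreach ($branch as $child_symbol) {
            $result = parse_rule($grammar, $tokens, $child_symbol, $cursor);
            if ($result === null) {
                $matched = false;
                break;
            }
            $children[] = $result[0];
            $cursor     = $result[1];
        }
        if ($matched) {
            return [[$symbol => $children], $cursor];
        }
    }
    return null;
}

// Usage: token types for `SELECT a, b FROM t`.
$tokens = ['SELECT', 'ID', ',', 'ID', 'FROM', 'ID'];
$result = parse_rule($grammar, $tokens, 'query', 0);
```

The function itself knows nothing about SQL; swapping the `$grammar` array is enough to parse a different language, which is the property described above.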

### run-mysql-driver.php

A quick and dirty implementation of what a `MySQL parse tree ➔ SQLite`
database driver could look like. It easily supports `WITH` and `UNION`
queries that would be really difficult to implement in the current
SQLite integration plugin.

The tree transformation is an order of magnitude easier to read, expand,
and maintain than the current implementation. I stand by this, even
though the temporary `ParseTreeTools`/`SQLiteTokenFactory` API included
in this PR seems annoying, and I'd like to ship something better than
that. Here's a glimpse:

```php
function translateQuery($subtree, $rule_name=null) {
    // Leaf nodes are tokens; translate them directly.
    if(is_token($subtree)) {
        $token = $subtree;
        switch ($token->type) {
            case MySQLLexer::EOF: return new SQLiteExpression([]);
            case MySQLLexer::IDENTIFIER:
                return SQLiteTokenFactory::identifier(
                    SQLiteTokenFactory::identifierValue($token)
                );

            default:
                return SQLiteTokenFactory::raw($token->text);
        }
    }

    // Everything else is a subtree produced by the grammar rule $rule_name.
    $ast = $subtree;
    switch($rule_name) {
        case 'indexHintList':
            // SQLite doesn't support index hints. Let's
            // skip them.
            return null;

        case 'fromClause':
            // Skip `FROM DUAL`. We only care about a singular
            // FROM DUAL statement, as FROM mytable, DUAL is a syntax
            // error.
            if(
                ParseTreeTools::hasChildren($ast, MySQLLexer::DUAL_SYMBOL) &&
                !ParseTreeTools::hasChildren($ast, 'tableReferenceList')
            ) {
                return null;
            }
            // Without this break, execution would fall through
            // into the 'functionCall' case.
            break;

        case 'functionCall':
            $name = $ast[0]['pureIdentifier'][0]['IDENTIFIER'][0]->text;
            return translateFunctionCall($name, $ast[0]['udfExprList']);
    }
}
```

## Technical details

### MySQL Grammar

We use the [MySQL Workbench
grammar](https://github.com/mysql/mysql-workbench/blob/8.0/library/parsers/grammars/MySQLParser.g4),
manually adapted, modified, and fixed, and then converted from the ANTLR4
format to a PHP array.

The grammar conversion pipeline is done by `convert-grammar.php` and
goes like this:

1. Parse MySQLParser.g4 grammar into a PHP tree.
2. Flatten the grammar so that any nested rules become top-level and are
referenced by generated names. This factors compound rules into separate
rules, e.g. `query ::= SELECT (ALL | DISTINCT)` becomes `query ::=
SELECT %select_fragment0` and `%select_fragment0 ::= ALL | DISTINCT`.
3. Expand `*`, `+`, `?` modifiers into separate, right-recursive rules.
For example, `columns ::= column (',' column)*` becomes `columns ::=
column columns_rr` and `columns_rr ::= ',' column columns_rr | ε`.
4. Compress and export the grammar as a PHP array. It replaces all
string names with integers and ships an int->string map to reduce the
file size.
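
The modifier-expansion and compression steps above can be sketched roughly as follows. This is not the code from `convert-grammar.php`: the `expand_modifier()` and `compress_grammar()` functions, the `_rr`/`_opt` naming scheme, and the array layout are assumptions chosen for illustration only.

```php
<?php
/**
 * Expands a quantified symbol sequence into a helper rule.
 * Returns [replacement symbols, new helper rules]. An empty branch [] is ε.
 */
function expand_modifier(string $rule_name, array $sequence, string $modifier): array {
    $helper = $rule_name . '_rr';
    // helper ::= sequence helper | ε
    $star_rule = [$helper => [array_merge($sequence, [$helper]), []]];
    switch ($modifier) {
        case '*':
            // x* becomes: helper
            return [[$helper], $star_rule];
        case '+':
            // x+ becomes: x helper
            return [array_merge($sequence, [$helper]), $star_rule];
        case '?':
            // x? becomes: opt, where opt ::= x | ε (no recursion needed)
            $opt = $rule_name . '_opt';
            return [[$opt], [$opt => [$sequence, []]]];
    }
    throw new InvalidArgumentException("Unknown modifier: $modifier");
}

/**
 * Replaces every rule and symbol name with an integer ID and returns
 * the compressed rules together with an int => string name map.
 */
function compress_grammar(array $grammar): array {
    $ids   = []; // name => integer ID, assigned in order of first appearance
    $id_of = function (string $name) use (&$ids) {
        if (!isset($ids[$name])) {
            $ids[$name] = count($ids);
        }
        return $ids[$name];
    };

    $rules = [];
    foreach ($grammar as $name => $branches) {
        $rule_id         = $id_of($name);
        $rules[$rule_id] = [];
        foreach ($branches as $branch) {
            $rules[$rule_id][] = array_map($id_of, $branch);
        }
    }
    return ['names' => array_flip($ids), 'rules' => $rules];
}

// Usage: columns ::= column (',' column)*
list($replacement, $helpers) = expand_modifier('columns', [',', 'column'], '*');
$grammar = array_merge(
    ['columns' => [array_merge(['column'], $replacement)]],
    $helpers
);
$compressed = compress_grammar($grammar);
```

Keeping the name map next to the integer rules lets a parser work on integers only and look names up when it needs readable output, such as a parse tree keyed by rule name.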

The `mysql-grammar.php` file is ~70 KB in size, which is small
enough. The parser can handle about 1000 complex SELECT queries per
second on a MacBook Pro. It only took a few easy optimizations to go
from 50/second to 1000/second. There are many further optimization
opportunities once we need more speed. We could factor the grammar in
different ways, explore other types of lookahead tables, or memoize the
matching results per token. However, I don't think we need to do that in
the short term. If we spend enough time factoring the grammar, we could
potentially switch to a LALR(1) parser and cut most of the time spent on
dealing with ambiguities.

## Known issues
There are some small issues and incomplete edge cases. Here are the ones
I'm currently aware of:
1. A very special case in the lexer is not handled: while identifiers
can't consist solely of numbers, in the identifier part after a `.`,
this is possible (e.g., `1ea10.1` is a table name & column name). This
is not handled yet, and it may be worth checking if all cases in the
identifier part after a `.` are handled correctly.
2. Another very special case in the lexer: while the lexer does support
version comments, such as `/*!80038 ... */` and nested comments within
them, a nested comment within a non-matched version is not supported
(e.g., `SELECT 1 /*!99999 /* */ */`). Additionally, we currently support
only 5-digit version specifiers (`80038`), but 6 digits should probably
work as well (`080038`).
3. Version specifiers are not propagated to the PHP grammar yet, and
versions are not applied in the grammar yet (only in the lexer). This
will be better to bring in together with version-specific test cases.
4. Some rules in the grammar may not have version specifiers, or they
may be incorrect.
5. The `_utf8` underscore charset should be version-dependent (only on
MySQL 5), and maybe some others are too. We can check this by running
`SHOW CHARACTER SET` on different MySQL versions.
6. The PHPized grammar now contains array indexes of the main rules,
while previously they were not listed. It seems there are numeric gaps.
It might be a regression caused when manually parsing the grammar. I
suppose it's an easy fix.
7. Some components need better test coverage (although the E2E 70k query
test suite is pretty good for now).
8. The tests are not run on CI yet.
9. I'm not sure if the new code fully satisfies the plugin PHP version
requirement. We need to check that, e.g., that no PHP 7.1+ features are
used. Not fully sure, but I think there's no lint for the PHP version in
the repo, so we could add it.

This list is mainly for me, in order not to forget these. I will later
port it into a tracking issue with a checklist.

## Updates
Since the thread here is pretty long, here are quick links to the
work-in-progress updates:
- [First update with a MySQL query test
suite.](#157 (comment))
- [Quick update, focusing on
lexer.](#157 (comment))
- [Custom grammar conversion script, preserving version, fixes, and
more.](#157 (comment))
- [Wrap
up](#157 (comment)).

## Next steps

These could be implemented either in follow-up PRs or as updates to this
PR – whichever is more convenient:

* Bring in a comprehensive MySQL queries test suite, similar to [WHATWG
URL test
data](https://github.com/web-platform-tests/wpt/blob/master/url/resources/urltestdata.json)
for parsing URLs. First, just ensure the parser either returns null or
any parse tree where appropriate. Then, once we have more advanced tree
processing, actually assert the parser outputs the expected query
structures.
* Create a `MySQLOnSQLite` database driver to enable running MySQL
queries on SQLite. Read [this
comment](#106 (comment))
for more context. Use any method that's convenient for generating SQLite
queries. Feel free to restructure and redo any APIs proposed in this PR.
Be inspired by the idea we may build a `MySQLOnPostgres` driver one day,
but don't actually build any abstractions upfront. Make the driver
generic so it can be used without WordPress. Perhaps it could implement
a PDO driver interface?
* Port MySQL features already supported by the SQLite database
integration plugin to the new `MySQLOnSQLite` driver. For example,
`SQL_CALC_FOUND_ROWS` option or the `INTERVAL` syntax.
* Run the SQLite database integration plugin test suite on the new
`MySQLOnSQLite` driver and ensure it passes.
* Rewire this plugin to use the new `MySQLOnSQLite` driver instead of
the current plumbing.

---------

Co-authored-by: Jan Jakes <[email protected]>
adamziel and JanJakes authored Nov 18, 2024
1 parent b5ce5f0 commit ac75e90
Showing 25 changed files with 126,608 additions and 1 deletion.
6 changes: 5 additions & 1 deletion .gitattributes
@@ -5,5 +5,9 @@
composer.json export-ignore
phpcs.xml.dist export-ignore
phpunit.xml.dist export-ignore
tests/*.php export-ignore
/grammar-tools export-ignore
/tests export-ignore
/wip export-ignore
/wp-includes/mysql export-ignore
/wp-includes/parser export-ignore
wp-includes/sqlite/class-wp-sqlite-crosscheck-db.php export-ignore
1 change: 1 addition & 0 deletions composer.json
@@ -11,6 +11,7 @@
"php": ">=7.0"
},
"require-dev": {
"ext-mbstring": "*",
"dealerdirect/phpcodesniffer-composer-installer": "^0.7.0",
"squizlabs/php_codesniffer": "^3.7",
"wp-coding-standards/wpcs": "^3.1",