Advanced MySQL compatibility #162

JanJakes · 2024-11-04T10:23:56Z

This is a tracking issue for introducing a new MySQL query parser and an AST-based SQLite driver to significantly improve MySQL compatibility and expand WordPress plugin support.

I think the project can be represented in roughly the following phases:

Initial version of the MySQL parser — reasonably fast, small, and correct ("make it work").
Initial version of the SQLite driver — reach parity with the current test suite ("make it work").
Advanced MySQL support in the SQLite driver — see also test suites below ("make it right").
Advanced MySQL parser — add version support, verify against MySQL server grammar, fix edge cases ("make it right").
Advanced tooling — implement a custom parser for ANTLR or ENBF, add custom grammar tooling ("make it right").
Performance optimizations — explore possibilities to further optimize the new parser ("make it fast").

Here's a list of tasks and issues:

Core

Exhaustive MySQL Parser #157
AST-based SQLite driver #164 — match current MySQL compatibility
SQLite driver — improve compatibility to satisfy multiple extensive test suites (see below)

Test suites

SQL errors when activating most popular WP plugins: Playground Tester → SQL errors
WP unit tests: Pass WordPress unit tests wordpress-playground#111 (comment)
Statically extracted queries from WP plugins: CI: Monitor support for all the SQL queries used by all the WordPress plugins #159
Supporting all WP CLI commands: Some of the WP-CLI commands don't work wordpress-playground#1618 (comment)

Advanced parser and tooling

MySQL server version support in the new parser
A generic grammar parser (ANTLR or [Data Liberation] EBNF processor wordpress-playground#1981)

Related issues

Here's a list of issues that are likely related. This may need further review and triage:

The text was updated successfully, but these errors were encountered:

xnau · 2024-11-09T02:02:33Z

I don't see this in your list, but while running the plugin I'm developing in Playground, I'm getting a lot of syntax errors due to keywords in the queries, which SQLite doesn't tolerate.

For example

SELECT v.name, g.title AS grouptitle, g.id, g.mode FROM wp_participants_database_fields v INNER JOIN wp_participants_database_groups g ON v.group = g.name ORDER BY v.id

The keyword "group" needs to be escaped for SQLite, while MariaDb is fine with it.

SELECT v.name, g.title AS grouptitle, g.id, g.mode FROM wp_participants_database_fields v INNER JOIN wp_participants_database_groups g ON v.`group` = g.name ORDER BY v.id

JanJakes · 2024-11-13T16:01:11Z

@xnau Thanks for the report! Indeed, some keywords that are non-reserved in MySQL can be reserved in SQLite. We can address this on a per-keyword basis, or by quoting all identifiers.

@adamziel

## Context This PR ships an exhaustive MySQL **lexer** and **parser** that produce a MySQL query AST. This is the first step to significantly improve MySQL compatibility and expand WordPress plugin support on SQLite. It's an easier, more stable, and an easier to maintain method than the current token processing. It will also dramatically improve WordPress Playground experience – database integration is the single largest source of issues. This PR is part of the [Advanced MySQL support project](#162). See the [MySQL parser proposal](#106 (comment)) for additional context. ## This PR ships 1. A **MySQL lexer**, adapted from the AI-generated one by @adamziel. It's over 3x smaller and close to 2x faster. 2. A **MySQL grammar** written in ANTLR v4 format, adapted from the [MySQL Workbench grammar](https://github.com/mysql/mysql-workbench/blob/8.0.38/library/parsers/grammars/MySQLParser.g4) by adding and fixing some cases and reordering some rules. 3. A **script to factor, convert, and compress the grammar** to a PHP array. 4. A **dynamic recursive parser** implemented by @adamziel. 5. A **script to extract tests** from the MySQL repository. 6. A **test suite of almost 70k queries**. 7. WIP **SQLite driver** by @adamziel, a demo and foundation for the next phase. At the moment, all the new files are omitted from the plugin build, so they have no effect on production whatsoever. ## Running tests The lexer & parser tests suite is not yet integrated into the CI and existing test commands. To run the tests, use: ```php php tests/parser/run-lexer-tests.php php tests/parser/run-parser-tests.php ``` This will lex / lex & parse all the ~70k queries. ## Implementation ### Parser A simple recursive parser to transform `(token stream, grammar) => parse tree`. In this PR, we use MySQL tokens and MySQL grammar, but the same parser could also support XML, IMAP, many other grammars (as long as they have some specific properties). The `parse_recursive()` method is just 100 lines of code (excluding comments). All of the parsing rules are provided by the grammar. ### run-mysql-driver.php A quick and dirty implementation of what a `MySQL parse tree ➔ SQLite` database driver could look like. It easily supports `WITH` and `UNION` queries that would be really difficult to implement the current SQLite integration plugin. The tree transformation is an order of magnitude easier to read, expand, and maintain than the current implementation. I stand by this, even though the temporary `ParseTreeTools`/`SQLiteTokenFactory` API included in this PR seems annoying, and I'd like to ship something better than that. Here's a glimpse: ```php function translateQuery($subtree, $rule_name=null) { if(is_token($subtree)) { $token = $subtree; switch ($token->type) { case MySQLLexer::EOF: return new SQLiteExpression([]); case MySQLLexer::IDENTIFIER: return SQLiteTokenFactory::identifier( SQLiteTokenFactory::identifierValue($token) ); default: return SQLiteTokenFactory::raw($token->text); } } switch($rule_name) { case 'indexHintList': // SQLite doesn't support index hints. Let's // skip them. return null; case 'fromClause': // Skip `FROM DUAL`. We only care about a singular // FROM DUAL statement, as FROM mytable, DUAL is a syntax // error. if( ParseTreeTools::hasChildren($ast, MySQLLexer::DUAL_SYMBOL) && !ParseTreeTools::hasChildren($ast, 'tableReferenceList') ) { return null; } case 'functionCall': $name = $ast[0]['pureIdentifier'][0]['IDENTIFIER'][0]->text; return translateFunctionCall($name, $ast[0]['udfExprList']); } } ``` ## Technical details ### MySQL Grammar We use the [MySQL workbench grammar](https://github.com/mysql/mysql-workbench/blob/8.0/library/parsers/grammars/MySQLParser.g4), manually adapted, modified, and fixed, and converted from ANTLR4 format to a PHP array. The grammar conversion pipeline is done by `convert-grammar.php` and goes like this: 1. Parse MySQLParser.g4 grammar into a PHP tree. 2. Flatten the grammar so that any nested rules become top-level and are referenced by generated names. This factors compound rules into separate rules, e.g. `query ::= SELECT (ALL | DISTINCT)` becomes `query ::= select %select_fragment0` and `%select_fragment0 ::= ALL | DISTINCT`. 3. Expand `*`, `+`, `?` modifiers into separate, right-recursive rules. For example, `columns ::= column (',' column)*` becomes `columns ::= column columns_rr` and `columns_rr ::= ',' column | ε`. 6. Compress and export the grammar as a PHP array. It replaces all string names with integers and ships an int->string map to reduce the file size. The `mysql-grammar.php` file size is ~70kb in size, which is small enough. The parser can handle about 1000 complex SELECT queries per second on a MacBook Pro. It only took a few easy optimizations to go from 50/seconds to 1000/second. There's a lot of further optimization opportunities once we need more speed. We could factor the grammar in different ways, explore other types of lookahead tables, or memoize the matching results per token. However, I don't think we need to do that in the short term. If we spend enough time factoring the grammar, we could potentially switch to a LALR(1) parser and cut most time spent on dealing with ambiguities. ## Known issues There are some small issues and incomplete edge cases. Here are the ones I'm currently aware of: 1. A very special case in the lexer is not handled — While identifiers can't consist solely of numbers, in the identifier part after a `.`, this is possible (e.g., `1ea10.1` is a table name & column name). This is not handled yet, and it may be worth checking if all cases in the identifier part after a `.` are handled correctly. 2. Another very special case in the lexer — While the lexer does support version comments, such as `/*!80038 ... /` and nested comments within them, a nested comment within a non-matched version is not supported (e.g., `SELECT 1 /*!99999 /* */ */`). Additionally, we currently support only 5-digit version specifiers (`80038`), but 6 digits should probably work as well (`080038`). 3. Version specifiers are not propagated to the PHP grammar yet, and versions are not applied in the grammar yet (only in the lexer). This will be better to bring in together with version-specific test cases. 4. Some rules in the grammar may not have version specifiers, or they may be incorrect. 7. The `_utf8` underscore charset should be version-dependent (only on MySQL 5), and maybe some others are too. We can check this by `SHOW CHARACTER SET` on different MySQL versions. 8. The PHPized grammar now contains array indexes of the main rules, while previously they were not listed. It seems there are numeric gaps. It might be a regression caused when manually parsing the grammar. I suppose it's an easy fix. 9. Some components need better test coverage (although the E2E 70k query test suite is pretty good for now). 10. The tests are not run on CI yet. 11. I'm not sure if the new code fully satisfies the plugin PHP version requirement. We need to check that — e.g., that there are no PHP 7.1 features used. Not fully sure, but I think there's no lint for PHP version in the repo, so we could add it. This list is mainly for me, in order not to forget these. I will later port it into a tracking issue with a checklist. ## Updates Since the thread here is pretty long, here are quick links to the work-in-progress updates: - [First update with a MySQL query test suite.](#157 (comment)) - [Quick update, focusing on lexer.](#157 (comment)) - [Custom grammer conversion script, preserving version, fixes, and more.](#157 (comment)) - [Wrap up](#157 (comment)). ## Next steps These could be implemented either in follow-up PRs or as updates to this PR – whichever is more convenient: * Bring in a comprehensive MySQL queries test suite, similar to [WHATWG URL test data](https://github.com/web-platform-tests/wpt/blob/master/url/resources/urltestdata.json) for parsing URLs. First, just ensure the parser either returns null or any parse tree where appropriate. Then, once we have more advanced tree processing, actually assert the parser outputs the expected query structures. * Create a `MySQLOnSQLite` database driver to enable running MySQL queries on SQLite. Read [this comment](#106 (comment)) for more context. Use any method that's convenient for generating SQLite queries. Feel free to restructure and redo any APIs proposed in this PR. Be inspired by the idea we may build a `MySQLOnPostgres` driver one day, but don't actually build any abstractions upfront. Make the driver generic so it can be used without WordPress. Perhaps it could implement a PDO driver interface? * Port MySQL features already supported by the SQLite database integration plugin to the new `MySQLOnSQLite` driver. For example, `SQL_CALC_FOUND_ROWS` option or the `INTERVAL` syntax. * Run SQLite database integration plugin test suite on the new `MySQLOnSQLite` driver and ensure they pass. * Rewire this plugin to use the new `MySQLOnSQLite` driver instead of the current plumbing. --------- Co-authored-by: Jan Jakes <[email protected]>

JanJakes self-assigned this Nov 4, 2024

JanJakes mentioned this issue Nov 4, 2024

Exhaustive MySQL Parser #157

Merged

JanJakes mentioned this issue Nov 14, 2024

AST-based SQLite driver #164

Draft

aristath mentioned this issue Nov 20, 2024

Exhaustive MySQL Parser (#157) aristath/sqlite-database-integration#2

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Advanced MySQL compatibility #162

Advanced MySQL compatibility #162

JanJakes commented Nov 4, 2024 •

edited

Loading

xnau commented Nov 9, 2024 •

edited

Loading

JanJakes commented Nov 13, 2024

Advanced MySQL compatibility #162

Advanced MySQL compatibility #162

Comments

JanJakes commented Nov 4, 2024 • edited Loading

Core

Test suites

Advanced parser and tooling

Related issues

xnau commented Nov 9, 2024 • edited Loading

JanJakes commented Nov 13, 2024

JanJakes commented Nov 4, 2024 •

edited

Loading

xnau commented Nov 9, 2024 •

edited

Loading