[Data Liberation] EBNF processor #1981

adamziel · 2024-11-03T16:04:25Z

I started drafting an EBNF notation processor to potentially enable building parsers based on a grammar file, similarly to the MySQL parser (although that one is based on ANTLR grammar). I don't have any specific action for this issue, I only wanted to dump the code somewhere it would be searchable in the future. CC @JanJakes

<?php

class EBNF_Processor {
    private $ebnf;
    private $bytes_already_parsed = 0;
    private $last_error;

    private $rule_name;
    private $rules = [];

    private $state = self::STATE_READY;

    const STATE_READY = 0;
    const STATE_FINISHED = 1;

    public function __construct($ebnf) {
        $this->ebnf = $ebnf;
    }

    public function get_last_error() {
        return $this->last_error;
    }

    public function is_finished() {
        return $this->state === self::STATE_FINISHED;
    }

    public function next_rule() {
        if($this->is_finished() || $this->get_last_error()) {
            return false;
        }

        while(true) {
            if(false === $this->skip_comments()) {
                return false;
            }
            if(false === $this->parse_rule_name()) {
                return false;
            }
         
            die();
        }

        $this->state = self::STATE_FINISHED;
        return false;
    }

    private function parse_branches() {
        $at = $this->bytes_already_parsed;
        if($at >= strlen($this->ebnf)) {
            $this->last_error = 'Unexpected end of input';
            return false;
        }
        switch($this->ebnf[$at]) {
            case '(':
                break;
            case '"':
            case "'":
                break;
        }
    }

    private function skip_whitespace($at) {
        return strspn($this->ebnf, " \t", $at);
    }

    private function skip_comments() {
        while(true) {
            $at = $this->bytes_already_parsed;
            if($at + 2 >= strlen($this->ebnf)) {
                return false;
            }

            if(
                $this->ebnf[$at] === "/" &&
                $this->ebnf[$at + 1] === "/"
            ) {
                // Skip comment
                $this->bytes_already_parsed = strpos($this->ebnf, "\n", $at) + 1;
            } else {
                break;
            }
        }
    }

    private function parse_rule_name() {
        $at = $this->bytes_already_parsed;
        // Skip whitespace at the beginning of the rule
        $at += strspn($this->ebnf, " \t", $at);

        $rule_name_length = strspn($this->ebnf, "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", $at);
        if( 0 === $rule_name_length ) {
            $this->last_error = 'Invalid rule name';
            return false;
        }
        $this->rule_name = substr($this->ebnf, $at, $rule_name_length);
        $at += $rule_name_length;
        $at += $this->skip_whitespace($at);

        // Skip past the '::='
        if( strlen($this->ebnf) <= $at + 3 ) {
            $this->last_error = 'Unexpected end of input';
            return false;
        }
        if( $this->ebnf[$at] !== ':' || $this->ebnf[$at + 1] !== ':' || $this->ebnf[$at + 2] !== '=' ) {
            $this->last_error = 'Expected "::="';
            return false;
        }
        $at += 3;

        $this->bytes_already_parsed = $at;
    }
}

JanJakes · 2024-11-04T10:37:04Z

@adamziel Nice! It would be great to have that, and I think it's better to use the EBNF syntax than ANTLR because the ANTLR format can misleadingly imply the usage of a specific toolset. Having that, we could then add a custom ENBF-based grammar toolset for analyzing, converting, and compressing the grammars.

I actually think we could switch to EBNF in WordPress/sqlite-database-integration#157 because I see us more likely to maintain our own MySQL grammar rather than trying to synchronize if from MySQL Workbench. That said, we'd need support for parametrization (conditionally enabling or disabling some rules) to make it MySQL version aware. I don't think EBNF supports that out of the box, but it seems this could help us: https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form#Extensibility

adamziel · 2024-11-04T11:14:30Z

It would be nice to have both ANTLR and EBNF support! I only started exploring this to see how easily we could parse Markdown. I ended up using a parser library for now but I'd like to revisit this eventually and either .g4 or .ebnf seem fine.

JanJakes · 2024-11-04T11:34:21Z

@adamziel We could do that too! After all, the basic syntax is very similar. In any case, having a custom lean grammar toolset would be great, as the existing tools we've found so far weren't great.

JanJakes · 2024-11-04T13:56:34Z

@adamziel One more open question here is lexing. Each new parser will either need a manually written lexer (some could be fairly simple), or we'd need to build a more generic one (maybe similar to Phlexy by @nikic).

adamziel · 2024-11-04T15:28:17Z

We could support character ranges and run the parser directly on the input string

github-project-automation bot added this to Playground Board Nov 3, 2024

github-project-automation bot moved this to Inbox in Playground Board Nov 3, 2024

adamziel removed this from Playground Board Nov 3, 2024

JanJakes mentioned this issue Nov 4, 2024

Advanced MySQL compatibility WordPress/sqlite-database-integration#162

Open

38 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Data Liberation] EBNF processor #1981

[Data Liberation] EBNF processor #1981

adamziel commented Nov 3, 2024

JanJakes commented Nov 4, 2024

adamziel commented Nov 4, 2024 •

edited

Loading

JanJakes commented Nov 4, 2024

JanJakes commented Nov 4, 2024 •

edited

Loading

adamziel commented Nov 4, 2024

[Data Liberation] EBNF processor #1981

[Data Liberation] EBNF processor #1981

Comments

adamziel commented Nov 3, 2024

JanJakes commented Nov 4, 2024

adamziel commented Nov 4, 2024 • edited Loading

JanJakes commented Nov 4, 2024

JanJakes commented Nov 4, 2024 • edited Loading

adamziel commented Nov 4, 2024

adamziel commented Nov 4, 2024 •

edited

Loading

JanJakes commented Nov 4, 2024 •

edited

Loading