Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data Liberation] EBNF processor #1981

Open
adamziel opened this issue Nov 3, 2024 · 5 comments
Open

[Data Liberation] EBNF processor #1981

adamziel opened this issue Nov 3, 2024 · 5 comments

Comments

@adamziel
Copy link
Collaborator

adamziel commented Nov 3, 2024

I started drafting an EBNF notation processor to potentially enable building parsers based on a grammar file, similarly to the MySQL parser (although that one is based on ANTLR grammar). I don't have any specific action for this issue, I only wanted to dump the code somewhere it would be searchable in the future. CC @JanJakes

<?php

class EBNF_Processor {
    private $ebnf;
    private $bytes_already_parsed = 0;
    private $last_error;

    private $rule_name;
    private $rules = [];

    private $state = self::STATE_READY;

    const STATE_READY = 0;
    const STATE_FINISHED = 1;

    public function __construct($ebnf) {
        $this->ebnf = $ebnf;
    }

    public function get_last_error() {
        return $this->last_error;
    }

    public function is_finished() {
        return $this->state === self::STATE_FINISHED;
    }

    public function next_rule() {
        if($this->is_finished() || $this->get_last_error()) {
            return false;
        }

        while(true) {
            if(false === $this->skip_comments()) {
                return false;
            }
            if(false === $this->parse_rule_name()) {
                return false;
            }
         
            die();
        }

        $this->state = self::STATE_FINISHED;
        return false;
    }

    private function parse_branches() {
        $at = $this->bytes_already_parsed;
        if($at >= strlen($this->ebnf)) {
            $this->last_error = 'Unexpected end of input';
            return false;
        }
        switch($this->ebnf[$at]) {
            case '(':
                break;
            case '"':
            case "'":
                break;
        }
    }

    private function skip_whitespace($at) {
        return strspn($this->ebnf, " \t", $at);
    }

    private function skip_comments() {
        while(true) {
            $at = $this->bytes_already_parsed;
            if($at + 2 >= strlen($this->ebnf)) {
                return false;
            }

            if(
                $this->ebnf[$at] === "/" &&
                $this->ebnf[$at + 1] === "/"
            ) {
                // Skip comment
                $this->bytes_already_parsed = strpos($this->ebnf, "\n", $at) + 1;
            } else {
                break;
            }
        }
    }

    private function parse_rule_name() {
        $at = $this->bytes_already_parsed;
        // Skip whitespace at the beginning of the rule
        $at += strspn($this->ebnf, " \t", $at);

        $rule_name_length = strspn($this->ebnf, "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", $at);
        if( 0 === $rule_name_length ) {
            $this->last_error = 'Invalid rule name';
            return false;
        }
        $this->rule_name = substr($this->ebnf, $at, $rule_name_length);
        $at += $rule_name_length;
        $at += $this->skip_whitespace($at);

        // Skip past the '::='
        if( strlen($this->ebnf) <= $at + 3 ) {
            $this->last_error = 'Unexpected end of input';
            return false;
        }
        if( $this->ebnf[$at] !== ':' || $this->ebnf[$at + 1] !== ':' || $this->ebnf[$at + 2] !== '=' ) {
            $this->last_error = 'Expected "::="';
            return false;
        }
        $at += 3;

        $this->bytes_already_parsed = $at;
    }
}
@JanJakes
Copy link

JanJakes commented Nov 4, 2024

@adamziel Nice! It would be great to have that, and I think it's better to use the EBNF syntax than ANTLR because the ANTLR format can misleadingly imply the usage of a specific toolset. Having that, we could then add a custom ENBF-based grammar toolset for analyzing, converting, and compressing the grammars.

I actually think we could switch to EBNF in WordPress/sqlite-database-integration#157 because I see us more likely to maintain our own MySQL grammar rather than trying to synchronize if from MySQL Workbench. That said, we'd need support for parametrization (conditionally enabling or disabling some rules) to make it MySQL version aware. I don't think EBNF supports that out of the box, but it seems this could help us: https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form#Extensibility

@adamziel
Copy link
Collaborator Author

adamziel commented Nov 4, 2024

It would be nice to have both ANTLR and EBNF support! I only started exploring this to see how easily we could parse Markdown. I ended up using a parser library for now but I'd like to revisit this eventually and either .g4 or .ebnf seem fine.

@JanJakes
Copy link

JanJakes commented Nov 4, 2024

@adamziel We could do that too! After all, the basic syntax is very similar. In any case, having a custom lean grammar toolset would be great, as the existing tools we've found so far weren't great.

@JanJakes
Copy link

JanJakes commented Nov 4, 2024

@adamziel One more open question here is lexing. Each new parser will either need a manually written lexer (some could be fairly simple), or we'd need to build a more generic one (maybe similar to Phlexy by @nikic).

@adamziel
Copy link
Collaborator Author

adamziel commented Nov 4, 2024

We could support character ranges and run the parser directly on the input string

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants