CSLY is a parser generator halfway between parser combinators and parser generators like ANTLR.

I needed a solution for building parsers and found every existing solution either:
- too complicated to integrate, requiring an additional build step (as with ANTLR), or
- too far from the classical BNF notation (as with parser combinators like Sprache or Eto.Parse). These tools are great, but I don't feel comfortable with them.

CSLY is highly inspired by the Python lex-yacc library (PLY).
The parser and lexer implementations fully reside in a single class. The class describes:
- every token definition for the lexer
- every grammar rule and its associated action to transform the syntactic tree.
Install from the NuGet gallery GUI or with the Package Manager Console using the following command:

```
Install-Package sly
```

or with the .NET Core CLI:

```
dotnet add package sly
```
For now the lexer is a poor man's regex-based lexer inspired by this post. It is not very efficient: the lexer is the bottleneck of the whole lexer/parser pipeline. It may be improved in the future.
The full lexer configuration is done in a C# enum. The enum lists all the possible tokens (no special constraint here except public visibility). Each enum value has a `[Lexeme]` attribute to mark it as a lexeme. The `Lexeme` attribute takes 3 parameters:

- `string regex`: a regular expression that captures the lexeme
- `boolean isSkippable` (optional, default is `false`): true if the lexeme must be ignored (whitespace, for example)
- `boolean isLineEnding` (optional, default is `false`): true if the lexeme matches a line end (to allow line counting while lexing)
The lexer can be used apart from the parser. It provides a method that returns an `IEnumerable<Token<T>>` (where T is the token enum) from a string:

```c#
IList<Token<T>> tokens = Lexer.Tokenize(source).ToList<Token<T>>();
```

You can also build only a lexer using:

```c#
ILexer<ExpressionToken> lexer = LexerBuilder.BuildLexer<ExpressionToken>();
var tokens = lexer.Tokenize(source).ToList();
```
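For example, here is a quick sketch of inspecting the produced tokens (assuming `Token<T>` exposes the matched text through a `Value` property, alongside the `TokenID` used later in this document):

```c#
foreach (Token<ExpressionToken> token in tokens)
{
    // TokenID is the matched enum value, Value is the matched text
    Console.WriteLine($"{token.TokenID} : {token.Value}");
}
```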
Here is the full lexer definition used by the expression example below:

```c#
public enum ExpressionToken
{
    // float number
    [Lexeme("[0-9]+\\.[0-9]+")]
    DOUBLE = 1,

    // integer
    [Lexeme("[0-9]+")]
    INT = 3,

    // the + operator
    [Lexeme("\\+")]
    PLUS = 4,

    // the - operator
    [Lexeme("\\-")]
    MINUS = 5,

    // the * operator
    [Lexeme("\\*")]
    TIMES = 6,

    // the / operator
    [Lexeme("\\/")]
    DIVIDE = 7,

    // a left parenthesis (
    [Lexeme("\\(")]
    LPAREN = 8,

    // a right parenthesis )
    [Lexeme("\\)")]
    RPAREN = 9,

    // a whitespace
    [Lexeme("[ \\t]+", true)]
    WS = 12,

    [Lexeme("[\\n\\r]+", true, true)]
    EOL = 14
}
```
A parser is of type `Parser<TIn,TOut>` where:
- TIn is the token enum type, as seen before
- TOut is the type of object produced by the parser. Classically it will be an Abstract Syntax Tree (AST), or it may be an int for an expression parser.

The grammar defining the parser is declared using the C# attribute `[Production("some grammar rule")]` mapped to methods (in the same class used for the lexer). The rules follow the classical BNF notation. A terminal notation must exactly match (case sensitive) an enum value. Once the syntactic tree is built, the methods of the rules are used as a syntactic tree visitor. Each method takes as many parameters as its rule has clauses. Each parameter is typed according to the value produced by the clause:
- for a terminal: the `Token<T>` corresponding to the token
- for a non terminal: the result of the evaluation of the non terminal, i.e. the value returned by the matching method. As the parser output is typed (TOut, as seen before), the result of an evaluation for a non terminal is necessarily of type TOut.

A mathematical parser evaluates a mathematical expression: it takes a string as input and returns a numeric value. So each method of the parser returns a numeric value (an int, to keep things simple):
[Production("primary: INT")]
public int Primary(Token<ExpressionToken> intToken)
{
return intToken.IntValue;
}
[Production("primary: LPAREN expression RPAREN")]
public int Group(object discaredLParen, int groupValue ,object discardedRParen)
{
return groupValue;
}
[Production("expression : term PLUS expression")]
[Production("expression : term MINUS expression")]
public int Expression(int left, Token<ExpressionToken> operatorToken, int right)
{
object result = 0;
switch (operatorToken.TokenID)
{
case ExpressionToken.PLUS:
{
result = left + right;
break;
}
case ExpressionToken.MINUS:
{
result = left - right;
break;
}
default:
{
break;
}
}
return result;
}
As we've seen above, a parser is declared in a single class whose methods address:
- lexer configuration (with the `[LexerConfiguration]` attribute)
- grammar rules (with the `[Production("")]` attribute)
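As an illustration, here is a minimal sketch of such a class for the expression example (it only groups the production methods shown above; the token definitions live in the `ExpressionToken` enum shown earlier, and the real sample contains more rules):

```c#
public class ExpressionParser
{
    [Production("primary: INT")]
    public int Primary(Token<ExpressionToken> intToken)
    {
        return intToken.IntValue;
    }

    // ... the other [Production] methods shown above go here
}
```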
Once the class with all its methods has been written, it can be used to build the effective parser instance by calling `ParserBuilder.BuildParser`. The builder method takes 3 parameters:
- an instance of the class containing the lexer and parser definition
- the kind of parser. Currently only recursive descent parsers are available; this implementation is limited to LL grammars by construction (no left recursion). There are 2 possible types: `ParserType.LL_RECURSIVE_DESCENT` and `ParserType.EBNF_LL_RECURSIVE_DESCENT` (for the EBNF notation described further below).
- the root rule for the parser.

The resulting parser is typed according to the token type and the output type:
```c#
ExpressionParser expressionParserDefinition = new ExpressionParser();
// note the typing:
// ExpressionToken is the token enum type
// int is the type of a parse evaluation
Parser<ExpressionToken, int> Parser = ParserBuilder.BuildParser<ExpressionToken, int>(expressionParserDefinition,
                                                                                      ParserType.LL_RECURSIVE_DESCENT,
                                                                                      "expression");
```
Then calling `Parser.Parse("some source code")` will return the evaluation of the syntax tree. The parser returns a `ParseResult` instance containing either the evaluation value or a list of errors:
```c#
string expression = "1 + 1";
ParseResult<ExpressionToken> r = Parser.Parse(expression);
if (!r.IsError && r.Result != null && r.Result is int)
{
Console.WriteLine($"result of {expression} is {(int)r.Result}");
}
else
{
if (r.Errors !=null && r.Errors.Any())
{
// display errors
r.Errors.ForEach(error => Console.WriteLine(error.ErrorMessage));
}
}
```
Once built, a parser exposes:
- a main API through the `Parse(string content)` method (chaining lexical analysis, syntax parsing and finally the calls to your parsing methods)
- the lexer through the `Lexer` property
- the syntax parser through the `SyntaxParser` property (whose type is an `ISyntaxParser`)
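For instance, here is a quick sketch of using these members directly (the property names are those listed above; the exact signatures on them are assumptions):

```c#
// run only the lexical analysis through the parser's lexer
var tokens = Parser.Lexer.Tokenize("1 + 1").ToList();

// access the underlying syntax parser
var syntaxParser = Parser.SyntaxParser;
```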
Full examples are available under:
- jsonparser: a JSON parser
- expressionParser: a mathematical expression parser
- the tests also present a simple EBNF grammar in EBNFTests
You can now use the EBNF notation:
- '*' to repeat the same terminal or non terminal 0 or more times
- '+' to repeat the same terminal or non terminal once or more

For repeated elements, the values passed to `[Production]` methods are:
- `List<TOut>` for a repeated non terminal
- `List<Token<TIn>>` for a repeated terminal
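As a sketch of how repeated clauses are consumed (the `statements` and `integers` rules below are hypothetical, not taken from the repository samples; INT is the token defined earlier):

```c#
// hypothetical rule: zero or more "statement" non terminals
[Production("statements : statement*")]
public int Statements(List<int> statementValues)
{
    // a repeated non terminal is received as a List<TOut> (here List<int>)
    return statementValues.Count;
}

// hypothetical rule: one or more INT terminals
[Production("integers : INT+")]
public int Integers(List<Token<ExpressionToken>> ints)
{
    // a repeated terminal is received as a List<Token<TIn>>
    return ints.Count;
}
```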
See EBNFJsonParser.cs for a complete EBNF JSON parser.

The EBNF notation has been implemented in CSLY using the BNF notation: the EBNF parser builder is itself built with the BNF parser builder. Incidentally, the EBNF parser builder is a good and complete example of a BNF parser: see RuleParser.cs.