Olivier Duhart edited this page Feb 23, 2018 · 3 revisions

Regex lexer

This lexer is a poor man's regex-based lexer, inspired by this post, so it is not very efficient. Indeed, when it is used, the lexer is the bottleneck of the whole lexer/parser chain. But it is really flexible and easy to use.

The idea of a regex lexer is to associate a matching regex with every lexeme. So a lexeme takes 3 parameters:

  • string regex: a regular expression that captures the lexeme
  • boolean isSkippable (optional, default is false): true if the lexeme must be ignored (whitespace, for example)
  • boolean isLineending (optional, default is false): true if the lexeme matches a line end (to allow line counting while lexing)
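
To make the role of the three parameters concrete, here is a minimal sketch of how a regex-driven tokenizer could use them. It is not the library's actual implementation: `LexemeRule`, `SketchLexer`, `Tokenize` and `Example` are hypothetical names invented for this illustration, and the "first matching rule wins" strategy is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical type for illustration only: one lexeme definition.
public sealed class LexemeRule
{
    public string Name { get; }
    public Regex Pattern { get; }
    public bool IsSkippable { get; }
    public bool IsLineEnding { get; }

    public LexemeRule(string name, string regex, bool isSkippable = false, bool isLineEnding = false)
    {
        Name = name;
        // \G forces the match to start exactly at the index given to Regex.Match.
        Pattern = new Regex(@"\G(?:" + regex + ")");
        IsSkippable = isSkippable;
        IsLineEnding = isLineEnding;
    }
}

// Hypothetical tokenizer: tries the rules in order, first match wins (an assumption).
public static class SketchLexer
{
    public static List<(string Name, string Value, int Line)> Tokenize(
        string source, IReadOnlyList<LexemeRule> rules)
    {
        var tokens = new List<(string Name, string Value, int Line)>();
        int index = 0;
        int line = 1;

        while (index < source.Length)
        {
            LexemeRule matched = null;
            Match match = null;
            foreach (var rule in rules)
            {
                var candidate = rule.Pattern.Match(source, index);
                if (candidate.Success && candidate.Length > 0)
                {
                    matched = rule;
                    match = candidate;
                    break;
                }
            }

            if (matched == null)
                throw new FormatException($"Unexpected character '{source[index]}' at line {line}");

            // Skippable lexemes (whitespace, for example) are consumed but not emitted.
            if (!matched.IsSkippable)
                tokens.Add((matched.Name, match.Value, line));

            // Simplification: one line-ending lexeme advances the counter by one line,
            // even if the regex swallowed several consecutive line breaks.
            if (matched.IsLineEnding)
                line++;

            index += match.Length;
        }
        return tokens;
    }

    // Usage example: tokenizes a two-line input with a handful of rules.
    public static List<(string Name, string Value, int Line)> Example()
    {
        var rules = new List<LexemeRule>
        {
            new LexemeRule("DOUBLE", "[0-9]+\\.[0-9]+"),
            new LexemeRule("INT", "[0-9]+"),
            new LexemeRule("PLUS", "\\+"),
            new LexemeRule("WS", "[ \\t]+", isSkippable: true),
            new LexemeRule("EOL", "[\\n\\r]+", isSkippable: true, isLineEnding: true),
        };
        return Tokenize("1 + 2.5\n3", rules);
    }
}
```

In this sketch every rule is tried again at each position, which is one plausible reason a regex-based lexer is slow compared with a hand-written or automaton-based one. Note also that the DOUBLE rule is listed before INT, so that under the first-match assumption `2.5` is not split into an integer followed by an unexpected `.`.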

Full example, for a mathematical parser (regex based)

public enum ExpressionToken
{
    // floating-point number
    [Lexeme("[0-9]+\\.[0-9]+")]
    DOUBLE = 1,

    // integer
    [Lexeme("[0-9]+")]
    INT = 3,

    // the + operator
    [Lexeme("\\+")]
    PLUS = 4,

    // the - operator
    [Lexeme("\\-")]
    MINUS = 5,

    // the * operator
    [Lexeme("\\*")]
    TIMES = 6,

    // the / operator
    [Lexeme("\\/")]
    DIVIDE = 7,

    // a left parenthesis (
    [Lexeme("\\(")]
    LPAREN = 8,

    // a right parenthesis )
    [Lexeme("\\)")]
    RPAREN = 9,

    // whitespace (skippable)
    [Lexeme("[ \\t]+", true)]
    WS = 12,

    // end of line (skippable, counted as a line ending)
    [Lexeme("[\\n\\r]+", true, true)]
    EOL = 14
}