Olivier Duhart edited this page Feb 21, 2018 · 22 revisions

Generic Lexer

The generic lexer aims to solve the performance issues of the Regex Lexer. The idea is to start from a limited set of classical lexemes and to refine this set to fit your needs. These lexemes are recognized by a Finite State Machine, which is far more efficient than looping through a set of regexes.
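To illustrate why a state machine is cheaper than trying regexes one after another, here is a minimal sketch of a hand-written FSM that reads the input once and recognizes identifiers, ints and doubles. It is not the library's implementation: `SketchToken`, `FsmSketch` and the state names are hypothetical, and "alpha" is assumed to mean whatever `char.IsLetter` accepts.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names throughout; an illustration of the technique only.
public enum SketchToken { Identifier, Int, Double }

public static class FsmSketch
{
    private enum State { Start, InIdentifier, InInt, InDoubleDot, InDouble }

    // Scans the input once, left to right, switching state on each character.
    public static List<(SketchToken Kind, string Text)> Tokenize(string input)
    {
        var tokens = new List<(SketchToken Kind, string Text)>();
        var state = State.Start;
        int start = 0;

        // Go one position past the end ('\0' sentinel) so the last token is flushed.
        for (int i = 0; i <= input.Length; i++)
        {
            char c = i < input.Length ? input[i] : '\0';
            switch (state)
            {
                case State.Start:
                    start = i;
                    if (char.IsLetter(c)) state = State.InIdentifier;
                    else if (char.IsDigit(c)) state = State.InInt;
                    else if (c != '\0' && !char.IsWhiteSpace(c))
                        throw new FormatException($"unexpected '{c}' at {i}");
                    break;

                case State.InIdentifier:
                    if (!char.IsLetter(c))
                    {
                        tokens.Add((SketchToken.Identifier, input.Substring(start, i - start)));
                        state = State.Start;
                        i--; // re-examine c from the Start state
                    }
                    break;

                case State.InInt:
                    if (c == '.') state = State.InDoubleDot;
                    else if (!char.IsDigit(c))
                    {
                        tokens.Add((SketchToken.Int, input.Substring(start, i - start)));
                        state = State.Start;
                        i--;
                    }
                    break;

                case State.InDoubleDot:
                    // At least one digit is required after the decimal separator.
                    if (char.IsDigit(c)) state = State.InDouble;
                    else throw new FormatException($"digit expected at {i}");
                    break;

                case State.InDouble:
                    if (!char.IsDigit(c))
                    {
                        tokens.Add((SketchToken.Double, input.Substring(start, i - start)));
                        state = State.Start;
                        i--;
                    }
                    break;
            }
        }
        return tokens;
    }
}
```

Calling `FsmSketch.Tokenize("abc 42 3.14")` yields an identifier, an int and a double. Each character is examined a bounded number of times whatever the number of token kinds, whereas a regex-based lexer may try every pattern at every position.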

The basic lexemes are:

  • GenericToken.Identifier : an identifier. From version 2.0.3, Identifier accepts an extra parameter that specifies the identifier pattern:
    • IdentifierType.Alpha : alpha characters only (the default, and the only pattern available before version 2.0.3)
    • IdentifierType.AlphaNum : an alpha char followed by alpha or numeric chars
    • IdentifierType.AlphaNumDash : an alpha or '_' (underscore) char followed by alphanumeric, '-' (minus) or '_' (underscore) chars
  • GenericToken.String : a classical string delimited by double quotes (")
  • GenericToken.Int : an int (i.e. a series of one or more digits)
  • GenericToken.Double : a floating-point number (the decimal separator is a dot '.')
  • GenericToken.KeyWord : a keyword is an identifier with a special meaning, so it comes with the same constraints as GenericToken.Identifier. Here again, performance comes at the price of less flexibility. This lexeme is configurable.
  • GenericToken.SugarToken : a general purpose lexeme with no special constraint except that it must not start with an alpha char. This lexeme is configurable.
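The three identifier patterns listed above can be read as simple character rules. The sketch below spells them out as a predicate, which may help when choosing a pattern. `PatternKind` and `IdentifierPatternSketch` are hypothetical names, not part of the library, and the use of `char.IsLetter` / `char.IsLetterOrDigit` for "alpha" and "alphanumeric" is an assumption (the library may restrict itself to ASCII).

```csharp
using System;
using System.Linq;

// Hypothetical mirror of the three patterns described above.
public enum PatternKind { Alpha, AlphaNum, AlphaNumDash }

public static class IdentifierPatternSketch
{
    // Returns true when text satisfies the rule stated for the given pattern.
    public static bool Matches(PatternKind kind, string text)
    {
        if (string.IsNullOrEmpty(text)) return false;
        char first = text[0];

        switch (kind)
        {
            case PatternKind.Alpha:
                // Alpha characters only.
                return text.All(char.IsLetter);

            case PatternKind.AlphaNum:
                // An alpha char, then alpha or numeric chars.
                return char.IsLetter(first) && text.All(char.IsLetterOrDigit);

            case PatternKind.AlphaNumDash:
                // An alpha or '_' char, then alphanumeric, '-' or '_' chars.
                return (char.IsLetter(first) || first == '_')
                    && text.All(c => char.IsLetterOrDigit(c) || c == '-' || c == '_');

            default:
                throw new ArgumentOutOfRangeException(nameof(kind));
        }
    }
}
```

Under these assumptions, `my-var_2` is accepted only by the AlphaNumDash rule, `var2` by AlphaNum and AlphaNumDash, and `var` by all three.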

The Lexeme attribute offers two constructors for building a generic lexer:

  • static generic lexemes (String, Int, Double and Identifier). This constructor maps a generic token one-to-one to your lexer token. It takes a single parameter, the mapped generic token: [Lexeme(GenericToken.String)]
  • configurable lexemes (KeyWord and SugarToken). This constructor takes 2 parameters:
    • the mapped GenericToken
    • the value of the keyword or sugar token.

Full example for a toy language (generic token based):

    public enum WhileTokenGeneric
    {

        #region keywords 0 -> 19

        [Lexeme(GenericToken.KeyWord,"if")]
        IF = 1,

        [Lexeme(GenericToken.KeyWord, "then")]
        THEN = 2,

        [Lexeme(GenericToken.KeyWord, "else")]
        ELSE = 3,

        [Lexeme(GenericToken.KeyWord, "while")]
        WHILE = 4,

        [Lexeme(GenericToken.KeyWord, "do")]
        DO = 5,

        [Lexeme(GenericToken.KeyWord, "skip")]
        SKIP = 6,

        [Lexeme(GenericToken.KeyWord, "true")]
        TRUE = 7,

        [Lexeme(GenericToken.KeyWord, "false")]
        FALSE = 8,
        [Lexeme(GenericToken.KeyWord, "not")]
        NOT = 9,

        [Lexeme(GenericToken.KeyWord, "and")]
        AND = 10,

        [Lexeme(GenericToken.KeyWord, "or")]
        OR = 11,

        [Lexeme(GenericToken.KeyWord, "print")]
        PRINT = 12,

        #endregion

        #region literals 20 -> 29

        // identifier with the IdentifierType.AlphaNumDash pattern
        [Lexeme(GenericToken.Identifier, IdentifierType.AlphaNumDash)]
        IDENTIFIER = 20,

        [Lexeme(GenericToken.String)]
        STRING = 21,

        [Lexeme(GenericToken.Int)]
        INT = 22,

        #endregion

        #region operators 30 -> 49

        [Lexeme(GenericToken.SugarToken,">")]
        GREATER = 30,

        [Lexeme(GenericToken.SugarToken, "<")]
        LESSER = 31,

        [Lexeme(GenericToken.SugarToken, "==")]
        EQUALS = 32,

        [Lexeme(GenericToken.SugarToken, "!=")]
        DIFFERENT = 33,

        [Lexeme(GenericToken.SugarToken, ".")]
        CONCAT = 34,

        [Lexeme(GenericToken.SugarToken, ":=")]
        ASSIGN = 35,

        [Lexeme(GenericToken.SugarToken, "+")]
        PLUS = 36,

        [Lexeme(GenericToken.SugarToken, "-")]
        MINUS = 37,


        [Lexeme(GenericToken.SugarToken, "*")]
        TIMES = 38,

        [Lexeme(GenericToken.SugarToken, "/")]
        DIVIDE = 39,

        #endregion 

        #region sugar 50 ->

        [Lexeme(GenericToken.SugarToken, "(")]
        LPAREN = 50,

        [Lexeme(GenericToken.SugarToken, ")")]
        RPAREN = 51,

        [Lexeme(GenericToken.SugarToken, ";")]
        SEMICOLON = 52,

    

        EOF = 0

        #endregion

    }