Olivier Duhart edited this page Nov 28, 2019 · 22 revisions

Generic Lexer

The generic lexer aims at solving the performance issues with the Regex Lexer. The idea is to start from a limited set of classical lexemes and to refine this set to fit your needs.

These lexemes are recognized by a finite state machine, which is far more efficient than looping through a set of regexes.
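To make the idea concrete, here is a minimal hand-written state machine that splits text into letter-only identifiers and integers. It is only an illustration of the technique, not the library's actual implementation; the FsmSketch class, its states and the token kind names are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: a tiny state machine, NOT the library's implementation.
public static class FsmSketch
{
    private enum State { Start, InIdentifier, InInt }

    public static List<(string Kind, string Text)> Tokenize(string source)
    {
        var tokens = new List<(string Kind, string Text)>();
        var state = State.Start;
        var begin = 0;

        // Go one position past the end so that the last token gets flushed.
        for (var i = 0; i <= source.Length; i++)
        {
            var c = i < source.Length ? source[i] : '\0';

            // Stay in the current state while the character still fits it.
            if (state == State.InIdentifier && char.IsLetter(c)) continue;
            if (state == State.InInt && char.IsDigit(c)) continue;

            // Otherwise the current token (if any) ends here.
            if (state == State.InIdentifier)
                tokens.Add(("Identifier", source.Substring(begin, i - begin)));
            else if (state == State.InInt)
                tokens.Add(("Int", source.Substring(begin, i - begin)));

            // Choose the next state from the current character alone.
            begin = i;
            if (char.IsLetter(c)) state = State.InIdentifier;
            else if (char.IsDigit(c)) state = State.InInt;
            else state = State.Start; // whitespace and anything else is skipped
        }

        return tokens;
    }
}
```

Each character is examined once and the next state depends only on the current state and character, whereas a regex-based lexer may have to try several expressions at every position.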

Lexer configuration

The lexer can be configured with a [Lexer] attribute, and is available from version 2.4.0.6. The [Lexer] attribute has several properties:

  • IgnoreWS: Ignore whitespace characters. If false, any whitespace occurring in the lexed text must be explicitly handled in the lexer. Default is true.
  • IgnoreEOL: Ignore end of line characters. If false, any end of line characters occurring in the lexed text must be explicitly handled in the lexer. Default is true.
  • WhiteSpace: An array of characters that are considered whitespace if IgnoreWS is true. Default is ' ' (space) and '\t' (tab).
  • KeyWordIgnoreCase: If true, any keywords ([Lexeme(GenericToken.KeyWord, "...")]) are matched ignoring case. That is, the keyword if also matches IF, If, etc. Default is false.
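As a sketch of how these properties might be set, assuming they are ordinary named attribute properties and that the attributes live in the sly.lexer namespace (the ConfiguredToken enum and its members are hypothetical):

```csharp
using sly.lexer; // assumed namespace of the lexer attributes

// Hypothetical token enum: the first three values restate the documented
// defaults, only KeyWordIgnoreCase differs from its default.
[Lexer(IgnoreWS = true,
       IgnoreEOL = true,
       WhiteSpace = new[] { ' ', '\t' },
       KeyWordIgnoreCase = true)]
public enum ConfiguredToken
{
    // With KeyWordIgnoreCase = true, "if", "IF" and "If" should all match.
    [Lexeme(GenericToken.KeyWord, "if")]
    IF = 1,

    [Lexeme(GenericToken.Identifier)]
    IDENTIFIER = 2
}
```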

Basic lexemes

The basic lexemes are :

  • GenericToken.Identifier: An identifier. From version 2.0.3 Identifier accepts an extra parameter to specify an identifier pattern:
    • IdentifierType.Alpha: Only alpha characters (default value, only pattern available before version 2.0.3).
    • IdentifierType.AlphaNum: Starting with an alpha char and then alpha or numeric char.
    • IdentifierType.AlphaNumDash: Starting with an alpha or '_' (underscore) char and then alphanumeric or '-'(minus) or '_' (underscore) char.
    • IdentifierType.Custom: Accepts two parameters; the starting character pattern and the rest character pattern. The pattern string contains 'c' (allowed char) and 'l-u' (allowed char range). If '-' (dash) should be an allowed char, it should be the first character in the pattern. An example that duplicates IdentifierType.AlphaNumDash is [Lexeme(GenericToken.Identifier, IdentifierType.Custom, "_A-Za-z", "-_0-9A-Za-z")]. (From version 2.4.0.6)
  • GenericToken.String: A classical string delimited by double quotes ". See below for more details.
  • GenericToken.Int: An int (i.e. a series of one or more digits).
  • GenericToken.Double: A float number (decimal separator is dot '.').
  • GenericToken.KeyWord: A keyword is an identifier with a special meaning (it comes with the same constraints as GenericToken.Identifier). Here again, performance comes at the price of less flexibility. This lexeme is configurable.
  • GenericToken.SugarToken: A general purpose lexeme with no special constraint, except that it must not start with an alpha char. This lexeme is configurable.
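The pattern syntax of IdentifierType.Custom can be illustrated with a small standalone sketch. It is one possible reading of the rules stated above (single characters, 'l-u' ranges, a leading '-' taken literally), not the library's own parser; the PatternSketch class and its methods are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: one reading of the custom identifier pattern syntax.
public static class PatternSketch
{
    // Expands a pattern such as "-_0-9A-Za-z" into the set of allowed characters.
    public static HashSet<char> Expand(string pattern)
    {
        var allowed = new HashSet<char>();
        var i = 0;

        // A dash in first position is a literal dash, not a range marker.
        if (pattern.Length > 0 && pattern[0] == '-')
        {
            allowed.Add('-');
            i = 1;
        }

        while (i < pattern.Length)
        {
            if (i + 2 < pattern.Length && pattern[i + 1] == '-')
            {
                // 'l-u' range: add every character from l to u inclusive.
                for (int c = pattern[i]; c <= pattern[i + 2]; c++)
                    allowed.Add((char)c);
                i += 3;
            }
            else
            {
                // Single allowed character.
                allowed.Add(pattern[i]);
                i++;
            }
        }

        return allowed;
    }

    // True when the first char fits startPattern and all others fit restPattern.
    public static bool IsIdentifier(string text, string startPattern, string restPattern)
    {
        if (string.IsNullOrEmpty(text)) return false;
        if (!Expand(startPattern).Contains(text[0])) return false;

        var rest = Expand(restPattern);
        for (var i = 1; i < text.Length; i++)
            if (!rest.Contains(text[i])) return false;

        return true;
    }
}
```

With the AlphaNumDash-equivalent patterns "_A-Za-z" and "-_0-9A-Za-z", IsIdentifier accepts my-var_1 and rejects 1abc and -abc, because a digit or a dash is not allowed as the starting character.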

The Lexeme attribute used to build a generic lexer has 2 constructors:

  • Static generic lexemes: this constructor maps a generic token one-to-one to your lexer token. It takes a single parameter, the mapped generic token: [Lexeme(GenericToken.String)]. The static lexemes are String, Int, Double and Identifier.
  • Configurable lexemes (KeyWord and SugarToken): this constructor takes 2 parameters:
    • the mapped GenericToken
    • the value of the keyword or sugar token.
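A short sketch showing both kinds of lexeme side by side, assuming the attributes live in the sly.lexer namespace (the SketchToken enum and its member names are hypothetical):

```csharp
using sly.lexer; // assumed namespace of the lexer attributes

// Hypothetical token enum mixing static and configurable lexemes.
public enum SketchToken
{
    // Static lexemes: the generic token (plus an optional identifier pattern).
    [Lexeme(GenericToken.Identifier, IdentifierType.AlphaNum)]
    IDENTIFIER = 1,

    [Lexeme(GenericToken.Double)]
    DOUBLE = 2,

    // Configurable lexemes: the generic token and the exact text to match.
    [Lexeme(GenericToken.KeyWord, "let")]
    LET = 3,

    [Lexeme(GenericToken.SugarToken, "=")]
    ASSIGN = 4
}
```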

Strings

String lexeme definitions take 2 parameters:

  • a string delimiter char. Default is " (double quote)
  • an escape delimiter char to allow the use of the delimiter char inside a string. Default is \ (backslash). Use of the same char for delimiter and escape char is allowed.

Examples:

  // matches 'hello \' world' => 'hello ' world'
  [Lexeme(GenericToken.String,"'","\\")]
  STRING

or

  // matches 'that''s my hello world' => 'that's my hello world'
  [Lexeme(GenericToken.String,"'","'")]
  STRING

Many string patterns

Several string patterns are allowed in the same lexer. For instance, you may want to match double-quote-delimited strings as well as single-quote-delimited strings. To do this, simply apply several Lexeme attributes to the same enum value:

    // matches 'hello \' world' => 'hello ' world'
    // as well as "hello \" world" => "hello " world"
    [Lexeme(GenericToken.String,"\"","\\")]
    [Lexeme(GenericToken.String,"'","\\")]
    STRING

Comments

The generic lexer offers support for comments.

Comments are removed from the token stream before parsing starts, so the parser never sees them. You can still retrieve them, for any special purpose, by using the lexer directly.

Comment declaration

Comments are declared with dedicated attributes on an enum value. The attributes declare the comment delimiters:

    [Comment(singleline, multilinestart, multilineend)]
    COMMENT,

  • singleline : the single line comment delimiter ("//" in all C derived languages)
  • multilinestart : the starting multi line comment delimiter ("/*" in all C derived languages)
  • multilineend : the closing multi line comment delimiter ("*/" in all C derived languages)

    [SingleLineComment(singleline)]
    SINGLE_LINE_COMMENT,

  • singleline : the single line comment delimiter ("//" in all C derived languages)

    [MultiLineComment(multilinestart, multilineend)]
    MULTI_LINE_COMMENT,

  • multilinestart : the starting multi line comment delimiter ("/*" in all C derived languages)
  • multilineend : the closing multi line comment delimiter ("*/" in all C derived languages)

Full example, for a dumb language (generic token based)

    using sly.lexer;

    public enum WhileTokenGeneric
    {

        #region keywords 0 -> 19
        
        [Lexeme(GenericToken.KeyWord,"if")]
        IF = 1,

        [Lexeme(GenericToken.KeyWord, "then")]
        THEN = 2,

        [Lexeme(GenericToken.KeyWord, "else")]
        ELSE = 3,

        [Lexeme(GenericToken.KeyWord, "while")]
        WHILE = 4,

        [Lexeme(GenericToken.KeyWord, "do")]
        DO = 5,

        [Lexeme(GenericToken.KeyWord, "skip")]
        SKIP = 6,

        [Lexeme(GenericToken.KeyWord, "true")]
        TRUE = 7,

        [Lexeme(GenericToken.KeyWord, "false")]
        FALSE = 8,

        [Lexeme(GenericToken.KeyWord, "not")]
        NOT = 9,

        [Lexeme(GenericToken.KeyWord, "and")]
        AND = 10,

        [Lexeme(GenericToken.KeyWord, "or")]
        OR = 11,

        [Lexeme(GenericToken.KeyWord, "print")]
        PRINT = 12,

        #endregion

        #region literals 20 -> 29

        // identifier with IdentifierType.AlphaNumDash pattern
        [Lexeme(GenericToken.Identifier, IdentifierType.AlphaNumDash)]
        IDENTIFIER = 20,

        [Lexeme(GenericToken.String)]
        STRING = 21,

        [Lexeme(GenericToken.Int)]
        INT = 22,

        #endregion

        #region operators 30 -> 49

        [Lexeme(GenericToken.SugarToken,">")]
        GREATER = 30,

        [Lexeme(GenericToken.SugarToken, "<")]
        LESSER = 31,

        [Lexeme(GenericToken.SugarToken, "==")]
        EQUALS = 32,

        [Lexeme(GenericToken.SugarToken, "!=")]
        DIFFERENT = 33,

        [Lexeme(GenericToken.SugarToken, ".")]
        CONCAT = 34,

        [Lexeme(GenericToken.SugarToken, ":=")]
        ASSIGN = 35,

        [Lexeme(GenericToken.SugarToken, "+")]
        PLUS = 36,

        [Lexeme(GenericToken.SugarToken, "-")]
        MINUS = 37,


        [Lexeme(GenericToken.SugarToken, "*")]
        TIMES = 38,

        [Lexeme(GenericToken.SugarToken, "/")]
        DIVIDE = 39,

        #endregion 

        #region sugar 50 -> 99

        [Lexeme(GenericToken.SugarToken, "(")]
        LPAREN = 50,

        [Lexeme(GenericToken.SugarToken, ")")]
        RPAREN = 51,

        [Lexeme(GenericToken.SugarToken, ";")]
        SEMICOLON = 52,

        #endregion
        
        #region comments : C like comments
        
        [Comment("//","/*","*/")]
        COMMENTS = 100,

        #endregion

        EOF = 0

    }