A simple lexical analyzer for programming and human languages.
There are two easy ways to get lexeme on your box. You can either install the Ruby gem:
gem install lexeme
or download the latest release archive from http://www.vladimirivic.com/lexeme/.
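If you manage dependencies with Bundler, you can also add the gem to your Gemfile (standard Bundler usage, nothing specific to lexeme):

gem 'lexeme'

and then run bundle install.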
Just look under the example directory for a quick example of how the library can be used to efficiently tokenize mathematical expressions such as 1 + 3 - sin(0)/cos(1) * pow(6).
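For a sense of what such a ruleset might look like, here is a minimal sketch; the token names and exact patterns are illustrative, not necessarily the ones shipped in the example directory:

require 'lexeme'

# Illustrative ruleset for simple math expressions
lexer = Lexeme.define do
  token :PLUS   => /^\+$/
  token :MINUS  => /^\-$/
  token :MULTI  => /^\*$/
  token :DIV    => /^\/$/
  token :LPAREN => /^\($/
  token :RPAREN => /^\)$/
  token :NUMBER => /^\d+(\.\d+)?$/   # integers and decimals
  token :FUNC   => /^(sin|cos|pow)$/ # function names
end

tokens = lexer.analyze do
  from_string '1 + 3 - sin(0)/cos(1) * pow(6)'
end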
However, since tokenizing mathematical expressions may not be sufficient for a modern-day programming language, a good second demonstration is tokenizing pseudocode.
Let's say we have the source code of a small pseudo program saved in a file named pseudo-code.src:
func hello_world
x = 1
y = x + 2
print "Hello"
fin
We can see that this language uses a handful of lexemes, so we will define them as the lexer's ruleset. To keep things as simple as possible, I'll place the language definition and the lexical analyzer call in the same file. Ideally, the language definition is something you would write and include separately.
Our Ruby code should look like this:
require 'lexeme'
lexer = Lexeme.define do
  token :EQ       => /^=$/
  token :PLUS     => /^\+$/
  token :MINUS    => /^\-$/
  token :MULTI    => /^\*$/
  token :DIV      => /^\/$/
  token :NUMBER   => /^\d+(\.\d+)?$/       # integers and decimals
  token :RESERVED => /^(fin|print|func)$/  # reserved keywords
  token :STRING   => /^".*"$/
  token :ID       => /^[a-zA-Z_]\w*$/      # identifiers; listed after RESERVED so keywords win
end
tokens = lexer.analyze do
  from_file 'pseudo-code.src'
end
tokens.each do |t|
  puts "#{t.line} => #{t.name}: #{t.value}"
end
Once run, the code above should output (line => token_id: token_value):
1 => RESERVED: func
1 => ID: hello_world
2 => ID: x
2 => EQ: =
2 => NUMBER: 1
3 => ID: y
3 => EQ: =
3 => ID: x
3 => PLUS: +
3 => NUMBER: 2
4 => RESERVED: print
4 => STRING: "Hello"
5 => RESERVED: fin
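Notice that func, print, and fin come out as RESERVED even though they would also match the ID pattern: the rules appear to be tried in the order they are defined, so keep more specific patterns (keywords, numbers, strings) ahead of a catch-all ID rule.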
Lexeme can also be used for natural language processing. Here's a quick example of how to do it:
require 'lexeme'
puts "Greetings from Los Angeles!!".tokenize
Running this code will produce:
[WORD: Greetings, WORD: from, WORD: Los, WORD: Angeles, EXCL: !, EXCL: !]
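The output suggests that tokenize returns an array of token objects, so you should be able to iterate over them just like the result of analyze (a sketch, assuming the same name/value accessors used elsewhere in this README):

require 'lexeme'

tokens = "Greetings from Los Angeles!!".tokenize
tokens.each do |t|
  puts "#{t.name}: #{t.value}"
end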
A more advanced example with customized syntactical rules would look something like this:
require 'lexeme'
lexer = Lexeme.define do
  token :STOP   => /^\.$/
  token :COMMA  => /^,$/
  token :QUES   => /^\?$/
  token :EXCLAM => /^!$/
  token :QUOT   => /^"$/
  token :APOS   => /^'$/
  token :WORD   => /^[\w\-]+$/
end
tokens = lexer.analyze do
  from_string 'Hello! My name is Inigo Montoya. You killed my father. Prepare to die.'
end
tokens.each do |t|
  puts "#{t.name}: #{t.value}"
end
This will output:
WORD: Hello
EXCLAM: !
WORD: My
WORD: name
WORD: is
WORD: Inigo
WORD: Montoya
STOP: .
WORD: You
WORD: killed
WORD: my
WORD: father
STOP: .
WORD: Prepare
WORD: to
WORD: die
STOP: .
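Once you have the token stream, plain Ruby takes over. For example, a quick word-frequency count over the WORD tokens from the run above (a sketch; to_s is used because this README doesn't say whether token names are symbols or strings):

# Count how often each word appears, ignoring case and punctuation tokens
counts = Hash.new(0)
tokens.each do |t|
  counts[t.value.downcase] += 1 if t.name.to_s == 'WORD'
end
counts.sort_by { |_, n| -n }.each { |word, n| puts "#{n} #{word}" }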
Version 0.0.5
- Added a line number to each token object. Useful for hinting errors to the user (thanks Rick).

tokens.each do |token|
  puts "#{token.line} => #{token.name}: #{token.value}"
end
Any help on this project is very welcome. Please feel free to fork, modify, and open pull requests. Also make sure you check the TODO file when it is present in the repository.
Lexeme was written by Vladimir Ivic (vladimir.ivic at icloud.com) and is released under the MIT license.