Comparison With Pygments
- CodeRay is a Ruby library, Pygments is written in Python.
- CodeRay supports 19 languages, while Pygments supports over 90.
- CodeRay has handwritten scanners. In Pygments, scanners are defined with a scanner DSL.
The last two differences in the list above are very much related.
Handwritten scanners (CodeRay)
Pro:
- faster
- lots of fine tuning is possible
- no overhead for DSL transformation and interpretation
- more flexible
Contra:
- writing scanners is a lot of work
- almost nobody understands how to create good scanners
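To make the trade-off concrete, here is a minimal sketch of what a handwritten scanner for diffs might look like. It is written in Python purely for illustration and is not CodeRay's actual Diff scanner; the function name `scan_diff` and the token kind names are assumptions.

```python
def scan_diff(text):
    """Yield (kind, line) pairs for a diff, one token per line.

    Hypothetical handwritten scanner: every rule is ordinary code,
    so any branch can be special-cased or tuned by hand.
    """
    for line in text.splitlines(keepends=True):
        if line.startswith("+"):
            kind = "insert"
        elif line.startswith("-"):
            kind = "delete"
        elif line.startswith("@"):
            kind = "change"
        elif line.startswith(("Index", "=")):
            kind = "head"
        else:
            kind = "plain"
        yield kind, line


tokens = list(scan_diff("Index: a.txt\n-old\n+new\n same\n"))
```

Nothing is interpreted at run time, which is where the speed and flexibility come from, but each new language means writing and debugging a function like this from scratch.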
Scanner DSL (Pygments)
(Note: In Pygments, scanners are called “lexers”.)
Pro:
- easier to write, read, and maintain
- less code
- even beginners can write decent scanners
- DSL interpreter can be optimized/changed independently
- porting scanners is easier
- use of higher-level features (like token groups or stacks) is simple
Contra:
- may need hacks for complex languages (e.g., the ExtendedRegexLexer)
A common scanner/lexer definition language that can be read by both Pygments and hypothetical ports in other languages would be most useful. The definitions could be maintained in a shared code repository.
Here’s a spontaneous example of a possible JSON representation:
{
  "name": "Diff",
  "aliases": ["diff"],
  "filenames": ["*.diff"],
  "tokens": {
    "root": [
      [" .*\n", "Text"],
      ["\\+.*\n", "Generic.Inserted"],
      ["-.*\n", "Generic.Deleted"],
      ["@.*\n", "Generic.Subheading"],
      ["Index.*\n", "Generic.Heading"],
      ["=.*\n", "Generic.Heading"],
      [".*\n", "Text"]
    ],
    ...
  }
}
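As a rough sketch of how a port might consume such a definition, the following Python snippet loads the rules of a single state and applies them in order. It is not how Pygments works internally; `load_rules`, `tokenize`, and the `Error` fallback are assumptions made for this example, and only the `root` state is handled.

```python
import json
import re

# Hypothetical definition in the JSON shape described above ("root" state only).
# The raw string keeps the JSON escapes (\n, \\+) intact for json.loads.
DEFINITION = r'''
{
  "name": "Diff",
  "tokens": {
    "root": [
      [" .*\n", "Text"],
      ["\\+.*\n", "Generic.Inserted"],
      ["-.*\n", "Generic.Deleted"],
      ["@.*\n", "Generic.Subheading"],
      ["Index.*\n", "Generic.Heading"],
      ["=.*\n", "Generic.Heading"],
      [".*\n", "Text"]
    ]
  }
}
'''


def load_rules(definition_json, state="root"):
    """Compile the (pattern, token type) pairs of one state."""
    definition = json.loads(definition_json)
    return [(re.compile(pattern), kind)
            for pattern, kind in definition["tokens"][state]]


def tokenize(text, rules):
    """Try the rules in order at each position; the first match wins."""
    pos = 0
    while pos < len(text):
        for regex, kind in rules:
            match = regex.match(text, pos)
            if match and match.end() > pos:
                yield kind, match.group()
                pos = match.end()
                break
        else:
            # No rule matched: emit one character as an error token and move on.
            yield "Error", text[pos]
            pos += 1


rules = load_rules(DEFINITION)
tokens = list(tokenize("Index: a.txt\n-old\n+new\n same\n", rules))
```

The interpreter is about twenty lines and knows nothing about diffs, so the same loop could serve every definition in a shared repository.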
Python’s regexps are more powerful than the regexps of Ruby 1.8, and less powerful than the new Ruby 1.9 ones. However, most expressions used in the scanners can be interpreted by all engines. Ruby’s StringScanner has some limitations in the use of regexps.
CodeRay represents a token's type as a token kind, which is just a Ruby :symbol (source).
Pygments uses a hierarchical token type/subtype system (source), which is more complex to implement (and slower), but more flexible and easier to understand for authors of new language definitions.
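The difference can be sketched in a few lines of Python. This is a simplified model of a hierarchical type system, not Pygments' actual implementation; the `TokenType` class, its methods, and `style_for` are assumptions for illustration. A flat token kind, by contrast, would be a single name with no ancestors to fall back to.

```python
class TokenType(tuple):
    """Hypothetical hierarchical token type: a path like ('Generic', 'Inserted')."""

    def child(self, name):
        return TokenType(self + (name,))

    def is_within(self, ancestor):
        # A type belongs to an ancestor if the ancestor's path is a prefix of its own.
        return self[:len(ancestor)] == ancestor

    def __str__(self):
        return ".".join(self)


Generic = TokenType(("Generic",))
Inserted = Generic.child("Inserted")
Deleted = Generic.child("Deleted")

# Only some types have a style of their own in this example.
STYLES = {Generic: "bold", Inserted: "green"}


def style_for(token_type, styles):
    """Look up the most specific style, falling back to ancestor types."""
    for length in range(len(token_type), 0, -1):
        style = styles.get(TokenType(token_type[:length]))
        if style is not None:
            return style
    return None
```

Here `style_for(Deleted, STYLES)` falls back to the style of `Generic`. That fallback is the flexibility the hierarchy buys, and the extra lookups are part of its cost.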
CodeRay supports token groups, which map nicely to SPANs in the HTML output. A token group has a token kind and can contain tokens and other token groups. The final color of a token depends on the group nesting it is in (for example, string/delimiter has a different color than regexp/delimiter). Groups are represented with special :open and :close tokens.
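A small Python sketch shows how a flat stream with open and close markers can turn into nested SPANs. The token representation, the `OPEN`/`CLOSE` sentinels, and the class names are assumptions for this example and do not mirror CodeRay's encoder.

```python
from html import escape

# Hypothetical stand-ins for the :open and :close marker tokens.
OPEN, CLOSE = object(), object()


def render_html(tokens):
    """Render (content, kind) pairs; group markers become nested SPANs."""
    parts = []
    for content, kind in tokens:
        if content is OPEN:
            parts.append(f'<span class="{kind}">')
        elif content is CLOSE:
            parts.append("</span>")
        else:
            parts.append(f'<span class="{kind}">{escape(content)}</span>')
    return "".join(parts)


# A double-quoted string: the group is "string", the quotes are "delimiter" tokens.
stream = [
    (OPEN, "string"),
    ('"', "delimiter"),
    ("hi", "content"),
    ('"', "delimiter"),
    (CLOSE, "string"),
]
html = render_html(stream)
```

With output nested like this, a CSS descendant selector such as `.string .delimiter` can style the quotes differently from `.regexp .delimiter` without the scanner emitting a separate token kind for each combination.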
Token groups allow CSS-style color definitions, which are most useful for HTML output. Pygments doesn’t have a comparable feature; you can see that strings are usually a single token in Pygments, while the delimiting quotes are usually separate tokens in CodeRay.
CodeRay is optimized for HTML/CSS output. The concept of token groups may be ported to LaTeX or console output, but it’s not trivial.
Pygments has filters, which manipulate the token stream in some way. You can do some cool tricks with these. CodeRay currently lacks such a feature.
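The idea behind filters can be sketched with plain Python generators that wrap a token stream. This does not use the Pygments filter API; both filter functions and the token kind names are hypothetical.

```python
def uppercase_keywords(stream):
    """Hypothetical filter: pass tokens through, upper-casing keyword text."""
    for kind, text in stream:
        if kind == "Keyword":
            text = text.upper()
        yield kind, text


def merge_adjacent(stream):
    """Hypothetical filter: merge consecutive tokens of the same kind."""
    previous = None
    for kind, text in stream:
        if previous is not None and previous[0] == kind:
            previous = (kind, previous[1] + text)
        else:
            if previous is not None:
                yield previous
            previous = (kind, text)
    if previous is not None:
        yield previous


tokens = [("Keyword", "select"), ("Text", " "), ("Text", " "), ("Name", "x")]
filtered = list(merge_adjacent(uppercase_keywords(tokens)))
```

Because each filter takes a stream and yields a stream, filters chain freely and sit between the scanner and the output encoder without either having to know about them.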
Both Pygments and CodeRay can be extended via plugins. The details differ, but writing a plugin is simple in both.