lexer errors on regular expressions #7
Comments
There is a remaining error in rule design to do with regexes; this will be fixed soon.
Did a quick test of your regular expression changes. I don't know if it was meant to fix most cases already, but testing is just a minute of work with the tests of my work-in-progress parser.
It does however parse:
The messages:
followed by more errors due to no multiline-string support yet. Directly before logging these errors, ANTLR also reports to my ANTLRErrorListener that it is attempting full context. I haven't checked what that means yet.
I don't see any more regular expression parse errors for now. But I have not done a full test; for example, I have not tested regular expressions with non-western characters.
This problem is not yet fixed; the current 'fix' is an unreliable band-aid - I am still researching the best solution for this one.
I know of some people in our company who have experience with ANTLR. I'll try asking them for suggestions. I do think this is a language construct that is hard to parse the way it is now, also because the following ADL seems to be valid and has a valid use case, but is not a regular expression:
A rather ugly solution would be to create a preprocessor that matches all regular expressions with a line-based scanner, then replaces the '/' characters at the beginning and end of each with '^'. Only then run the lexer.
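A rough sketch of what such a preprocessor could look like, written in Python purely for illustration (it is not part of the grammar project). The pattern `REGEX_CONSTRAINT` and the function `swap_regex_delimiters` are hypothetical names, and the sketch assumes a regex constraint always has the shape '{', optional whitespace, '/', a body without unescaped '/', '/', optional whitespace, '}' on a single line. It ignores assumed values and regex bodies that themselves contain '^'.

```python
import re

# Hypothetical pattern: '{' ws* '/' body '/' ws* '}', where the body may
# contain escaped characters (such as \/) but no unescaped '/'.
REGEX_CONSTRAINT = re.compile(r"\{(\s*)/((?:\\.|[^/\\\n])*)/(\s*)\}")

def swap_regex_delimiters(line: str) -> str:
    """Rewrite {/body/} to {^body^} on one line, leaving paths untouched."""
    return REGEX_CONSTRAINT.sub(
        lambda m: "{" + m.group(1) + "^" + m.group(2) + "^" + m.group(3) + "}",
        line,
    )

def preprocess(text: str) -> str:
    """Apply the delimiter swap line by line before the text reaches the lexer."""
    return "\n".join(swap_regex_delimiters(line) for line in text.split("\n"))
```

A path such as {/start/test ...} is left alone under these assumptions, because the first '/' after the opening brace is not followed by a closing '}'.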
The current rules for regex are a nasty hack; they are in the cadl_primitives.g4 file, and are currently as follows:
The remaining problem is that if you introduce lexer rules for the regex part, a regex expression can have almost anything in it, and the lexer will then start matching other strings of characters that turn up between '/' characters, i.e. in paths. I originally posted to the ANTLR Google group on this, asking if there was a way to use the fact that the previous token was a '{' to conditionally turn the regex matching rule on; I didn't get any clear answer.
One idea is simply to say that regexes in ADL2 have to be delimited by some character other than '/', e.g. the alternate '^' character included in the current grammar. This means I have to change the ADL Workbench to convert all the existing regexes in slots to ^^ form rather than // form; probably not hard, and maybe that's what we should do. But I have this nagging feeling something is wrong - matching regexes correctly was dead easy in my old yacc/lex grammars in Eiffel. ANTLR is difficult for solving some simple problems!
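To make the clash concrete, here is a minimal Python sketch (not the project's grammar). `NAIVE_REGEX_TOKEN` is a made-up stand-in for an unconditional '/'-delimited lexer rule, and the sample path is an assumed, illustrative ADL-style path rather than one taken from this thread.

```python
import re

# Hypothetical stand-in for an unconditional lexer rule:
# '/', one or more characters that are not '/', then '/'.
NAIVE_REGEX_TOKEN = re.compile(r"/[^/\n]+/")

def first_regex_like_token(text: str):
    """Return the first substring the naive rule would claim, or None."""
    m = NAIVE_REGEX_TOKEN.search(text)
    return m.group(0) if m else None

# An illustrative path (assumed for this example).
path = "/data[id2]/items[id3]/value"
print(first_regex_like_token(path))  # prints /data[id2]/
```

The first path segment is swallowed as if it were a regex, which is the kind of mis-tokenisation described above.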
Ah, but I know a way to possibly do that. It's called Lexical Modes. See: https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Lexer+Rules#LexerRules-LexicalModes
You can use the three commands mode, popMode and pushMode the same way you use the skip command on whitespace and comments, like so:
TokenName : «alternative» -> command-name
Then you can specify lexer rules that only exist in a specific mode. So if you do something like:
LPARENT: '{' -> mode('regexp_allowed')
you can switch the lexer to a mode where it allows regular expressions, when it normally does not. Then two questions remain:
Yep, I know about lexical modes. I think they won't help here because there is no way to know when to enter the 'regex allowed' mode. The rule you propose will enter it just because there is a '{', but that symbol is ubiquitous in ADL, and pretty much every possible CADL element can come after it. So that implies that the 'regex allowed' mode is pretty much all of the CADL rules as they now are. I couldn't see a way to define such a rule that would actually work, and also keep the grammar comprehensible. Happy to be proved wrong!
Maybe it is an idea to start another lexical mode at {/ and end it at /}?
I don't know if that can work or not - note that the '{' matching is done inline in the parser rules, which is how I think it should be. If we introduce a lexer pattern for '{/', how does it interact with that other matching?
On 28-10-15 15:00, Thomas Beale wrote:
The only thing I have not tried is if it is possible to change the lexer
I wonder, has it been tried, changing the lexer mode when {/ or /} occurs? Or is it not a viable solution?
You can always say that if a grammar cannot distinguish a mode with certainty, something is wrong with that grammar.
There is a rule that if more than one lexer rule matches the same input sequence, priority goes to the rule that occurs first. Another rule says that the lexer recognizes the most input characters. So if you have the token {/ defined above the token {
It is easy to check by creating a small test grammar.
grammar Test;
You can see that it recognizes {/ as a different token from {
{abc}
Here the rule does not exist, and it reports an error, but parses the rest.
Here also
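The two rules mentioned in this comment (longest match wins, ties go to the rule listed first) can be tried out without ANTLR. The following is a small Python simulation, not the test grammar from the comment; the token names in `RULES` and the function `tokenize` are invented for illustration.

```python
import re

# Hypothetical token rules in priority order (earlier rules win ties).
RULES = [
    ("REGEX_OPEN", re.compile(r"\{/")),
    ("LBRACE", re.compile(r"\{")),
    ("RBRACE", re.compile(r"\}")),
    ("IDENT", re.compile(r"[A-Za-z_]\w*")),
    ("WS", re.compile(r"\s+")),
]

def tokenize(text):
    """Tokenize with longest-match-first, then first-rule-wins on ties."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on equal length the earlier rule is kept.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            # No rule matches here: record an error and skip one character.
            tokens.append(("ERROR", text[pos]))
            pos += 1
            continue
        if best_name != "WS":  # whitespace is skipped
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens
```

With these assumed rules, `tokenize("{/ {abc}")` yields REGEX_OPEN for `{/` and LBRACE for the plain `{`, while an input with no matching rule for some character produces an ERROR entry and tokenizing continues with the rest.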
I just fixed it with a rather simple, pragmatic approach: match { / REGEXP ASSUMED_VALUE? / } as one token in the lexer, plus optional whitespace. I implemented tests, and I could not break it, although I did find #20 (unrelated to regular expressions, but it prevents me from writing more tests). This is simple and small and it covers all cases. You could even do it in the parser or lexer if you allow Java/target-language code in your grammar.
Why matching '{' will not work without custom Java code in your lexer: apart from the fact that you have to match '{ WS* /' instead of '{/', the following valid ADL is the reason:
TYPE[id1] matches {/start/test matches {/this should work/}}
'/start/test' is a path, '/this should work/' is a regexp. So you have to look ahead past the next '/' before you can determine whether it's a regular expression or a path. And you cannot do that with lexical modes.
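The lookahead described here can be sketched as a small hand-written scan. This is a Python illustration only, not the Java code or the lexer rule from the fix; `classify_after_brace` is a hypothetical helper, and it assumes a regex is exactly '{', optional whitespace, '/', body, '/', optional whitespace, '}' with no assumed value.

```python
def classify_after_brace(text: str, start: int) -> str:
    """Classify what follows the '{' at index `start`: 'regex', 'path' or 'other'."""
    i = start + 1
    while i < len(text) and text[i].isspace():
        i += 1  # skip optional whitespace after '{'
    if i >= len(text) or text[i] != "/":
        return "other"  # not a '/'-initial construct at all
    i += 1
    # Scan forward to the next unescaped '/'.
    while i < len(text) and text[i] != "/":
        if text[i] == "\\":
            i += 1  # skip the escaped character
        i += 1
    if i >= len(text):
        return "path"  # no second '/', so it cannot be a delimited regex
    i += 1
    while i < len(text) and text[i].isspace():
        i += 1  # skip optional whitespace before '}'
    # Only a '}' right after the second '/' makes it a regex in this sketch.
    return "regex" if i < len(text) and text[i] == "}" else "path"

example = "TYPE[id1] matches {/start/test matches {/this should work/}}"
outer = example.index("{")
inner = example.index("{", outer + 1)
print(classify_after_brace(example, outer))  # prints path
print(classify_after_brace(example, inner))  # prints regex
```

The decision is only possible after reading past the second '/', which is why a mode switch triggered at '{' or '{/' alone cannot make it.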
There is no regular expression lexer rule present. So if you try to parse for example
You get lexer errors: