fix(lexer): allow unicode sequences in tokens #1621

Zyclotrop-j · 2021-09-04T04:39:54Z

Allow tokens to use patterns like '/\u{10334}/u'.
Change addStartOfInput and addStickyFlag to keep the 'u' flag.

Fixes #1620
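
To illustrate the idea (not the actual diff), here is a minimal sketch of rebuilding a token pattern while passing its existing flags through. The helper names `withStickyFlag`, `withStartOfInput` and `compiles` are hypothetical stand-ins and are not Chevrotain's real `addStickyFlag` / `addStartOfInput` implementations. The last helper shows one way a "Range out of order in character class" error can arise when a `\u{...}` escape is compiled without the 'u' flag; whether that matches the exact pattern in #1620 is an assumption.

```typescript
// Hypothetical helpers for illustration only; not Chevrotain source code.

// Copy `pattern`, adding the sticky ('y') flag and keeping every flag
// that is already present (including 'u').
function withStickyFlag(pattern: RegExp): RegExp {
  const flags =
    pattern.flags.indexOf("y") === -1 ? pattern.flags + "y" : pattern.flags;
  return new RegExp(pattern.source, flags);
}

// Copy `pattern`, anchoring it at the start of the input and passing the
// original flags through unchanged.
function withStartOfInput(pattern: RegExp): RegExp {
  return new RegExp("^(?:" + pattern.source + ")", pattern.flags);
}

// Report whether `source` compiles as a regular expression with `flags`.
function compiles(source: string, flags: string): boolean {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    return false;
  }
}

// U+10334 lies outside the Basic Multilingual Plane, so the \u{...} escape
// is only interpreted as a code point when the 'u' flag is set.
const tokenPattern = /\u{10334}/u;

const sticky = withStickyFlag(tokenPattern);
const anchored = withStartOfInput(tokenPattern);

console.log(sticky.flags); // 'u' is still present next to 'y'
console.log(anchored.test("\u{10334}"));

// Without 'u', "\u{" is read as a literal 'u' followed by '{', so the class
// below contains the range '}'-'u', which is out of order.
console.log(compiles("[\\u{10330}-\\u{10334}]", ""));
console.log(compiles("[\\u{10330}-\\u{10334}]", "u"));
```

Dropping the flags when rebuilding the pattern (for example `new RegExp(pattern.source, "y")`) would silently change how `\u{...}` is interpreted, which is why the flags are passed through here.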

bd82 · 2021-10-29T15:01:27Z

As mentioned in the linked issue, I am closing this, as it should be resolved as part of a much larger feature in #1670.

bd82 · 2021-10-29T15:04:09Z

Thanks for the effort @Zyclotrop-j 👍 Unfortunately, I am concerned that merging this partial fix would cause new bugs and strange behaviors, so we will have to wait for a full resolution.

Zyclotrop-j · 2021-10-30T00:42:48Z

Hi @bd82,
Thanks for the heads-up.
What is the timeline for the full resolution, and how can I help to get it done so that Unicode sequence support is introduced?

bd82 · 2021-10-30T12:35:06Z

Hello @Zyclotrop-j

There is no timeline: this is my free-time side project, so it depends on how much free time and energy I have and on which item I happen to pick up next.

If your workaround is good enough for you, you can try applying it to your repo via "patch-package":

https://www.npmjs.com/package/patch-package
Make sure to disable lexer optimizations (in the Lexer constructor).

That will likely be the fastest way you can integrate it.

Regarding contributions: normally I would be very happy to accept them.
With this issue it could be more complicated, though, as I am not sure what the best way to approach the problem is
(see #1670).

At the moment I believe that my approach would be to:

1. Map missing capabilities / bugs in regexp-to-ast.
2. Implement most/all of the missing capabilities.
3. Move regexp-to-ast into this repo.
4. Update Chevrotain source code to use the new capabilities (this is part of what you have implemented here).

So this seems quite a bit more complicated than a simple feature/fix contribution PR.
And the plan may change, e.g. if I discover that step (2) is too complex, I may choose to make another attempt with
the "regexpp" library.

fix(lexer): allow unicode sequences in tokens

0a04978

Allow tokens to use patterns like '/\u{10334}/u' Change addStartOfInput and addStickyFlag to keep the 'u' flag Fixes #1620

Zyclotrop-j mentioned this pull request Sep 4, 2021

Unicode in pattern: "Range out of order in character class" Error #1620

Closed

bd82 closed this Oct 29, 2021
