
Very slow detection of ambiguities. #472

Closed
daiyam opened this issue May 19, 2017 · 12 comments
daiyam commented May 19, 2017

Hi @bd82,

I want to parse a script with array range and array comprehension.
The best syntax would be:

Statement
  : 'IDENTIFIER' '=' Array ('NL' | 'EOF')

Expression
  : Operand (Operator Operand)*

Array
  : '['
    (
      Operand '..' Operand // range
      |
      'NL'* Expression
        (
          (('NL'+ | 'NL'* ',' 'NL'*) Expression)* // values
          |
          'FOR' 'IDENTIFIER' 'IN' Expression // comprehension
        )?
        'NL'*
    )
    ']'

With this grammar, I get an infinite loop.
It must be due to the backtracking needed between Expression and Operand.

Anyway, I was able to have a working syntax with:

Array
  : '['
    (
      'NL'+ Expression
        (
          (('NL'+ | 'NL'* ',' 'NL'*) Expression)*
          |
          'FOR' 'IDENTIFIER' 'IN' Expression
        )?
        'NL'*
      |
      Operand
        (
          '..' Operand
          |
            (Operator Operand)*
            (
              (('NL'+ | 'NL'* ',' 'NL'*) Expression)*
              |
              'FOR' 'IDENTIFIER' 'IN' Expression
            )?
            'NL'*
        )
    )
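
A note on why the second version works: the common Operand prefix of the two original alternatives has been factored out, so after parsing an Operand the parser can commit to a branch based on a single token of lookahead. A minimal sketch of that decision (the function and token names are invented for illustration, this is not Chevrotain code):

```javascript
// After the shared Operand prefix has been parsed, one token of
// lookahead is enough to pick a branch of the factored grammar.
function classifyAfterOperand(nextToken) {
  if (nextToken === "..") return "range"; // Operand '..' Operand
  return "expression";                    // (Operator Operand)* (values | comprehension)
}

console.log(classifyAfterOperand("..")); // "range"
console.log(classifyAfterOperand("+"));  // "expression"
```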

bd82 commented May 19, 2017

Can you provide a minimal parser that reproduces the infinite loop?

While it does seem to make sense to have an ambiguity between the range operator in an array
and general operators in an expression, a runtime infinite loop should not happen, or should at least be detected earlier.


daiyam commented May 19, 2017

Hi @bd82,

After preparing the test case, I've discovered:
There is no infinite loop and the conflict is detected, but performSelfAnalysis() takes a very long time.

With the invalid rule:

Ambiguous alternatives: <1 ,2> in <OR> inside <Array> Rule,
<STRING, LEFT_SQUARE, STRING, RIGHT_SQUARE, RIGHT_SQUARE> may appears as a prefix path in all these alternatives.
To Resolve this, try one of of the following: 
...
    at Function.Parser.performSelfAnalysis
...

real	1m16.449s
user	1m14.429s
sys	0m0.583s

With the working rule:

real	0m0.289s
user	0m0.246s
sys	0m0.042s

The more rules I add, the more time it takes.


bd82 commented May 20, 2017

> After preparing the test case, I've discovered:
> There is no infinite loop and the conflict is detected, but performSelfAnalysis() takes a very long time.

This makes more sense.
The logic that looks for ambiguities and computes the lookahead functions is a sort of grammar interpreter,
which is not very fast (it only needs to run once).

But combined with ambiguities and certain grammars, its runtime can start growing exponentially.

A user can work around this by reducing the maximum lookahead number (part of the parser config options), but it is best if this is solved or mitigated by the framework.
Perhaps I can improve its runtime performance by saving intermediate results.
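
The "saving intermediate results" idea can be illustrated with a toy version of such a grammar interpreter. This is only a sketch under assumed representations: the TERM/SEQ/ALT node shapes and the prefixes function are made up for illustration, not Chevrotain internals. The set of k-token prefixes reachable from a node is memoized per (node, k), so a sub-rule shared by several alternatives is interpreted only once.

```javascript
// Toy grammar nodes; purely illustrative, not Chevrotain's representation.
const TERM = (name) => ({ kind: "term", name });
const SEQ = (...items) => ({ kind: "seq", items });
const ALT = (...alts) => ({ kind: "alt", alts });

// All token paths of length up to `k` reachable from `node`, with results
// memoized per (node, k) so shared sub-rules are only walked once.
function prefixes(node, k, memo = new Map()) {
  if (k <= 0) return [[]];
  let perNode = memo.get(node);
  if (perNode === undefined) memo.set(node, (perNode = new Map()));
  if (perNode.has(k)) return perNode.get(k);

  let result;
  if (node.kind === "term") {
    result = [[node.name]];
  } else if (node.kind === "alt") {
    result = node.alts.flatMap((a) => prefixes(a, k, memo));
  } else {
    const [first, ...rest] = node.items;
    const heads = prefixes(first, k, memo);
    result = rest.length === 0
      ? heads
      : heads.flatMap((h) =>
          h.length >= k
            ? [h]
            : prefixes(SEQ(...rest), k - h.length, memo).map((t) => h.concat(t))
        );
  }
  perNode.set(k, result);
  return result;
}

// Two alternatives sharing an Operand-like prefix, as in the Array rule:
const ID = TERM("Identifier");
const range = SEQ(ID, TERM(".."), ID);
const exprList = SEQ(ID, TERM(","), ID);
console.log(JSON.stringify(prefixes(range, 1)));    // [["Identifier"]]
console.log(JSON.stringify(prefixes(exprList, 1))); // [["Identifier"]] -> ambiguous at k=1
console.log(JSON.stringify(prefixes(range, 2)));    // [["Identifier",".."]] -> resolved at k=2
```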

I will still need a grammar that reproduces the slowness.

@daiyam

  • Is your parser publicly available?
  • Can you create a branch in your repo which reproduces the slowness issue?


daiyam commented May 20, 2017

I've published the test case at https://github.com/daiyam/issues/tree/master/chevrotain/472

I have an old PC, so on a modern PC the slow performSelfAnalysis() should take under 20s 😉


bd82 commented May 20, 2017

Ok great.

I can reproduce it using your example Repo.
Takes about 40 seconds on my 2015 MBP.

There seem to be a great many ambiguity errors; perhaps another approach
to reduce the runtime would be to set a maximum number of errors and stop looking
after that number has been reached.


daiyam commented May 20, 2017

Just by commenting out lines 123-146 and uncommenting lines 147-185, there is no ambiguity.

I hadn't looked at the number of errors.
8437 errors is a little too much 😏

10-20 errors should be good enough.


bd82 commented May 20, 2017

Right, I've debugged this a bit.

It is more complicated than I initially assumed.

  • The performance bottleneck is here, during the checking of ambiguous alternatives.
  • While I could halt on some max-errors setting, a prior step that runs before any error is found (and thus can never halt at that stage) takes about 50% of the time.

I would have to refactor the whole ambiguity checking logic (if possible) to enable halting
after a maximum number of errors. So this needs further investigation.

Another thing I noticed is that when I set the "maxLookahead" option on the parser to 4 (by default it is 5), instead of waiting 40 seconds for the parser initialization error, it only takes 800ms.
So another approach may be to reduce the default maxLookahead to 4,
as the performance grows exponentially in relation to the max lookahead.
This may be a breaking change, but only for a few grammars (most grammars do not require K=5), and it will provide a clear error message.
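
The exponential relationship between maxLookahead and work can be seen from a back-of-the-envelope count: if each position in the lookahead can be continued by roughly b alternative tokens, there are on the order of b^k candidate paths of length k per alternative. A tiny illustration (the branching factor is made up):

```javascript
// Number of candidate k-token lookahead paths when each position has
// `branching` possible continuations: branching^k, exponential in k.
function pathCount(branching, k) {
  let total = 1;
  for (let i = 0; i < k; i++) total *= branching;
  return total;
}

console.log(pathCount(6, 4)); // 1296
console.log(pathCount(6, 5)); // 7776 -> one extra token of lookahead, 6x the work
```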


daiyam commented May 21, 2017

If maxLookahead=4 can avoid what looks like an infinite loop, I think it's good for the parser.
And since we can configure maxLookahead, changing the default shouldn't be too much of a problem.


bd82 commented May 22, 2017

> If maxLookahead=4 can avoid what looks like an infinite loop, I think it's good for the parser.
> And since we can configure maxLookahead, changing the default shouldn't be too much of a problem.

Yes. I may even change it to 3 by default.

But I still want to free up some time for a deeper look into the ambiguity detection algorithm.
Instead of computing all the possible paths of length K for each alternative
and only then checking whether any pairs cause an ambiguity,
an alternative approach would be to choose an alternative
and then walk its "paths" "in parallel" with the other alternatives until an **un**ambiguous path
is found or we "run out" of lookahead.

That way it would be very fast in the common case of no ambiguities,
and it would be easy to halt after 10-20 errors have been found in the case of thousands of ambiguous paths.
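
A rough sketch of what halting early might look like: compare the lookahead paths of two alternatives and stop reporting once a cap is hit. The function and parameter names here are invented for illustration; this is not Chevrotain's actual checker.

```javascript
// Report k-token prefixes that appear in both alternatives' path sets,
// halting as soon as `maxErrors` ambiguities have been collected.
function findAmbiguities(pathsA, pathsB, k, maxErrors = 20) {
  const key = (p) => p.slice(0, k).join(" ");
  const prefixesOfA = new Set(pathsA.map(key));
  const ambiguities = [];
  for (const path of pathsB) {
    if (prefixesOfA.has(key(path))) {
      ambiguities.push(path.slice(0, k));
      if (ambiguities.length >= maxErrors) break; // halt early
    }
  }
  return ambiguities;
}

const a = [["ID", "..", "ID"]];
const b = [["ID", "..", "NUM"], ["ID", ",", "ID"]];
console.log(findAmbiguities(a, b, 2)); // one shared 2-token prefix: ID '..'
console.log(findAmbiguities(a, b, 3)); // empty: 3 tokens of lookahead disambiguate
```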


bd82 commented Jun 11, 2017

I've played around with this idea.
I am afraid it will only mitigate the problem, not resolve it.
Because to prove that a path has no ambiguity with another alternative, in the worst case all other paths
must be examined.

My plan is:

  • Change the default maxLookahead to 4; this should suffice for 99% of grammars.

  • Document the possibility of very slow initialization when using larger values with an ambiguous grammar.

  • Add a backlog item to #370 (Development Mode) to trace the initialization phases, so
    it would be clear to the end user in which phase the parser got "stuck".

@bd82 bd82 changed the title Backtracking & Infinity Loop Very slow detection of ambiguities Jun 11, 2017
@bd82 bd82 added the Bug 🪲 label Jun 11, 2017
@bd82 bd82 changed the title Very slow detection of ambiguities Very slow detection of ambiguities. Jun 11, 2017

bd82 commented Jun 11, 2017

To re-enable the old behavior of five tokens of lookahead, pass the optional maxLookahead argument
during parser construction.

class MyParser extends Parser {

  constructor(input) {
    // explicit passing of the maxLookahead argument with the old default value.
    super(input, ALL_TOKENS, {maxLookahead: 5})

    // grammar rules...
    Parser.performSelfAnalysis(this)
  }
}

bd82 added a commit that referenced this issue Jun 11, 2017
bd82 added a commit that referenced this issue Jun 11, 2017
@bd82 bd82 closed this as completed in #490 Jun 11, 2017

bd82 commented Jun 13, 2017

Decided to change the maxLookahead to 4 instead of 3.

There is a common pattern of a lambda function versus a parenthesized expression that requires 4 tokens
of lookahead to distinguish.

// lambda expression: need to read ahead to find the fat arrow ("=>")
(x) => x
// parenthesized expression: identical prefix for the first 3 tokens of the lambda expression
(x)
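
The required lookahead for that pair is just the position of the first differing token plus one, which a quick sketch can confirm (the token streams are hand-written here for illustration):

```javascript
// Minimum lookahead depth needed to distinguish two token streams:
// the index of the first position where they differ, plus one.
function neededLookahead(tokensA, tokensB) {
  const len = Math.max(tokensA.length, tokensB.length);
  for (let i = 0; i < len; i++) {
    if (tokensA[i] !== tokensB[i]) return i + 1;
  }
  return Infinity; // identical streams can never be distinguished
}

const lambda = ["(", "Identifier", ")", "=>", "Identifier"];
const paren  = ["(", "Identifier", ")"];
console.log(neededLookahead(lambda, paren)); // 4 -> hence the default maxLookahead of 4
```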

bd82 pushed a commit that referenced this issue Jun 13, 2017