
Very slow detection of ambiguities. #472

Closed
daiyam opened this issue May 19, 2017 · 12 comments
daiyam commented May 19, 2017

Hi @bd82,

I want to parse a script with array range and array comprehension.
The best syntax would be:

Statement
  : 'IDENTIFIER' '=' Array ('NL' | 'EOF')

Expression
  : Operand (Operator Operand)*

Array
  : '['
    (
      Operand '..' Operand // range
      |
      'NL'* Expression
        (
          (('NL'+ | 'NL'* ',' 'NL'*) Expression)* // values
          |
          'FOR' 'IDENTIFIER' 'IN' Expression // comprehension
        )?
        'NL'*
    )
    ']'

With this grammar, I get an infinite loop.
It must be due to the backtracking needed between Expression and Operand.

Anyway, I was able to have a working syntax with:

Array
  : '['
    (
      'NL'+ Expression
        (
          (('NL'+ | 'NL'* ',' 'NL'*) Expression)*
          |
          'FOR' 'IDENTIFIER' 'IN' Expression
        )?
        'NL'*
      |
      Operand
        (
          '..' Operand
          |
            (Operator Operand)*
            (
              (('NL'+ | 'NL'* ',' 'NL'*) Expression)*
              |
              'FOR' 'IDENTIFIER' 'IN' Expression
            )?
            'NL'*
        )
    )
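
A note on why the second version works: the common Operand prefix of the two original alternatives has been factored out, so after parsing an Operand the parser can commit to a branch based on a single token of lookahead. A minimal sketch of that decision (the function and token names are invented for illustration, this is not Chevrotain code):

```javascript
// After the shared Operand prefix has been parsed, one token of
// lookahead is enough to pick a branch of the factored grammar.
function classifyAfterOperand(nextToken) {
  if (nextToken === "..") return "range"; // Operand '..' Operand
  return "expression";                    // (Operator Operand)* (values | comprehension)
}

console.log(classifyAfterOperand("..")); // "range"
console.log(classifyAfterOperand("+"));  // "expression"
```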

bd82 commented May 19, 2017

Can you provide a minimal parser that reproduces the infinite loop?

While it does seem to make sense to have an ambiguity between the range operator in an array
and general operators in an expression, a runtime infinite loop should not happen, or should at least be detected earlier.


daiyam commented May 19, 2017

Hi @bd82,

After preparing the test case, I've discovered:
There is no infinite loop and the conflict is detected, but performSelfAnalysis() takes a very long time.

With the invalid rule:

Ambiguous alternatives: <1 ,2> in <OR> inside <Array> Rule,
<STRING, LEFT_SQUARE, STRING, RIGHT_SQUARE, RIGHT_SQUARE> may appears as a prefix path in all these alternatives.
To Resolve this, try one of of the following: 
...
    at Function.Parser.performSelfAnalysis
...

real	1m16.449s
user	1m14.429s
sys	0m0.583s

With the working rule:

real	0m0.289s
user	0m0.246s
sys	0m0.042s

The more rules I add, the more time it takes.


bd82 commented May 20, 2017

> After preparing the test case, I've discovered:
> There is no infinite loop and the conflict is detected, but performSelfAnalysis() takes a very long time.

This makes more sense.
The logic that looks for ambiguities and computes the lookahead functions is a sort of grammar interpreter,
which is not very fast (it only needs to run once).

But combined with ambiguities and certain grammars, its runtime can start growing exponentially.

A user can work around this by reducing the maximum lookahead number (part of the parser config options), but it is best if this is solved or mitigated by the framework.
Perhaps I can improve its runtime performance by saving intermediate results.
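
The "saving intermediate results" idea can be illustrated with a toy version of such a grammar interpreter. This is only a sketch under assumed representations: the TERM/SEQ/ALT node shapes and the prefixes function are made up for illustration, not Chevrotain internals. The set of k-token prefixes reachable from a node is memoized per (node, k), so a sub-rule shared by several alternatives is interpreted only once.

```javascript
// Toy grammar nodes; purely illustrative, not Chevrotain's representation.
const TERM = (name) => ({ kind: "term", name });
const SEQ = (...items) => ({ kind: "seq", items });
const ALT = (...alts) => ({ kind: "alt", alts });

// All token paths of length up to `k` reachable from `node`, with results
// memoized per (node, k) so shared sub-rules are only walked once.
function prefixes(node, k, memo = new Map()) {
  if (k <= 0) return [[]];
  let perNode = memo.get(node);
  if (perNode === undefined) memo.set(node, (perNode = new Map()));
  if (perNode.has(k)) return perNode.get(k);

  let result;
  if (node.kind === "term") {
    result = [[node.name]];
  } else if (node.kind === "alt") {
    result = node.alts.flatMap((a) => prefixes(a, k, memo));
  } else {
    const [first, ...rest] = node.items;
    const heads = prefixes(first, k, memo);
    result = rest.length === 0
      ? heads
      : heads.flatMap((h) =>
          h.length >= k
            ? [h]
            : prefixes(SEQ(...rest), k - h.length, memo).map((t) => h.concat(t))
        );
  }
  perNode.set(k, result);
  return result;
}

// Two alternatives sharing an Operand-like prefix, as in the Array rule:
const ID = TERM("Identifier");
const range = SEQ(ID, TERM(".."), ID);
const exprList = SEQ(ID, TERM(","), ID);
console.log(JSON.stringify(prefixes(range, 1)));    // [["Identifier"]]
console.log(JSON.stringify(prefixes(exprList, 1))); // [["Identifier"]] -> ambiguous at k=1
console.log(JSON.stringify(prefixes(range, 2)));    // [["Identifier",".."]] -> resolved at k=2
```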

I will still need a grammar that reproduces the slowness.

@daiyam

  • Is your parser publicly available?
  • Can you create a branch in your repo which reproduces the slowness issue?


daiyam commented May 20, 2017

I've published the test case at https://github.com/daiyam/issues/tree/master/chevrotain/472

I have an old PC, so on a modern PC the slow performSelfAnalysis() should take under 20s 😉


bd82 commented May 20, 2017

Ok great.

I can reproduce it using your example Repo.
Takes about 40 seconds on my 2015 MBP.

There seem to be a great many ambiguity errors; perhaps another approach
to reduce the runtime would be to set a maximum number of errors and stop looking
after that number has been reached.


daiyam commented May 20, 2017

Just by commenting out lines 123-146 and uncommenting lines 147-185, there is no ambiguity.

I hadn't looked at the number of errors.
8437 errors is a little too much 😏

10-20 errors should be good enough.


bd82 commented May 20, 2017

Right, I've debugged this a bit.

It is more complicated than I initially assumed.

  • The performance bottleneck is here, during the checking of ambiguous alternatives.
  • While I could halt on some max-errors setting, a prior step that runs before any error is found (and thus can never halt at that stage) takes about 50% of the time.

I would have to refactor the whole ambiguity checking logic (if possible) to enable halting
after a maximum number of errors. So this needs further investigation.

Another thing I noticed is that when I set the "maxLookahead" option on the parser to 4 (by default it is 5), instead of waiting 40 seconds for the parser initialization error, it only takes 800ms.
So another approach may be to reduce the default maxLookahead to 4,
as the performance grows exponentially in relation to the max lookahead.
This may be a breaking change, but only for a few grammars (most grammars do not require K=5), and it will provide a clear error message.
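
The exponential relationship between maxLookahead and work can be seen from a back-of-the-envelope count: if each position in the lookahead can be continued by roughly b alternative tokens, there are on the order of b^k candidate paths of length k per alternative. A tiny illustration (the branching factor is made up):

```javascript
// Number of candidate k-token lookahead paths when each position has
// `branching` possible continuations: branching^k, exponential in k.
function pathCount(branching, k) {
  let total = 1;
  for (let i = 0; i < k; i++) total *= branching;
  return total;
}

console.log(pathCount(6, 4)); // 1296
console.log(pathCount(6, 5)); // 7776 -> one extra token of lookahead, 6x the work
```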


daiyam commented May 21, 2017

If maxLookahead=4 can avoid what looks like an infinite loop, I think it's good for the parser.
And since we can configure maxLookahead, changing the default shouldn't be too much of a problem.


bd82 commented May 22, 2017

> If maxLookahead=4 can avoid what looks like an infinite loop, I think it's good for the parser.
> And since we can configure maxLookahead, changing the default shouldn't be too much of a problem.

Yes. I may even change it to 3 by default.

But I still want to free up some time for a deeper look into the ambiguity detection algorithm.
Instead of computing all the possible paths of length K for each alternative
and only then checking whether any pairs cause an ambiguity,
an alternative approach would be to choose an alternative
and then walk its "paths" "in parallel" with the other alternatives until an **un**ambiguous path
is found or we "run out" of lookahead.

That way it would be very fast in the common case of no ambiguities,
and it would be easy to halt after 10-20 errors have been found in the case of thousands of ambiguous paths.
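
A rough sketch of what halting early might look like: compare the lookahead paths of two alternatives and stop reporting once a cap is hit. The function and parameter names here are invented for illustration; this is not Chevrotain's actual checker.

```javascript
// Report k-token prefixes that appear in both alternatives' path sets,
// halting as soon as `maxErrors` ambiguities have been collected.
function findAmbiguities(pathsA, pathsB, k, maxErrors = 20) {
  const key = (p) => p.slice(0, k).join(" ");
  const prefixesOfA = new Set(pathsA.map(key));
  const ambiguities = [];
  for (const path of pathsB) {
    if (prefixesOfA.has(key(path))) {
      ambiguities.push(path.slice(0, k));
      if (ambiguities.length >= maxErrors) break; // halt early
    }
  }
  return ambiguities;
}

const a = [["ID", "..", "ID"]];
const b = [["ID", "..", "NUM"], ["ID", ",", "ID"]];
console.log(findAmbiguities(a, b, 2)); // one shared 2-token prefix: ID '..'
console.log(findAmbiguities(a, b, 3)); // empty: 3 tokens of lookahead disambiguate
```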


bd82 commented Jun 11, 2017

I've played around with this idea.
I am afraid it will only mitigate the problem, not resolve it.
Because to prove that a path has no ambiguity with another alternative, in the worst case all other paths
must be examined.

My plan is:

  • Change the default maxLookahead to 4; this should suffice for 99% of grammars.

  • Document the possibility of very slow initialization when using larger values with an ambiguous grammar.

  • Add a backlog item to #370 (Development Mode) to trace the initialization phases, so
    it would be clear to the end user in which phase the parser got "stuck".

@bd82 bd82 changed the title Backtracking & Infinity Loop Very slow detection of ambiguities Jun 11, 2017
@bd82 bd82 added the Bug 🪲 label Jun 11, 2017
@bd82 bd82 changed the title Very slow detection of ambiguities Very slow detection of ambiguities. Jun 11, 2017

bd82 commented Jun 11, 2017

To re-enable the old behavior of five tokens of lookahead, pass the optional maxLookahead argument
during parser construction.

class MyParser extends Parser {

  constructor(input) {
    // explicit passing of the maxLookahead argument with the old default value.
    super(input, ALL_TOKENS, {maxLookahead: 5})

    // grammar rules...
    Parser.performSelfAnalysis(this)
  }
}

bd82 added a commit that referenced this issue Jun 11, 2017
bd82 added a commit that referenced this issue Jun 11, 2017
@bd82 bd82 closed this as completed in #490 Jun 11, 2017

bd82 commented Jun 13, 2017

Decided to change the maxLookahead to 4 instead of 3.

There is a common pattern of a lambda function versus a parenthesized expression that requires 4 tokens
of lookahead to distinguish.

// lambda expression: need to read ahead to find the fat arrow ("=>")
(x) => x
// parenthesized expression: identical prefix for the first 3 tokens of the lambda expression
(x)
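
The required lookahead for that pair is just the position of the first differing token plus one, which a quick sketch can confirm (the token streams are hand-written here for illustration):

```javascript
// Minimum lookahead depth needed to distinguish two token streams:
// the index of the first position where they differ, plus one.
function neededLookahead(tokensA, tokensB) {
  const len = Math.max(tokensA.length, tokensB.length);
  for (let i = 0; i < len; i++) {
    if (tokensA[i] !== tokensB[i]) return i + 1;
  }
  return Infinity; // identical streams can never be distinguished
}

const lambda = ["(", "Identifier", ")", "=>", "Identifier"];
const paren  = ["(", "Identifier", ")"];
console.log(neededLookahead(lambda, paren)); // 4 -> hence the default maxLookahead of 4
```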

bd82 pushed a commit that referenced this issue Jun 13, 2017