Retrieving error correction tokens #455
Hello @stephe-ada-guru,

Really sorry about the time we took to answer. Somehow this got lost in the flux of internal & external issues we handle.

We don't (yet) have a lot of stored information wrt. error recovery, and there is no notion of inserted/deleted tokens, even though we do kind of insert and delete tokens.

"Deleted" tokens are stored in `ErrorDecl` nodes, one token per node, so those should be pretty easy for you to recover. OTOH, we don't keep track of "inserted" tokens directly, and we don't actually insert any tokens; instead we presume that the tokens were here even if they're not (sort of).

Now, I'm wondering if that would be a worthwhile addition to our parser, because it might make things easier to use. But that's a pretty big overhaul of the current API. We'll discuss this with the team and keep you updated!
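For example, something along these lines with the Python bindings should collect the deleted tokens (a rough, untested sketch; I'm assuming the node class is exposed as `lal.ErrorDecl` in Python):

```python
import libadalang as lal

ctx = lal.AnalysisContext()
unit = ctx.get_from_file("pkg_with_errors.adb")

# Every ErrorDecl holds one token that error recovery skipped ("deleted").
deleted = [n.text for n in unit.root.findall(lal.ErrorDecl)]
print(deleted)
```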
Raphaël AMIARD <[email protected]> writes:
> Hello @stephe-ada-guru,
> Really sorry about the time we took to answer. Somehow this got lost
> in the flux of internal & external issues we handle.
Ok.
I wrote code based on your current pretty-printer to output the token
sequence implied by a syntax tree. Tedious, but straightforward.
I'm now working on doing the same for the tree-sitter parser, which
claims to have good error recovery. First I have to port my Ada grammar
to their grammar file syntax, which is turning out to be more work than
I anticipated. It doesn't use BNF syntax; it's closer to your grammar
file syntax.
Once that's done, I'll get back to writing the paper that describes all
this. Writing Ada code is always more fun than writing LaTeX prose about
Ada code :), so it will take a while.
> We don't (yet) have a lot of stored information wrt. error recovery,
> and there is no notion of inserted/deleted tokens, even though we do
> kind of insert and delete tokens.
> "deleted" tokens are stored in `ErrorDecl` nodes, one token per node,
> so those should be pretty easy for you to recover.
> OTOH, we don't keep track of "inserted" tokens directly, and we don't
> actually insert any tokens, instead we presume that the tokens were
> here even if they're not (sort of).
> Now, I'm wondering if that would be a worthwhile addition to our
> parser, because it might make things easier to use. But that's a
> pretty big overhaul of the current API. We'll discuss this with the
> team and keep you updated!
Emacs ada-mode uses the insert/delete list to automatically correct any
errors in a parameter list before formatting it. However, I implemented
that when the formatting code was all in elisp; now it's in Ada, and
uses the syntax tree directly (it's an instance of "refactor"). So
that's actually not needed any more.
The error correction is also available on user request, but I've never
actually used it except when testing it. So I suspect this is not a very
useful feature.
The generalized LR parser does use the length of the insert/delete list
as a metric in deciding which parallel parser to terminate when two
parsers reach an identical state. I don't understand error correction in
a packrat parser, but there must be some sort of choice between possible
corrections; the insert/delete list could be useful there, though some
other measure of error severity could work as well.
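Schematically, the tie-breaking rule looks like this (illustrative pseudocode only, not the actual implementation):

```python
# Illustrative pseudocode: when two parallel parsers reach an identical
# state, keep the one whose error-correction edit list (inserted plus
# deleted tokens) is shorter.
def prune(parsers):
    best = {}
    for p in parsers:
        key = p.state                      # hypothetical state summary
        if key not in best or len(p.edits) < len(best[key].edits):
            best[key] = p
    return list(best.values())
```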
I do look at the error messages from the parser, to figure out why
indent or highlight is wrong, so good error messages are useful
(something my parser is _not_ very good at).
--
-- Stephe
I'm trying to compare the error correction in the libadalang parser with the parser in Emacs Ada mode (and other parsers). One way to do that is to retrieve the token list of the final parse, including "virtual tokens" inserted for error correction and excluding deleted tokens. Diffing that token list against the corresponding lists from the other parsers, and against the user-expected "correct" token list, gives a fairly objective measure of error-correction quality.
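The diff step itself is the easy part; a sketch, where `tokens_a` and `tokens_b` are plain lists of token strings:

```python
import difflib

def token_distance(tokens_a, tokens_b):
    """Count tokens inserted, deleted or replaced between two token lists."""
    matcher = difflib.SequenceMatcher(a=tokens_a, b=tokens_b)
    cost = 0
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            cost += max(i2 - i1, j2 - j1)
    return cost
```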
In doing this for libadalang, I first tried iterating over the unit's token stream.
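Roughly like this, using the Python bindings (a sketch; the exact calls may differ):

```python
import libadalang as lal

ctx = lal.AnalysisContext()
unit = ctx.get_from_file("bad_syntax.adb")

# One token per line: kind and text, skipping trivia (whitespace, comments).
for tok in unit.iter_tokens():
    if not tok.is_trivia:
        print(tok.kind, tok.text)
```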
This only gives the "real" tokens, i.e. the ones present in the source text.
The diagnostics give some hints about inserted and deleted tokens, but they are not explicit enough for this use.
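That is, something along these lines (sketch, same caveats as above):

```python
import libadalang as lal

unit = lal.AnalysisContext().get_from_file("bad_syntax.adb")

# The diagnostics say where recovery happened, not which tokens it
# inserted or deleted.
for diag in unit.diagnostics:
    print(f"{diag.sloc_range}: {diag.message}")
```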
I don't see any mention of something like virtual tokens in the libadalang specs.
One way to output the list I'm looking for would be to traverse the AST, outputting the tokens implied by the structure. This is a lot of work, although I suspect I can copy code from gnatpp that does mostly the same thing. I can't just use gnatpp; it refuses to output anything when there are syntax errors in the source.
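A crude sketch of what that traversal might look like; the implied-token table is hypothetical, and the real work is filling it in and emitting the keywords and punctuation that sit between child nodes:

```python
# Hand-rolled unparsing sketch: emit the text of leaf nodes, and fall back
# to a hand-written table of tokens implied by each node kind when a node
# has no source text (e.g. it only exists because of error recovery).
IMPLIED_TOKENS = {"ErrorDecl": []}  # hypothetical kind -> token list table

def emit_tokens(node, out):
    if node is None:                 # absent optional fields show up as None
        return
    if node.children:
        for child in node.children:
            emit_tokens(child, out)
    elif node.text:
        out.append(node.text)
    else:
        out.extend(IMPLIED_TOKENS.get(type(node).__name__, []))
```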
An LSP language server should be able to provide a source edit script for each detected syntax error; if that functionality is in libadalang or ada_language_server somewhere, I could use that.
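For illustration, the sort of thing I mean, expressed as an LSP `WorkspaceEdit`/`TextEdit` (field names from the LSP spec; the values are made up):

```python
# Hypothetical LSP-style edit for one syntax error: insert a missing ";"
# at line 10, column 8 (LSP positions are 0-based).
edit = {
    "changes": {
        "file:///home/user/foo.adb": [
            {
                "range": {
                    "start": {"line": 9, "character": 7},
                    "end": {"line": 9, "character": 7},
                },
                "newText": ";",
            }
        ]
    }
}
```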
Is there another way?
NOTE: edited by @raph-amiard for style corrections