Opening this discussion after the December community meeting. This regards future efforts to either unify the various TLA+ parsers, or not. I try to give a fair appraisal of the options but I do have my preferences, since I will probably be the one doing this work!
The Problem
Currently there are two main parsers used by TLA+ language tooling: SANY, the Java parser used by TLC and most other tools, and TLAPM's own parser, written in OCaml. We are faced with three possible choices:
1. Keep things as they are and continue investing features & fixes into both parsers, supported by development of an implementation-independent suite of TLA+ parser tests similar to the existing syntax corpus; the largest remaining feature here is writing a level checker for TLAPM (basically a simple type checker).
2. Discard TLAPM's parser and switch TLAPM to use SANY, so as to consolidate developer time. To avoid foreign function interface (FFI) weirdness between OCaml and Java, TLAPM would shell out to start a Java VM instance, run SANY, consume the parse output from SANY's XML interface, parse the XML into equivalent data structures in OCaml, and translate those data structures into TLAPM's existing internal parse tree format (see the sketch after this list).
3. Adopt a greenfield approach of writing a third TLA+ parser from scratch, probably in Rust, then transition both TLC and TLAPM to use this new parser. This would have to be motivated by substantial new capabilities, discussed below.
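For concreteness, here is a minimal OCaml sketch of what option 2's shell-out step might look like. Everything specific in it - the SANY entry point class name, the jar name, the helper's name - is a placeholder for illustration, not a description of an actual interface:

```ocaml
(* Sketch of option 2's pipeline: shell out to a JVM running SANY and
   capture its XML output. Requires OCaml's unix library. The class
   name [tla2sany.xml.XMLExporter] and the jar name are placeholders
   standing in for whatever SANY's XML export actually exposes. *)
let run_sany_xml (tla_file : string) : string =
  let cmd =
    Printf.sprintf "java -cp tla2tools.jar tla2sany.xml.XMLExporter %s"
      (Filename.quote tla_file)
  in
  let ic = Unix.open_process_in cmd in
  (* Read everything SANY prints to stdout. *)
  let rec read_all acc =
    match input_line ic with
    | line -> read_all (line :: acc)
    | exception End_of_file -> String.concat "\n" (List.rev acc)
  in
  let xml = read_all [] in
  match Unix.close_process_in ic with
  | Unix.WEXITED 0 -> xml
  | _ -> failwith ("SANY failed on " ^ tla_file)

(* The XML would then be parsed in OCaml (e.g. with an XML library)
   and translated node-by-node into TLAPM's existing parse tree
   types - the hard, invariant-preserving part discussed below. *)
```

The shell-out itself is the easy part; as discussed below, the translation step at the end is where the real difficulty lies.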
Comparison of SANY and TLAPM parsers
Of the two, SANY is more fully-featured. While it does contain bugs, it accepts all important TLA+ syntax and implements proper semantic analysis and level-checking. In contrast, TLAPM's parser does not accept some TLA+ syntax and does not perform level-checking; also, it is the author's understanding that instead of implementing semantic analysis as a graph of references overlaid on the parse tree, TLAPM simply rewrites/inlines all references to form larger expressions. While this rewriting approach does fit with TLAPM's purpose of rewriting TLA+ syntax into proof obligations in other languages, it has drawbacks; for example, it does not support RECURSIVE definitions, since such an expansion would have no end.
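To make the termination problem concrete, here is a toy OCaml sketch of rewrite-based resolution. The types and names are invented for illustration and are far simpler than TLAPM's actual representation:

```ocaml
(* A toy expression type: references to named definitions, literals,
   and one binary operator. *)
type expr =
  | Ref of string          (* a reference to a defined operator *)
  | Lit of int
  | Plus of expr * expr

(* Naive rewrite-based resolution: replace every reference with the
   body it names, recursively. *)
let rec inline (defs : (string * expr) list) (e : expr) : expr =
  match e with
  | Ref name ->
      (* If [name]'s body mentions [name] itself, as a RECURSIVE
         definition does, this call never bottoms out. *)
      inline defs (List.assoc name defs)
  | Lit n -> Lit n
  | Plus (a, b) -> Plus (inline defs a, inline defs b)
```

A graph-of-references representation sidesteps this by leaving the reference in place and resolving it lazily, which is part of why that approach handles RECURSIVE without trouble.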
It isn't all downside for the TLAPM parser though, and this is largely due to OCaml's language features. OCaml can express variants, where a value is one of several possible cases, each with its own named & typed fields; it is then possible to exhaustively pattern-match over the value in a match expression (many programmers will have been introduced to this concept through Rust). This is especially useful in parsers, where a given syntax tree node is one of several possible kinds and we want to handle each kind differently. Java, in contrast, does not have this ability, so consuming the parse tree usually requires non-exhaustively switching on an enum and then casting a value to a subclass. This should not be discounted as mere programming language preference; in practical terms, consuming and manipulating the parse tree emitted by SANY is very difficult and requires a lot of implicit & undocumented knowledge about the order in which elements are supposed to occur (or not) in variable-length arrays. Usually the only way to tell is to run a debugger and see what the parser spits out.
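To illustrate the difference, here is a minimal OCaml sketch of a variant type for syntax tree nodes and an exhaustive match over it. The constructor names are invented for illustration and are not TLAPM's actual definitions:

```ocaml
(* A hypothetical fragment of a syntax tree as an OCaml variant:
   each constructor carries its own typed payload. *)
type quant = Forall | Exists

type expr =
  | Ident of string                            (* x *)
  | Apply of string * expr list                (* F(a, b) *)
  | Quantified of quant * string list * expr   (* \A x, y : P *)

(* The compiler checks this match is exhaustive: add a constructor
   to [expr] and every match like this one is flagged at compile
   time until the new case is handled. *)
let rec free_idents (e : expr) : string list =
  match e with
  | Ident name -> [name]
  | Apply (_, args) -> List.concat_map free_idents args
  | Quantified (_, bound, body) ->
      List.filter (fun x -> not (List.mem x bound)) (free_idents body)
```

In Java the equivalent is typically a node superclass plus a kind field; nothing forces a visitor to handle every kind, which is exactly the failure mode described above.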
Benefits & drawbacks of switching TLAPM to use SANY
The primary benefit of making this switch would be reduced development burden. While parsers are a rare breed of software project where the requirements are so well understood that they are actually possible to complete, there is a very large amount of work remaining to get both parsers to that point. This work is made more difficult by the necessity for multiple developers to understand the parsers in great depth, so as to review each other's work. While this can be worked around by developing very large test suites - reducing reviewer trepidation when approving changes in unfamiliar parts of the codebase - it remains the case that developers don't like to sign off on changes they don't understand, so PRs tend to languish for a month or more. It is the author's experience that iteration time is the single greatest factor affecting overall development time, and a rate of 1-3 changes/month can reasonably be expected to stretch a substantial project to five years or more; as a rough illustration, a project requiring on the order of a hundred reviewed changes would take roughly three to eight years at that rate.
It is the author's opinion that the primary difficulty in transitioning TLAPM to use SANY's parse output is that no developer active with the project has full living knowledge of how both parsers represent the semantic parse tree, and this project requires us to write a translator from one semantic parse tree to another while preserving a large number of unknown invariants. Thus this project cannot be attacked directly; instead, various plausibly-valuable sub-projects must be worked on to cultivate a theory of how both these codebases work in some developer's head. Thankfully, one such sub-project presents itself: development of a large semantic test corpus, similar to the existing syntactic test corpus. This corpus has obvious value on its own and will also be useful in testing a SANY-to-TLAPM parse tree translator.
One illustrative example of the theory shortfall - and a clear unknown - is whether TLAPM's semantic parse tree is conceptually equivalent to SANY's. As mentioned above, TLAPM takes a rewrite-based approach to semantic resolution. Does this present an obstacle to translating from SANY's graph-of-references overlay approach? Will the TLAPM parser itself need additional development before the translation can even work? I don't know.
One final drawback to switching TLAPM to use SANY is the overall complexity of the proposed pipeline. While TLAPM is no stranger to calling out to other programs and parsing their output - this is the approach it takes with all its backend theorem provers - replacing an in-codebase, in-language parser with a "run SANY/parse XML/translate data structures" pipeline does seem to add fragility, as well as an additional dependency on the Java runtime (although users of TLAPM are almost certainly already users of Java-based TLC). It also lengthens the parse loop. While it is not possible to appeal to objective criteria here, it is the author's opinion that programmers prize snappy language tooling, and waiting on the order of an entire second to parse a few hundred lines of TLA+ should not be viewed as acceptable. This is especially true in the new(ish) domain of language servers, where live updates are expected as users type in their editor. TLA+ is a somewhat niche, research-adjacent language, so we can get away with some long processing times, but I do think we should aim higher here.
Writing a new parser?
I don't put this forward as a very serious proposal at this point. Perhaps in half a decade we will have reached the real limit of what we can do with the existing parser(s) and will want to look into this. I took a brief stab last year at writing a new parser in Rust but was ultimately unhappy with the idea of just writing another recursive descent parser, and it fizzled out. I'm currently working through a university course on formal languages to hopefully characterize exactly what power/variant of automaton is needed to parse TLA+ and what efficient algorithms exist for doing so. Who knows, maybe a TLA+ parser written in TLA+ will come into existence, if only as part of the language standard!
Any new parser would have to be motivated by a substantial new capability that the existing parsers cannot be modified to possess. Something like structured editing and incremental parsing, with error recovery enhanced by diffing edits against prior successfully-parsed trees. Complicated stuff like that. Although really, I think the killer feature would be rewriting the finite model checker (TLC) to execute compiled bytecode for a huge throughput boost, which might be better done in a language other than Java. All speculative at this point.
Conclusion
This was long, but so was our discussion in the community meeting. Please leave any comments or feedback below. Hopefully we can all read this thread and thereby shorten our discussion at the next community meeting, or use it to establish a parser working group.
Author of multiple "non-major" TLA+ parsers here... most notably PGo, but by now it's almost my hobby to try and write more of them as parser tooling stress tests. Thanks for the write-up, especially given I did not make it to the aforementioned discussion.
I actually didn't know the SANY XML exporter existed. That's interesting. Regardless of parser implementation standardization, I think that a common representation of how a spec is "understood" is valuable for experimenting and qualifying what is or is not a correct parse. I should go learn more about this and incorporate it into my testing process.
That is to say, I think any effort in making parsing more consistent and better defined will also benefit TLA+ tooling that is neither SANY nor TLAPM. 👀