New and improved HOL parser #1333

digama0 · 2024-11-10T04:49:27Z

This is effectively a complete rewrite of the QuoteFilter module. The intention is to support additional introspection features on the parsing process, but for this initial PR the main goal is to just replicate the behavior of the original parser. A summary of user-facing features:

QuoteFilter is replaced by HolLex and QFRead by HolParser. The interface of QuoteFilter -> HolLex has changed somewhat, but HolParser is basically a drop in replacement for QFRead.
Most functions have been switched to only use the new parser, but the unquote command supports the old parser using unquote -q. In particular, the new parser does not (currently?) support the --quotefix option so unquote will use the old parser for that.
The unquote function also makes use of the new parser's ability to report (non-fatal! with recovery!) parse errors, so you can run this command over all your .sml files to spot any invalid inputs that were working by accident before. Two examples of this were found in this repository:
- borelScript.sml has some cases of mismatched quotes i.e. ‘term`, `term’
- ttt_demoScript.sml has an unmatched parenthesis, so presumably this file is not being checked?
- There are some more parse errors in CakeML, I will make a separate PR for that (fix parse errors CakeML/cakeml#1085)

There are some deliberate (/ known and pending discussion) parser changes:

Comments in quotations are no longer auto-terminated by the next close quote or End, and will in fact comment out the end quote. I think this makes more sense in a world where HOL is one unified language with comments allowed anywhere with no special status given to End. removed
The escape-like behavior of ^^ has been removed. ^x and ^(expr) start antiquotations inside a quote, anything else involving ^ is simply embedded as regular text in the quotation. So ^^ is copied verbatim into the underlying string.

A summary of design changes:

Rather than having QuoteFilter which does parsing at the same time as stringification into SML, the two steps have been separated. HolLex still does the lexing and parsing, but it targets an AST structure instead of producing strings. As a result it is significantly simpler, needing to carry less state around. It makes a lot of use of continue() to be able to parse commands to completion, which avoids needing one giant state machine as in the previous implementation. It is still a streaming parser at the top level.
The rest of the implementation is in HolParser, which contains a streaming implementation of the stringification to SML code. HolLex no longer tracks line numbers, so HolParser.ToSML does that. (Part of the motivation for this is because quotefix was previously implemented using if statements around everything, and it will be easier to implement here as a separate pass which does a different kind of stringification.)
There are currently three possible entry points to the parser:
- Using HolParser.Simple, you get a stream of AST TopDecl elements. These are all marked up with positions in the input, so you need the original source on hand to cross reference with it.
- Using HolParser.ToSML, you get a stream of marked text regions:
  - regular is for input text copied verbatim
  - aux is for interstitial raw text to call internal functions like Q.prove
  - strcode is for input text which should be escaped to fit in an ML string
  - strstr is for input text which is already in an ML string context, but needs escaping on raw unicode characters
- Using HolParser.ToSML in combination with mkStrcode which just encodes the four kinds of text above as appropriate, you get just a plain string out. All of the original entry points are using this now.
One additional semi-reusable component here is mkDoubleReader, which gives a way to have a streaming reader which is followed by a second reader (which can be conceptualized as doing "skip to index i, read to j >= i, skip to k >= j, ...") which outputs results yielded by the first reader, without buffering the entire contents. This is possibly overkill in this situation, and it does lead to some limitations (the marked-up text using regular etc has to come in the same order as in the original source), but at least you can read gigabyte files this way.

mn200 · 2024-11-11T00:07:51Z

I think I disagree about having comments in inner syntax being allowed to affect what’s going on at the "outer" level. Certainly, if one sees nested logical material as glorified forms of string literal, then having this happen would be akin to having the following C code work:

char *c = "foo /* "bar bar aux */";

The other issue, which I think is also present in the existing code, is that we should allow the Quote form to have almost entirely arbitrary material inside it. That then means that the Quote form of special string literal shouldn’t pre-commit to particular notions of what is/isn’t an inner string literal, nor what comments look like etc.

mn200 · 2024-11-11T00:14:31Z

As per discussion on the Discord, I'm super-happy with the handling of caret. (The thinking behind the earlier weirdness is that we could use ^ as a backslash-analogue, thereby giving us a way of putting back-quote (or the Unicode right-quotation marks) into quotations. I'm not quite sure what, if anything might be done for that.)

digama0 · 2024-11-11T02:18:41Z

I think I disagree about having comments in inner syntax being allowed to affect what’s going on at the "outer" level. Certainly, if one sees nested logical material as glorified forms of string literal, then having this happen would be akin to having the following C code work:

Well no, because C code doesn't have comments in strings. And neither does HOL in its string literals. Quotes are not a kind of string literal, they are a full-fledged piece of syntax within which you can refer to HOL objects instead of ML objects. It's really not possible to treat them like string literals because antiquotation lets you fit arbitrary ML syntax inside them. To me, it's more like the following code:

char *c = { foo /* { bar bar aux */ };

(This intuition is much stronger for Theorem, Definition, Quote etc compared to term quotations “foo” because the former look like block commands and the latter looks like a string literal. This is also reflected in the error handling - quotes like ” are "weak" and when used improperly will cause a parse error and be ignored, while End is a "strong" quote end and when used improperly will forcibly terminate other unfinished things. This kind of error handling is effectively what you are asking for with unclosed comments in a quote, but it would have to be a parse error regardless.)

If there was actually a phase distinction between the inner and outer syntax, this might make sense, but the reason QuoteFilter has to parse inner comments in the first place is because there is no such phase distinction, because inner comments can comment out antiquotations. For Quote, we could certainly turn off comments and antiquotations but I think that's not desirable?

mn200 · 2024-11-11T02:42:46Z

If I write

Theorem name:
   P x /\ (* q x ==> conclusion 
Proof
  tactic
QED

Theorem stmt2:
   Q x /\ *) P 
Proof
  tactic
QED

this is surely an error in 99.9999% of all cases.

I don't really care what rationalisation is required to achieve it.

digama0 · 2024-11-11T02:44:41Z

Doesn't the same reasoning apply to

Theorem name:
   P x /\ ^( q x ==> conclusion 
Proof
  tactic
QED

?

But also, I'm not sure I agree with that in the sense that one not unreasonable thing to do with comments is block out an entire declaration, and I feel like having "strong terminators" breaks the ability to do nested comments and the like.

mn200 · 2024-11-11T22:54:17Z

Doesn't the same reasoning apply to
Theorem name:
   P x /\ ^( q x ==> conclusion 
Proof
  tactic
QED
?

But also, I'm not sure I agree with that in the sense that one not unreasonable thing to do with comments is block out an entire declaration, and I feel like having "strong terminators" breaks the ability to do nested comments and the like.

Your example should clearly be an error also, yes.

Nested comments don't require the ability to span levels with comments. Viz:

(* 
Theorem foo:
   p (* /\ q *) ==> r
Proof
  tactic >> (* badtactic (* really_badtac *) >> might_work_tac *)
  cheat
   (* give up and don't even bother with a QED *)
*)

I find it hard to imagine a scenario where a user would really want to escape the nesting structure of the text with their comments. It also forces us to have the same comment syntax at the HOL level as at the SML level, whereas now it’s just a “happy accident”. Fundamentally, the language of HOL is completely separate from that of SML (it's in the name: "meta-language") and comments should respect that. (Anti-quotation is a cute and useful way to inject from meta to object; having comments in the object syntax affect the meta is gross.)

digama0 · 2024-11-12T07:26:22Z

Note that the issue I hinted at already occurs in the wild: I had to add some additional hacks to support this definition

Definition RN_deriv_def : (* or `v / m` (dv/dm) *)
    RN_deriv v m =
      @f. f IN measurable (m_space m,measurable_sets m) Borel /\
          (!x. x IN m_space m ==> 0 <= f x) /\
          !s. s IN measurable_sets m ==> ((f * m) s = v s)
End

because the backquote inside a HOL comment was interpreted as ending the quote. This is the kind of thing I'm concerned about.

mn200 · 2024-11-14T23:01:04Z

Here's my latest thinking on what I want. We define an obvious grammar for *Script.sml files, where we extend SML with new top-level declaration forms with rules like

<decl> ::= "^Theorem" {ident} <attributes>? ":" {HOLmaterial} "^Proof" <tactic> "^QED" | ...

We also have

<expr> ::= ...
               | “{HOLmaterial}”
               | ‘{HOLmaterial}’
               | ``{HOLmaterial}``
               | ...

here {ident} and {HOLmaterial} are lexical forms, definable with regular expressions. The angle brackets denote non-terminals (and <tactic> is probably just SML ).

The regular expression defining {HOLmaterial} allows anything except occurrences of "quote-keywords", where quote-keywords are "^Theorem", "^Proof" ... plus the ASCII backquotes and the Unicode quotes.

Now in the absence of anti-quotations this is all we need and is capable of finding and dealing with all static SML issues. In the presence of anti-quoted SML code in the {HOLmaterial} we can get static SML errors and of course, the translation of this file into SML text has to also know about/handle the antiquotes. So after parsing according to the above, the HOLmaterial has to be separately lexed in a simple-minded way one more time. This is scanning for ^ident (and maybe ^(...)), which task has to know about HOL's notion of string literal and comment.

This second lexing process doesn't need to inform the HOL parsing stack, but might flag errors (malformed string literals, unterminated comments). Equally, those errors can be left to be caught by the HOL parser.

mn200 · 2024-11-22T02:52:25Z

Thanks for your work on this!

digama0 added 2 commits November 10, 2024 05:21

New and improved HOL parser

3231849

cleanup

b54197d

digama0 mentioned this pull request Nov 10, 2024

fix parse errors CakeML/cakeml#1085

Merged

fix

2a91d0c

digama0 force-pushed the holparser branch from 72eb40e to 2a91d0c Compare November 10, 2024 05:07

mosml fix

ce2519c

digama0 force-pushed the holparser branch from a025e95 to ce2519c Compare November 10, 2024 05:15

digama0 added 10 commits November 10, 2024 07:18

update test

465d3a6

attempt to fix mosml

7fa517e

fix 2

657a45e

fix 3

acecef5

fix 4

daf9755

fix 5

7ea0f8b

bugfix

f2309d2

tweaks to chunking

23dce90

bugfix

430eebe

don't remove QFRead from the prelude yet

03a58d4

digama0 added 2 commits November 11, 2024 01:24

bugfix

c8dd4fe

Merge branch 'develop' into holparser

5b303c3

digama0 force-pushed the holparser branch from a0e38f4 to 5b303c3 Compare November 11, 2024 01:45

digama0 force-pushed the holparser branch from 7dd8d6a to 718f343 Compare November 11, 2024 03:59

close quotes terminate comments

c15d5e5

digama0 force-pushed the holparser branch 2 times, most recently from 8b59927 to 53eba05 Compare November 11, 2024 07:18

digama0 added 2 commits November 14, 2024 09:16

rename #(FNAME) to #(FILE)

63a4afc

Merge branch 'develop' into holparser

77be130

digama0 force-pushed the holparser branch from 470bd0f to 77be130 Compare November 14, 2024 08:16

digama0 added 2 commits November 14, 2024 14:46

tweak positions for bindings

8abc7c0

expose some more internals

568acd0

digama0 force-pushed the holparser branch from 1d8ef02 to 568acd0 Compare November 15, 2024 14:46

mn200 merged commit 468671a into HOL-Theorem-Prover:develop Nov 22, 2024
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New and improved HOL parser #1333

New and improved HOL parser #1333

digama0 commented Nov 10, 2024 •

edited

Loading

mn200 commented Nov 11, 2024 •

edited

Loading

mn200 commented Nov 11, 2024

digama0 commented Nov 11, 2024 •

edited

Loading

mn200 commented Nov 11, 2024

digama0 commented Nov 11, 2024 •

edited

Loading

mn200 commented Nov 11, 2024

digama0 commented Nov 12, 2024

mn200 commented Nov 14, 2024 •

edited

Loading

mn200 commented Nov 22, 2024

New and improved HOL parser #1333

New and improved HOL parser #1333

Conversation

digama0 commented Nov 10, 2024 • edited Loading

mn200 commented Nov 11, 2024 • edited Loading

mn200 commented Nov 11, 2024

digama0 commented Nov 11, 2024 • edited Loading

mn200 commented Nov 11, 2024

digama0 commented Nov 11, 2024 • edited Loading

mn200 commented Nov 11, 2024

digama0 commented Nov 12, 2024

mn200 commented Nov 14, 2024 • edited Loading

mn200 commented Nov 22, 2024

digama0 commented Nov 10, 2024 •

edited

Loading

mn200 commented Nov 11, 2024 •

edited

Loading

digama0 commented Nov 11, 2024 •

edited

Loading

digama0 commented Nov 11, 2024 •

edited

Loading

mn200 commented Nov 14, 2024 •

edited

Loading