-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New and improved HOL parser #1333
Conversation
I think I disagree about having comments in inner syntax being allowed to affect what’s going on at the "outer" level. Certainly, if one sees nested logical material as glorified forms of string literal, then having this happen would be akin to having the following C code work:
The other issue, which I think is also present in the existing code, is that we should allow the |
As per discussion on the Discord, I'm super-happy with the handling of caret. (The thinking behind the earlier weirdness is that we could use |
Well no, because C code doesn't have comments in strings. And neither does HOL in its string literals. Quotes are not a kind of string literal, they are a full-fledged piece of syntax within which you can refer to HOL objects instead of ML objects. It's really not possible to treat them like string literals because antiquotation lets you fit arbitrary ML syntax inside them. To me, it's more like the following code: char *c = { foo /* { bar bar aux */ }; (This intuition is much stronger for If there was actually a phase distinction between the inner and outer syntax, this might make sense, but the reason QuoteFilter has to parse inner comments in the first place is because there is no such phase distinction, because inner comments can comment out antiquotations. For |
If I write
this is surely an error in 99.9999% of all cases. I don't really care what rationalisation is required to achieve it. |
Doesn't the same reasoning apply to
? But also, I'm not sure I agree with that in the sense that one not unreasonable thing to do with comments is block out an entire declaration, and I feel like having "strong terminators" breaks the ability to do nested comments and the like. |
8b59927
to
53eba05
Compare
Your example should clearly be an error also, yes. Nested comments don't require the ability to span levels with comments. Viz:
I find it hard to imagine a scenario where a user would really want to escape the nesting structure of the text with their comments. It also forces us to have the same comment syntax at the HOL level as at the SML level, whereas now it’s just a “happy accident”. Fundamentally, the language of HOL is completely separate from that of SML (it's in the name: "meta-language") and comments should respect that. (Anti-quotation is a cute and useful way to inject from meta to object; having comments in the object syntax affect the meta is gross.) |
Note that the issue I hinted at already occurs in the wild: I had to add some additional hacks to support this definition
because the backquote inside a HOL comment was interpreted as ending the quote. This is the kind of thing I'm concerned about. |
Here's my latest thinking on what I want. We define an obvious grammar for *Script.sml files, where we extend SML with new top-level declaration forms with rules like
We also have
here {ident} and {HOLmaterial} are lexical forms, definable with regular expressions. The angle brackets denote non-terminals (and The regular expression defining {HOLmaterial} allows anything except occurrences of "quote-keywords", where quote-keywords are "^Theorem", "^Proof" ... plus the ASCII backquotes and the Unicode quotes. Now in the absence of anti-quotations this is all we need and is capable of finding and dealing with all static SML issues. In the presence of anti-quoted SML code in the {HOLmaterial} we can get static SML errors and of course, the translation of this file into SML text has to also know about/handle the antiquotes. So after parsing according to the above, the HOLmaterial has to be separately lexed in a simple-minded way one more time. This is scanning for This second lexing process doesn't need to inform the HOL parsing stack, but might flag errors (malformed string literals, unterminated comments). Equally, those errors can be left to be caught by the HOL parser. |
Thanks for your work on this! |
This is effectively a complete rewrite of the
QuoteFilter
module. The intention is to support additional introspection features on the parsing process, but for this initial PR the main goal is to just replicate the behavior of the original parser. A summary of user-facing features:QuoteFilter
is replaced byHolLex
andQFRead
byHolParser
. The interface ofQuoteFilter
->HolLex
has changed somewhat, butHolParser
is basically a drop in replacement forQFRead
.unquote
command supports the old parser usingunquote -q
. In particular, the new parser does not (currently?) support the--quotefix
option sounquote
will use the old parser for that.unquote
function also makes use of the new parser's ability to report (non-fatal! with recovery!) parse errors, so you can run this command over all your.sml
files to spot any invalid inputs that were working by accident before. Two examples of this were found in this repository:borelScript.sml
has some cases of mismatched quotes i.e.‘term`, `term’
ttt_demoScript.sml
has an unmatched parenthesis, so presumably this file is not being checked?There are some deliberate (/ known and pending discussion) parser changes:
Comments in quotations are no longer auto-terminated by the next close quote orremovedEnd
, and will in fact comment out the end quote. I think this makes more sense in a world where HOL is one unified language with comments allowed anywhere with no special status given toEnd
.^^
has been removed.^x
and^(expr)
start antiquotations inside a quote, anything else involving^
is simply embedded as regular text in the quotation. So^^
is copied verbatim into the underlying string.A summary of design changes:
QuoteFilter
which does parsing at the same time as stringification into SML, the two steps have been separated.HolLex
still does the lexing and parsing, but it targets an AST structure instead of producing strings. As a result it is significantly simpler, needing to carry less state around. It makes a lot of use ofcontinue()
to be able to parse commands to completion, which avoids needing one giant state machine as in the previous implementation. It is still a streaming parser at the top level.HolParser
, which contains a streaming implementation of the stringification to SML code.HolLex
no longer tracks line numbers, soHolParser.ToSML
does that. (Part of the motivation for this is becausequotefix
was previously implemented using if statements around everything, and it will be easier to implement here as a separate pass which does a different kind of stringification.)HolParser.Simple
, you get a stream of ASTTopDecl
elements. These are all marked up with positions in the input, so you need the original source on hand to cross reference with it.HolParser.ToSML
, you get a stream of marked text regions:regular
is for input text copied verbatimaux
is for interstitial raw text to call internal functions likeQ.prove
strcode
is for input text which should be escaped to fit in an ML stringstrstr
is for input text which is already in an ML string context, but needs escaping on raw unicode charactersHolParser.ToSML
in combination withmkStrcode
which just encodes the four kinds of text above as appropriate, you get just a plain string out. All of the original entry points are using this now.mkDoubleReader
, which gives a way to have a streaming reader which is followed by a second reader (which can be conceptualized as doing "skip to index i, read to j >= i, skip to k >= j, ...") which outputs results yielded by the first reader, without buffering the entire contents. This is possibly overkill in this situation, and it does lead to some limitations (the marked-up text usingregular
etc has to come in the same order as in the original source), but at least you can read gigabyte files this way.