Define verbatim sigil strings as truly verbatim #55
If you go ahead with the delimiter change for Erlang, I will open up a discussion to align Elixir with Erlang here and deprecate our escaping of closing delimiters. However, in Elixir, since sigils are user-defined, we probably shouldn't call them verbatim (although it would be nice to align on the naming as well). Also, I am not sure if we should have verbatim regular expressions ( Finally, I am a bit worried about introducing |
Probably deserves its own EEP to define best practices around using non-ascii characters in the language and documenting them. That said, I'm all for « and → and a few others as that's what I've configured my editor to replace << and -> with when displaying files. Having « as a sigil delimiter, which will likely be fairly rare for quite some time, seems like a good way to introduce them. |
Please drop the The guillemets can also be confusing given some languages use them in reversed order for quoting Regarding regular expressions, if "verbatim" here means that backslashes are passed verbatim to the PCRE compiler, then that's what you always want. Nobody wants to write |
They look very different to me. There are characters that can indeed be confused with others but I don't think these are in that category. I also assume the documentation would provide the Unicode character name and numbers.
The language doesn't have to enforce that
I think the difference will come when/if interpolation gets introduced. Which hopefully it won't. |
Agreed no one wants double escapes but I would say that if ~R means “passed verbatim to PCRE” one could say ~B means “passed verbatim to the character escaping” which converts \n to new lines. In other words, saying that the contents are verbatim to some processor is confusing, because it means any implementation can behave differently. Verbatim should be verbatim and that \d should literally match \d and not a digit. So we agree on no double escaping but I am arguing it should be ~r. :) |
Verbatim? From the point of view of this EEP, it is trying to define what a sigil is and how it behaves, without knowing exactly about future sigil backends such as regular expressions. From that point of view it is natural to call a sigil type "verbatim" when all characters up to the end delimiter are passed as they are through the sigil mechanism. But as @josevalim has pointed out (a few times), that is not what the user, the programmer, wants to know. It is how the frontend+backend combination behaves that is interesting. @josevalim: Since there are custom sigils in Elixir - remind me - how is the end of the content decided for different sigil types? I presume the customization implementation cannot affect how the content end is found as in; can the end delimiter be escaped, should the As the Regarding the regular expression sigil(s): as already said there is no use in having the frontend+backend combination verbatim. Since the end char scanning rules are decided from the sigil name only we can say that Anyway, neither «quote chars» @zuiderkwast: I had no idea that they were used as »quote« in a number of languages, and certainly not »quote». It seems to be just Finnish that only uses »quote» (as an alternative to "quote"). Swedish seems to also allow »quote«. Therefore it seems to be safe to say that there is a small minority (Finnish) that uses »quote». There would be no problem to add The "right/left-pointing double angle quotation mark"s are in latin1 (ISO 8859-1). The latin1 range is the character range that Erlang has always been defined in. The letters in latin1 (above 127) are allowed in variable names and unquoted atoms, so they are already in the syntax. But they haven't been used for keywords and such before. I see no technical problem in using them as delimiters. They were actually considered for the binary syntax instead of In this case it is optional to use |
The choice of lowercase/uppercase decides at the tokenizer level if interpolation is enabled and Elixir only handles the escaping of the closing delimiter (which I believe we should align with Erlang, as per the previous message). Everything else is handled by the sigil implementation.
Agreed. It would be nice if we could call all uppercase sigils "verbatim" though. I understand now that you used |
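To make the two end-of-content rules discussed above concrete, here is a minimal sketch in Python (not Erlang or Elixir tokenizer code). Both function names and the decision to keep the backslash in the returned content are assumptions made for illustration only. The first scanner treats the content as truly verbatim, so the first end delimiter always ends the string; the second lets a backslash hide the end delimiter from the scanner.

```python
def scan_verbatim(source: str, start: int, end_delim: str) -> tuple[str, int]:
    """Hypothetical verbatim rule: the first end delimiter always ends the content."""
    stop = source.find(end_delim, start)
    if stop == -1:
        raise ValueError("unterminated sigil string")
    # Everything before the delimiter is passed on unchanged.
    return source[start:stop], stop + len(end_delim)


def scan_escaping(source: str, start: int, end_delim: str) -> tuple[str, int]:
    """Hypothetical escaping rule: a backslash hides the next character from the scanner."""
    i = start
    out = []
    while i < len(source):
        ch = source[i]
        if ch == "\\" and i + 1 < len(source):
            # Keep both characters (an assumption); the sigil backend decides what they mean.
            out.append(source[i:i + 2])
            i += 2
        elif ch == end_delim:
            return "".join(out), i + 1
        else:
            out.append(ch)
            i += 1
    raise ValueError("unterminated sigil string")
```

With the input `a\"b" rest`, the verbatim scanner stops at the first quote and returns `a\`, while the escaping scanner skips the escaped quote and returns `a\"b`. That difference is why a verbatim sigil needs a choice of end delimiters: the content can never contain the one that was picked.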
Fine then. :-)
I think it's safe; nobody will be confused. I have seen this style in Swedish books though. It's not that uncommon. Do you have any old printed books around? The quotation marks article on Swedish Wikipedia has this reference: It's not that easy to find scanned books online but here are two screenshots that I found: (from http://www.eom.nu/wp-content/uploads/2018/05/sandebud-1937-del-1.pdf) (from https://www.hembygd.se/nassjo/gesallprovet-nassjo-tryckeriet-25-ar) Though the more common in Swedish are ”…”, both pointing in the same direction, not “…” as in English or „…“ as in German. |
That probably explains why the Finnish has »that» too. Old Swedish influence. Edit: Found one. "Illiaden | Odysén", printed 1963. Uses »...». |
I think introducing into the language syntax characters that aren't easily available on most keyboards is a mistake - I don't think it would look like a serious language feature if I need to copy the characters from documentation just to use it. |
@michalmuskala: "easily available" and "most keyboards" are grey zones, and "need to copy" might be a bit lazy. It should be a solvable problem, if even a problem. One can choose to use other delimiters. |
The great advantage of having |
At least the US keyboard layout (and all others based on it) doesn't have the character easily available. This already excludes large swaths of the programmer population. |
Google says that they are at [ |
On French keyboards [ |
I hear it is the same on Swedish International. |
On my Polish keyboard I get And this is kind of my point - if I have to google how to type in the programming language's syntax - it's already failing at providing good syntax. |
But |
I second @michalmuskala's doubts about « and » delimiters. @RaimoNiskanen, as you pointed out, these characters are in Latin-1 character set, but outside the ASCII range, and the "easy enough to type" votes here, at least so far, are from speakers of Latin-1 encodable languages. To give a counterexample, these characters are not easily typable on Polish keyboards. I imagine it's similar for any other users of Latin-2-suited keyboards, i.e. all of Central and Eastern Europe. |
If this is the primary use case, I'd consider cc @josevalim |
It's in the Latin-2 set though?
Just to be clear, these characters are not shown on my keyboard either, I simply pressed [Alt Gr] and tried every key until I found where they are. But ultimately they ended up easy enough to type (just not obvious). How do you type these on Polish keyboards? If it's not a simple [Alt Gr] it might require [Shift] + [Alt Gr] which, while less convenient, is still within acceptable bounds IMO, for a character that will be sparsely used. |
As far as I can tell, there's a kind of reason a lot of languages ended up with literal strings being declared as heredoc strings something like:
if only because if you're not gonna allow escaping, you're going to always have edge cases, so what they all end up doing is having a configurable delimiter with unmistakable syntax, which nobody fully likes because of how much room it takes. Picking other fixed delimiters is always going to inherently trade off the experience of some users in some contexts. It doesn't matter if you pick
|
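As an illustration of the "configurable delimiter" idea (not the example from the comment above, which is not preserved here), the following Python sketch scans a raw string whose closing delimiter is a quote followed by as many `#` characters as were written before the opening quote. The shape is loosely modeled on Rust raw strings; the function name `scan_raw` is hypothetical.

```python
def scan_raw(source: str) -> tuple[str, int]:
    """Scan r, N hashes, a quote, content, a quote, N hashes (illustrative sketch only)."""
    if not source.startswith("r"):
        raise ValueError("expected raw string prefix")
    i = 1
    while i < len(source) and source[i] == "#":
        i += 1
    hashes = i - 1  # number of '#' chosen by the writer
    if i >= len(source) or source[i] != '"':
        raise ValueError("expected opening quote")
    closer = '"' + "#" * hashes
    begin = i + 1
    stop = source.find(closer, begin)
    if stop == -1:
        raise ValueError("unterminated raw string")
    # No escapes: the content is everything up to the first matching closer.
    return source[begin:stop], stop + len(closer)
```

Because the writer picks the number of `#` characters, any content can be quoted without escaping, at the cost of the extra room the delimiters take, which is the trade-off described above.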
@wojtekmach: We have had long internal discussions, which have homed in on the idea that the delimiters "should look like delimiters" as in vertical lines:
And we don't want chars that are easily mistaken so not But Erlang has already been defined for latin-1. All latin-1 letters are allowed in variable names and unquoted atoms. Still almost nobody uses that. Probably to not exclude e.g. latin-2 programmers. I'll sleep on this, but since latin-1 is particular to western Europe, some characters exclude eastern Europe and large parts of the rest of the world. This may be the argument against allowing latin-1 for syntax that I have been missing. But And, @essen: « and » are not in latin-2. I'll be back. |
@ferd: Quite right. We have landed on our Here-documents, "triple-quoted strings" that allow a number of We also have non-verbatim strings where all So now we are trying to find the best set of delimiters for verbatim strings where the end delimiter char cannot be escaped. So it is a corner case, but worth trying to find something "optimal". |
It's in code page 852 and Windows code page 1250, which overlap with Latin-2, but are different encodings, and ISO 8859-2 aka Latin-2 is yet a different one.
The point is I can't :| At least not on a Mac, maybe it's different on Windows due to the above ISO vs MS differences. |
Ah I was looking at the wrong "Latin-2" (CP852), my bad. Sounds like those characters may not be a good fit for general use then. |
I need to apologize - I was apparently under the delusion that latin-1 was more universal than it is, but it is just one of the latin-* siblings. It is not a common denominator despite its status as the base of Unicode. 7-bit US ASCII is the common denominator. (Not entirely, but still...) Guillemets are out, and backtick is in, just because it is in 7-bit US ASCII, uncommon, and used in e.g. Markdown for this purpose. Sorry for the noise, and thank you for the counter noise :-) I will also write something in the EEP about why latin-1 is a bad choice, even though it is the character set that Erlang is defined in. |
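The character set claims in this part of the thread can be checked with a short Python sketch that uses the standard library codecs. The helper name `encodable_in` is made up for this example; the codec names are Python's own aliases for ASCII, ISO 8859-1, ISO 8859-2, code page 852 and Windows code page 1250.

```python
def encodable_in(char: str, encoding: str) -> bool:
    """Return True if the character has a code point in the given encoding."""
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


ENCODINGS = ["ascii", "latin-1", "iso8859-2", "cp852", "cp1250"]

# Map each candidate delimiter to the encodings that contain it.
table = {
    ch: [enc for enc in ENCODINGS if encodable_in(ch, enc)]
    for ch in ["«", "»", "`"]
}

for ch, encs in table.items():
    print(repr(ch), "is encodable in:", ", ".join(encs))
```

The guillemets are present in latin-1, CP852 and Windows-1250 but absent from ASCII and ISO 8859-2, whereas the backtick is in all five, which matches the conclusion that only 7-bit ASCII works as a common denominator.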
Excellent! May I add just a tiny bit of noise? As far as I know, backticks are not used anywhere in Erlang and I think we should reserve them for future use. Imagine we need a new syntax for something in 10 years and backticks are no longer an option due to sigils. It is a bit silly and sigils are already gated with a Disclaimer: backticks have no use in Elixir either, for similar reasons. |
I was wondering the same thing, I think only |
As @essen says, all suggested delimiters have other meanings in Erlang today. Only @josevalim: I agree that it itches a bit to add |
It depends on how safe you want to be. Not adding it is 100% conflict-free. Adding it is less than 100%... but probably safe enough. :D Either way is fine, I just thought I would mention it for completeness. |
It is a valid point. We cannot be 100% certain in predicting that no future syntax suggestion will ever collide with the sigil syntax delimiter handling. But I hope we can be sure that any future syntax suggestion can be designed to not collide with the And I do hope that anyone that can see such a danger will see it now, before OTP-27... |