Change to Sigil Types named according to Erlang types

erlang · Oct 12, 2023 · da73910
1 parent 0315567
Showing 1 changed file with 67 additions and 52 deletions.
diff --git a/eeps/eep-0066.md b/eeps/eep-0066.md
@@ -57,7 +57,7 @@ The tokenizer a.k.a. scanner a.k.a. lexer scans the source code
 character sequence and converts it into a sequence of Tokens,
 like atom, variable, string, integer, reserved word,
 punctuation character or operator:
-«`atom`», «`Variable`», «`"string"`», «`123`», «*case*», «`:`» and «`++`».
+`atom`, `Variable`, `"string"`, `123`, *`case`*, `:` and `++`.
 
 The parser takes a sequence of tokens and builds a parse tree,
 AST (Abstract Syntax Tree), according to the Erlang grammar.
@@ -139,12 +139,12 @@ tokenizing and parsing.
 ### Sigil
 
 In a general sense, a [Sigil][3], is a prefix to a variable
-that indicates its type, such as «`$I`» in Basic or Perl,
+that indicates its *type*, such as «`$I`» in Basic or Perl,
 where «`$`» is the sigil and «`I`» is the variable.
 
 Here we define a Sigil as a prefix (and a suffix) to a string literal
-that indicates how it should be interpreted.  The Sigil is
-a syntactic sugar that creates some Erlang term.
+that indicates how it should be *interpreted*.  The Sigil is
+a *syntactic sugar* that creates some Erlang term.
 
 A Sigil string literal consists of:
 
@@ -164,63 +164,57 @@ The [Sigil Type][] may be empty.
 The Sigil Type defines how the [Sigil][] syntactic sugar
 shall be interpreted.  The suggested Sigil Types are:
 
-* «»: the Vanilla [Sigil][].
+* «»: the vanilla (default) [Sigil][].
 
   Creates an Erlang `unicode:unicode_binary()`.
   It is a string represented as a UTF-8 encoded binary,
   equivalent to applying `unicode:characters_to_binary/1`
   on the [String Content][].  The [String Delimiters][]
-  and escape characters work as for regular strings,
+  and escape characters work as they already do for regular strings,
   triple-quoted strings, or quoted atoms in Erlang.
 
   So «`~"abc\d"`» is equivalent to «`<<"abc\d"/utf8>>`», and
   «`~'abc"d'`» is equivalent to «`<<"abc\"d"/utf8>>`».
 
-  «`~"`» would work as «`~s"`» and «`~"""`» would work
-  as «`~S"""`» below, regarding escape characters.
+  Regular strings honour escape sequences but triple-quoted strings
+  are verbatim, so «`~"`» is equivalent to «`~b"`» but
+  «`~"""`» is equivalent to «`~B"""`», as described below.
 
   A simple way to create strings as UTF-8 binaries is supposedly
   the first and most desired missing string feature in Erlang.
-  This sigil does just that and has no other features.
+  This sigil does just that.
 
-* «`s`»: [string in Elixir][4].
+* «`b`»: `unicode:unicode_binary()`
 
-  Creates an Erlang `unicode:unicode_binary()`, handling
-  escape characters in the string content.  Other features
-  such as string interpolation will require other Sigil Types
-  or using the [Sigil Suffix][].
+  Creates a UTF-8 encoded binary, handling escape characters
+  in the string content.  Other features such as string interpolation
+  will require another Sigil Type or using the [Sigil Suffix][].
 
-  Escape characters and other features are the same regardless
-  of which [String Delimiters][] that are used.
+  In Elixir this corresponds to the «`~s`» sigil, a [string][4].
 
-* «`S`»: [string in Elixir][4], verbatim.
+* «`B`»: `unicode:unicode_binary()`, verbatim.
 
-  Creates an Erlang `unicode:unicode_binary()`, with verbatim
-  string content in that only the [end delimiter][] character
-  can be escaped with a «`\`» character.
+  Creates a UTF-8 encoded binary, with verbatim string content
+  in that only the [end delimiter][] character can be escaped
+  with a «`\`» character.
 
-  Which [String Delimiters][] that are used does not matter,
-  except that between triple-quote delimiters according to
-  [EEP 64][] there is no end delimiter character to escape.
+  In Elixir this corresponds to the «`~S`» sigil, a [string][4].
 
-* «`c`»: [charlist in Elixir][4].
+* «`s`»: `string()`.
 
-  Creates an Erlang `string()`, handling escape characters
+  Creates a Unicode codepoint list, handling escape characters
   in the string content.  Other features such as string interpolation
-  will require other Sigil Types or using the [Sigil Suffix][].
+  will require another Sigil Type or using the [Sigil Suffix][].
 
-  Escape characters and other features are the same regardless
-  of which [String Delimiters][] that are used.
+  In Elixir this corresponds to the «`~c`» sigil, a [charlist][5].
 
-* «`C`»: [charlist in Elixir][4], verbatim.
+* «`S`»: `string()`, verbatim.
 
-  Creates an Erlang `string()`, with verbatim string content
+  Creates a Unicode codepoint list, with verbatim string content
   in that only the [end delimiter][] character can be escaped
   with a «`\`» character.
 
-  Which [String Delimiters][] that are used does not matter,
-  except that between triple-quote delimiters according to
-  [EEP 64][] there is no end delimiter character to escape.
+  In Elixir this corresponds to the «`~C`» sigil, a [charlist][5].
 
 * «`r`»: regular expression.
 
@@ -240,7 +234,7 @@ shall be interpreted.  The suggested Sigil Types are:
   there is no end delimiter character to escape.
 
   The main advantage of a regular expression [Sigil][] is to avoid
-  the additional escaping of «`\`» that regular erlang strings add.
+  the additional escaping of «`\`» that regular erlang strings require.
 
   Today: `re:run(Subject, "^\\s*\"[a-z]+\\\\\\d+\"", [caseless,unicode])`
 
@@ -250,13 +244,23 @@ shall be interpreted.  The suggested Sigil Types are:
   such as making the `re` module recognize this tuple format,
   and having the code loader pre-compile them.
 
-This EEP proposes that other Sigil Types should cause an error
-"illegal sigil type" in the tokenizer or the parser.  Another
-possibility would be to pass them further in the compilation
-chain to allow parse transforms to act on them, but that feature
-can be added later, and in general one should avoid
-using parse transforms since they are often a source for
-hard to find problems.
+Other, unknown, Sigil Types should cause an error "illegal sigil type"
+in the tokenizer or the parser.  Another possibility would be
+to pass them further in the compilation chain enabling parse transforms
+to act on them, but that feature can be added later, and in general
+one should avoid using parse transforms since they are often a source
+for hard to find problems.
+
+These proposed Sigil Types are named according to the corresponding
+Erlang types.  The Sigil Types in [Elixir][1] are named according to
+Elixir types.  So, for example, a «`~s`» Sigil Type in Erlang
+creates an Erlang `string()`, which is a list of Unicode codepoints,
+but in Elixir the «`~s`» Sigil Type creates an Elixir [String][4]
+which is a UTF-8 encoded binary.
+
+Consistency within the language is supposedly more important
+than between the languages, and that the string types are
+different between the languages is already a known quirk.
 
 ### String Delimiters
 
@@ -270,6 +274,12 @@ as end delimiter: single quote «`'`» and double quote «`"`».
 Triple-quote delimiters are also allowed, that is; a sequence of
 3 or more double quote «`"`» characters as described in [EEP 64][].
 
+Which String Delimiters are used does not affect how
+the string content is interpreted, except that the end delimiter
+may require special handling.  Not for a triple-quoted string,
+though, since conceptually, the end delimiter cannot occur
+in the string's content.
+
 ### String Content
 
 Between the start and end [String Delimiters][], all characters
@@ -372,11 +382,17 @@ should represent an *uncompiled* regular expression with compile flags.
 
 ### Comparison with Elixir
 
-An empty [Sigil Type][] is not allowed in Elixir.
+The [Vanilla Sigil][] (empty [Sigil Type][]) is not allowed in Elixir.
+
+The string and binary [Sigil Type][]s are named differently
+between the languages, to keep the names consistent within
+the language (Erlang): «`~s`» in Elixir is «`~b`» in Erlang,
+and «`~c`» in Elixir is «`~s`» in Erlang, so «`~s`» means
+different things, because strings are different things.
 
 When Elixir allows escape sequences in the [String Content][]
-it also allows string interpolation.  This EEP avoids the topic
-of string interpolation.
+it also allows string interpolation.  This EEP proposes to *not*
+implement string interpolation in the suggested [Sigil Type][]s.
 
 There are small differences in which escape sequences that are implemented
 in the languages; Elixir allows escaping of newlines, and has
@@ -386,26 +402,22 @@ There are also small differences in how newlines are handled
 between «`~S`» heredocs in Elixir and triple-quoted strings in Erlang.
 See [EEP 64][].
 
-Adding the [Vanilla Sigil][], «`~`» to an Erlang regular string
-or triple-quoted string creates a UTF-8 encoded binary equivalent
-to the corresponding [Elixir][1] string or «`~S`» heredoc.
-
 Details about regular expression sigils, «`~r`», in particular
 their [Sigil Suffix][]es remain to be decided in Erlang.
 
 It has not been decided how or even *if* string interpolation
 will be implemented in Erlang, but a [Sigil Suffix][] or
-a [Sigil Type][] would most probably be used.
+new [Sigil Type][]s would most probably be used.
 
 Reference Implementation
 ------------------------
 
 [PR-7684][] Implements the basics of handling Sigils on string literals.
-The tokenizer produces a «`sigil`» token before the string literal, and a
-«`sigil_suffix`» token after.  The parser merges and transforms them
+The tokenizer produces a `sigil` token before the string literal, and a
+`sigil_suffix` token after.  The parser merges and transforms them
 into the correct output term.
 
-Another approach would be to produce (for example) a «`sigil_string`» token
+Another approach would be to produce (for example) a `sigil_string` token
 for the whole string and then handle that in the parser.
 It would require more state to be kept in the tokenizer between
 the parts of the sigil prefixed string, and therefore need
@@ -423,6 +435,9 @@ more tokenizer rewriting.
 [4]:     https://elixir-lang.org/getting-started/basic-types.html#strings
          "The Elixir Programming Language: Getting Started - Basic Types - Strings"
 
+[5]:     https://elixir-lang.org/getting-started/binaries-strings-and-char-lists.html#charlists
+         "The Elixir Programming Language: Getting Started - Binaries, strings, and charlists - Charlists"
+
 [EEP 64]:     https://www.erlang.org/eeps/eep-0064.md
               "EEP 64: Triple-Quoted Strings"