Change to Sigil Types named according to Erlang types
RaimoNiskanen committed Oct 12, 2023
1 parent 0315567 commit da73910
Showing 1 changed file with 67 additions and 52 deletions.
119 changes: 67 additions & 52 deletions eeps/eep-0066.md
@@ -57,7 +57,7 @@ The tokenizer a.k.a. scanner a.k.a. lexer scans the source code
character sequence and converts it into a sequence of Tokens,
like atom, variable, string, integer, reserved word,
punctuation character or operator:
«`atom`», «`Variable`», «`"string"`», «`123`», «*case*», «`:`» and «`++`».
`atom`, `Variable`, `"string"`, `123`, *`case`*, `:` and `++`.

The parser takes a sequence of tokens and builds a parse tree,
AST (Abstract Syntax Tree), according to the Erlang grammar.
@@ -139,12 +139,12 @@ tokenizing and parsing.
### Sigil

In a general sense, a [Sigil][3] is a prefix to a variable
that indicates its type, such as «`$I`» in Basic or Perl,
that indicates its *type*, such as «`$I`» in Basic or Perl,
where «`$`» is the sigil and «`I`» is the variable.

Here we define a Sigil as a prefix (and a suffix) to a string literal
that indicates how it should be interpreted. The Sigil is
a syntactic sugar that creates some Erlang term.
that indicates how it should be *interpreted*. The Sigil is
a *syntactic sugar* that creates some Erlang term.

A Sigil string literal consists of:

@@ -164,63 +164,57 @@ The [Sigil Type][] may be empty.
The Sigil Type defines how the [Sigil][] syntactic sugar
shall be interpreted. The suggested Sigil Types are:

* «»: the Vanilla [Sigil][].
* «»: the vanilla (default) [Sigil][].

Creates an Erlang `unicode:unicode_binary()`.
It is a string represented as a UTF-8 encoded binary,
equivalent to applying `unicode:characters_to_binary/1`
on the [String Content][]. The [String Delimiters][]
and escape characters work as for regular strings,
and escape characters work as they already do for regular strings,
triple-quoted strings, or quoted atoms in Erlang.

So «`~"abc\d"`» is equivalent to «`<<"abc\d"/utf8>>`», and
«`~'abc"d'`» is equivalent to «`<<"abc\"d"/utf8>>`».

«`~"`» would work as «`~s"`» and «`~"""`» would work
as «`~S"""`» below, regarding escape characters.
Regular strings honour escape sequences but triple-quoted strings
are verbatim, so «`~"`» is equivalent to «`~b"`» but
«`~"""`» is equivalent to «`~B"""`», as described below.

A simple way to create strings as UTF-8 binaries is supposedly
the first and most desired missing string feature in Erlang.
This sigil does just that and has no other features.
This sigil does just that.

* «`s`»: [string in Elixir][4].
* «`b`»: `unicode:unicode_binary()`

Creates an Erlang `unicode:unicode_binary()`, handling
escape characters in the string content. Other features
such as string interpolation will require other Sigil Types
or using the [Sigil Suffix][].
Creates a UTF-8 encoded binary, handling escape characters
in the string content. Other features such as string interpolation
will require another Sigil Type or using the [Sigil Suffix][].

Escape characters and other features are the same regardless
of which [String Delimiters][] that are used.
In Elixir this corresponds to the «`~s`» sigil, a [string][4].

* «`S`»: [string in Elixir][4], verbatim.
* «`B`»: `unicode:unicode_binary()`, verbatim.

Creates an Erlang `unicode:unicode_binary()`, with verbatim
string content in that only the [end delimiter][] character
can be escaped with a «`\`» character.
Creates a UTF-8 encoded binary, with verbatim string content
in that only the [end delimiter][] character can be escaped
with a «`\`» character.

Which [String Delimiters][] that are used does not matter,
except that between triple-quote delimiters according to
[EEP 64][] there is no end delimiter character to escape.
In Elixir this corresponds to the «`~S`» sigil, a [string][4].

* «`c`»: [charlist in Elixir][4].
* «`s`»: `string()`.

Creates an Erlang `string()`, handling escape characters
Creates a Unicode codepoint list, handling escape characters
in the string content. Other features such as string interpolation
will require other Sigil Types or using the [Sigil Suffix][].
will require another Sigil Type or using the [Sigil Suffix][].

Escape characters and other features are the same regardless
of which [String Delimiters][] that are used.
In Elixir this corresponds to the «`~c`» sigil, a [charlist][5].

* «`C`»: [charlist in Elixir][4], verbatim.
* «`S`»: `string()`, verbatim.

Creates an Erlang `string()`, with verbatim string content
Creates a Unicode codepoint list, with verbatim string content
in that only the [end delimiter][] character can be escaped
with a «`\`» character.

Which [String Delimiters][] that are used does not matter,
except that between triple-quote delimiters according to
[EEP 64][] there is no end delimiter character to escape.
In Elixir this corresponds to the «`~C`» sigil, a [charlist][5].

* «`r`»: regular expression.

@@ -240,7 +234,7 @@ shall be interpreted. The suggested Sigil Types are:
there is no end delimiter character to escape.

The main advantage of a regular expression [Sigil][] is to avoid
the additional escaping of «`\`» that regular erlang strings add.
  the additional escaping of «`\`» that regular Erlang strings require.

Today: `re:run(Subject, "^\\s*\"[a-z]+\\\\\\d+\"", [caseless,unicode])`

@@ -250,13 +244,23 @@ shall be interpreted. The suggested Sigil Types are:
such as making the `re` module recognize this tuple format,
and having the code loader pre-compile them.
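
The escaping overhead in the `re:run/3` call shown for today's syntax can be
checked with a small sketch. The module name `re_escape_sketch` and the
subject strings are assumptions made for illustration; only existing syntax
and the existing `re` module are used, so it does not depend on any sigil
support.

```erlang
-module(re_escape_sketch).
-export([demo/0]).

demo() ->
    %% The pattern as it has to be written in a regular Erlang string today:
    %% every backslash meant for the regular expression is doubled, and
    %% every double quote is escaped.
    Pattern = "^\\s*\"[a-z]+\\\\\\d+\"",
    %% Printing the string shows the regular expression that `re` receives,
    %% which is what a regular expression sigil would let the author write
    %% more directly.
    io:format("~ts~n", [Pattern]),
    %% Hypothetical subject: two spaces, then "ABC\42" including the quotes.
    Subject = "  \"ABC\\42\"",
    {match, _} = re:run(Subject, Pattern, [caseless, unicode]),
    %% A subject without the surrounding quotes and backslash does not match.
    nomatch = re:run("abc42", Pattern, [caseless, unicode]),
    ok.
```

Calling `re_escape_sketch:demo()` prints the unescaped pattern, which makes it
easy to compare the two spellings side by side.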

This EEP proposes that other Sigil Types should cause an error
"illegal sigil type" in the tokenizer or the parser. Another
possibility would be to pass them further in the compilation
chain to allow parse transforms to act on them, but that feature
can be added later, and in general one should avoid
using parse transforms since they are often a source for
hard to find problems.
Other, unknown, Sigil Types should cause an error "illegal sigil type"
in the tokenizer or the parser. Another possibility would be
to pass them further in the compilation chain enabling parse transforms
to act on them, but that feature can be added later, and in general
one should avoid using parse transforms since they are often a source
of hard-to-find problems.

These proposed Sigil Types are named according to the corresponding
Erlang types. The Sigil Types in [Elixir][1] are named according to
Elixir types. So, for example, a «`~s`» Sigil Type in Erlang
creates an Erlang `string()`, which is a list of Unicode codepoints,
but in Elixir the «`~s`» Sigil Type creates an Elixir [String][4]
which is a UTF-8 encoded binary.

Consistency within the language is supposedly more important
than consistency between the languages, and the fact that the string
types differ between the languages is already a known quirk.
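
As a rough illustration of the naming above, the following sketch pairs each
proposed string Sigil Type (named only in comments) with a term written in
syntax that exists today. The module name `sigil_types_sketch` is
hypothetical, and the equivalences follow this EEP's descriptions rather than
any released implementation.

```erlang
-module(sigil_types_sketch).
-export([demo/0]).

demo() ->
    %% Vanilla sigil and the `b` Sigil Type, for example ~"abc\d":
    %% a UTF-8 encoded binary with escape sequences honoured.
    %% In Erlang strings `\d` is the Delete character, code point 127.
    Bin = <<"abc\d"/utf8>>,
    Bin = unicode:characters_to_binary("abc\d"),
    <<"abc", 127>> = Bin,

    %% Non-ASCII code points become multi-byte UTF-8 sequences:
    %% U+00E5 is encoded as the two bytes 195, 165.
    <<195, 165>> = unicode:characters_to_binary([16#E5]),

    %% The `B` Sigil Type is verbatim, so the backslash would be kept
    %% as an ordinary character: a, b, c, backslash, d.
    <<97, 98, 99, 92, 100>> = <<"abc\\d"/utf8>>,

    %% The `s` Sigil Type creates a string(), a list of Unicode code points.
    [97, 98, 99, 127] = "abc\d",

    %% The `S` Sigil Type is the verbatim variant of `s`.
    [97, 98, 99, 92, 100] = "abc\\d",
    ok.
```

Every match in `demo/0` succeeds on current Erlang/OTP releases, since none of
the proposed sigil syntax is actually used.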

### String Delimiters

@@ -270,6 +274,12 @@ as end delimiter: single quote «`'`» and double quote «`"`».
Triple-quote delimiters are also allowed, that is, a sequence of
3 or more double quote «`"`» characters as described in [EEP 64][].

The choice of String Delimiters does not affect how
the string content is interpreted, except that the end delimiter
may require special handling. A triple-quoted string is the exception,
since conceptually the end delimiter cannot occur
in the string's content.

### String Content

Between the start and end [String Delimiters][], all characters
@@ -372,11 +382,17 @@ should represent an *uncompiled* regular expression with compile flags.

### Comparison with Elixir

An empty [Sigil Type][] is not allowed in Elixir.
The [Vanilla Sigil][] (empty [Sigil Type][]) is not allowed in Elixir.

The string and binary [Sigil Type][]s are named differently
between the languages, to keep the names consistent within
the language (Erlang): «`~s`» in Elixir is «`~b`» in Erlang,
and «`~c`» in Elixir is «`~s`» in Erlang, so «`~s`» means
different things, because strings are different things.

When Elixir allows escape sequences in the [String Content][]
it also allows string interpolation. This EEP avoids the topic
of string interpolation.
it also allows string interpolation. This EEP proposes to *not*
implement string interpolation in the suggested [Sigil Type][]s.

There are small differences in which escape sequences are implemented
in the languages; Elixir allows escaping of newlines, and has
@@ -386,26 +402,22 @@ There are also small differences in how newlines are handled
between «`~S`» heredocs in Elixir and triple-quoted strings in Erlang.
See [EEP 64][].

Adding the [Vanilla Sigil][], «`~`» to an Erlang regular string
or triple-quoted string creates a UTF-8 encoded binary equivalent
to the corresponding [Elixir][1] string or «`~S`» heredoc.

Details about regular expression sigils, «`~r`», in particular
their [Sigil Suffix][]es, remain to be decided in Erlang.

It has not been decided how or even *if* string interpolation
will be implemented in Erlang, but a [Sigil Suffix][] or
a [Sigil Type][] would most probably be used.
new [Sigil Type][]s would most probably be used.

Reference Implementation
------------------------

[PR-7684][] implements the basics of handling Sigils on string literals.
The tokenizer produces a «`sigil`» token before the string literal, and a
«`sigil_suffix`» token after. The parser merges and transforms them
The tokenizer produces a `sigil` token before the string literal, and a
`sigil_suffix` token after. The parser merges and transforms them
into the correct output term.

Another approach would be to produce (for example) a «`sigil_string`» token
Another approach would be to produce (for example) a `sigil_string` token
for the whole string and then handle that in the parser.
It would require more state to be kept in the tokenizer between
the parts of the sigil prefixed string, and therefore need
@@ -423,6 +435,9 @@ more tokenizer rewriting.
[4]: https://elixir-lang.org/getting-started/basic-types.html#strings
"The Elixir Programming Language: Getting Started - Basic Types - Strings"

[5]: https://elixir-lang.org/getting-started/binaries-strings-and-char-lists.html#charlists
"The Elixir Programming Language: Getting Started - Binaries, strings, and charlists - Charlists"

[EEP 64]: https://www.erlang.org/eeps/eep-0064.md
"EEP 64: Triple-Quoted Strings"
