- Introduction
- Contributing
- Encoding of text
- R’s text representation
- R Connections and encodings
- Printing text to the console
- Issues when building packages
- Unit tests
- Graphics
- How-to
- How to query the current native encoding?
- How is the
encoding
argument and option used? - How to read lines from UTF-8 files?
- How to write lines to UTF-8 files?
- How to convert text between encodings?
- How to guess the encoding of text?
- How to check if a string is UTF-8?
- How to download web pages in the right encoding?
- Are RDS files encoding-safe?
- How to
parse()
UTF-8 text into UTF-8 code? - How to
deparse()
UTF-8 code into UTF-8 text? - How to include UTF-8 characters in a package
DESCRIPTION
- How to get a package
DESCRIPTION
in UTF-8? - How to use UTF-8 file names on Windows?
- How to capture UTF-8 output?
- How to test non-ASCII output?
- How to avoid non-ASCII characters in the manual?
- How to include non-ASCII characters in PDF vignettes?
- Encoding related R functions and packages
- Known encoding issues in packages and functions
- Tips for debugging encoding issues
- Text transformers
- Changes in R 4.1.0
- Changes in R 4.2.0
- Further Reading
The goal of this document is twofold: (1) collect current best practices to solve text encoding related issues and (2) include links to further information about text encoding.
Encoding of text is a large topic, and to make this FAQ useful, we need your contributions. There are a variety of ways to contribute:
- Found a typo a broken link or some example code is not working for you? Please open an issue about it. Or even better, consider fixing it in a pull request.
- Something is missing? There is a new (or old) package of function that should be mentioned here? Please open an issue about it.
- Some information is wrong or something is not explained well? Please open an issue about it. Or even better, fix it in a pull request!
- You found an answer useful? Please tweet about it and link to this FAQ.
Your contributions are most appreciated.
Good general introductions to text encodings:
See the R Internals manual:
https://cran.r-project.org/doc/manuals/r-devel/R-ints.html#Encodings-for-CHARSXPs
Some notes about connections.
- Connections assume that the input is in the encoding specified by
getOption("encoding")
. They convert text to the native encoding. - Functions that create connections have an
encoding
argument, that overridesgetOption("encoding")
. - If you tell a connection that the input is already in the native encoding, then it will not convert the input. If the input is in fact not in the native encoding, then it is up to the user to mark it accordingly, e.g. as UTF-8.
R converts strings to the native encoding when printing them to the screen. This is important to know when debugging encoding problems: two strings that print the same way may have a different internal representation:
s1 <- "\xfc"
Encoding(s1) <- "latin1"
s2 <- iconv(s1, "latin1", "UTF-8")
s1
#> [1] "ü"
s2
#> [1] "ü"
s1 == s2
#> [1] TRUE
identical(s1, s2)
#> [1] TRUE
testthat::expect_equal(s1, s2)
testthat::expect_identical(s1, s2)
Encoding(s1)
#> [1] "latin1"
Encoding(s2)
#> [1] "UTF-8"
charToRaw(s1)
#> [1] fc
charToRaw(s2)
#> [1] c3 bc
Some Unicode glyphs, e.g. typical emojis, are wide. They are supposed to take up two characters in a monospace font. The set of wide glyphs (just like the set of glyphs in general) has been changing in different Unicode versions.
Up to version 4.0.3 R followed Unicode 8 (released in 2015) in terms of
character width, so nchar(..., type = "width")
miscalculated the width
of many Asian characters.
R 4.0.4 follows Unicode 12.1.
R 4.1.0 - R 4.3.3 follow Unicode 13.0.
Current R-devel (will be R 4.4.0 in about a month) still uses Unicode 13.0.
To correctly calculate the display width of Unicode text, across various
R versions, you can use the utf8 package, or cli::ansi_nchar()
.
Note that some terminals, and also RStudio follow an older Unicode version of the standard, and they also calculate the display width incorrectly, typically printing characters on top of each other.
See ‘Writing R Extensions’ about including UTF-8 characters in packages: https://cran.r-project.org/doc/manuals/R-exts.html#Encoding-issues
I suggest that you keep your source files in ASCII, except for comments, which you can write in UTF-8 if you like. (But not necessarily in roxygen2 comments, which will go in the manual, see below!)
If you need a string containing non-ASCII characters, construct it with
\uxxxx
or \U{xxxxxx}
escape sequences, which yields a UTF-8-encoded
string.
one_and_a_half <- "\u0031\u00BD"
one_and_a_half
#> [1] "1½"
Encoding(one_and_a_half)
#> [1] "UTF-8"
charToRaw(one_and_a_half)
#> [1] 31 c2 bd
Here are some handy ways to find the Unicode code points for an existing string:
- Copy and paste into the Unicode character inspector.
sprintf("\\u%04X", utf8ToInt(x))
- Do we have other suggestions?
If you need to include non-UTF-8 non-ASCII characters, e.g. for testing,
then include them via raw data, or via \xxx
escape sequences. E.g. to
include a latin1 string, you can do either of these:
t1 <- "t\xfck\xf6rf\xfar\xf3g\xe9p"
Encoding(t1) <- "latin1"
t2 <- rawToChar(as.raw(c(
0x74, 0xfc, 0x6b, 0xf6, 0x72, 0x66, 0xfa, 0x72,
0xf3, 0x67, 0xe9, 0x70)))
Encoding(t2) <- "latin1"
t1 == t2
#> [1] TRUE
identical(charToRaw(t1), charToRaw(t2))
#> [1] TRUE
Symbols must be in the native encoding, so best practice is to only use ASCII symbols.
This is not completely equivalent to using ASCII code files, because names, e.g. in a named list, or the row names in a data frame are also symbols in R. So do not use non-ACSII names. (Another reason to avoid row names in data frames.)
A typical mistake is to assign file names as list or row names when working with files. While it might be a convenient representation, it is a bad idea because of the possible encoding errors.
Similarly, always use ASCII column names.
Unit test files are code files, so the same rule applies to them: keep them ASCII. See the ‘How-to’ section below on specific testing related tasks.
You can use some UTF-8 characters in manual pages, if you include
\endoding{UTF-8}
in the manual page, or you declared Encoding: UTF-8
in the DESCRIPTION
file of the package.
Unfortunately these do not include the characters typically used in the
output of tibble and cli. The problem is with the PDF manual and LaTeX,
so you can conditionally include UTF-8 in the HTML output with the \if
Rd command.
If your PDF manual does not build because of a non-supported UTF-8
character has sneaked in somewhere, tools::showNonASCIIfile()
can show
where exactly they are.
TODO
TODO
R uses the native encoding extensively. Some examples:
- Symbols (i.e. variable names, names in a named list, column and row names, etc.) are always in the native encoding.
- Strings marked with
"unknown"
encoding are assumed to be in this encoding. - Output connections assume that the output is in this encoding.
- Input connection convert the input into this encoding.
enc2native()
encodes its input into this encoding.- Printing to the screen (via
cat()
,print()
,writeLines()
, etc.) re-encodes strings into this encoding.
It is surprisingly tricky to query the current native encoding, especially on older R versions. Here are a number of things to try:
-
Call
l10n_info()
. If the native encoding is UTF-8 or latin1 you’ll see that in the output:❯ l10n_info() $MBCS [1] TRUE $`UTF-8` [1] TRUE $`Latin-1` [1] FALSE
-
On Windows, you’ll also see the code page:
❯ l10n_info() $MBCS [1] FALSE $`UTF-8` [1] FALSE $`Latin-1` [1] TRUE $codepage [1] 1252 $system.codepage [1] 1252
Append the codepage number to
"CP"
to get a string that you can use iniconv()
. E.g. in this case it would be"CP1252"
. -
On Unix, from R 4.1 the name of the encoding is also included in the answer. This is useful on non-UTF-8 systems:
❯ l10n_info() $MBCS [1] FALSE $`UTF-8` [1] FALSE $`Latin-1` [1] FALSE $codeset [1] "ISO-8859-15"
-
On older R versions this field is not included. On R 3.5 and above you can use the following trick to find the encoding:
❯ rawToChar(serialize(NULL, NULL, ascii = TRUE, version = 3)) [1] "A\n3\n262148\n197888\n5\nUTF-8\n254\n"
Line number 6 (i.e. after the fifth
\n
) in the output is the name of the encoding. On R 4.0.x you can also save an RDS file and then use theinfoRDS()
function on it to see the current native encoding. -
Otherwise can call
Sys.getlocale()
and parsing its output will probably give you an encoding name that works iniconv()
:❯ Sys.getlocale("LC_CTYPE") [1] "en_US.iso885915"
In general functions use the encoding
argument two ways:
- For some functions it specifies the encoding of the input.
- For others it specifies how output should be marked.
For example
file("input.txt", encoding = "UTF-8")
tells file()
that input.txt
is in UTF-8
. On the other hand
scan("input2.txt", what = "", encoding = "UTF-8")
means that the output of scan()
will be marked as having UTF-8
encoding. (scan()
also has a fileEncoding
argument, which specifies
the encoding of the input file.
TODO: the weirdness of readLines()
.
The simplest option is to use the brio package, if you can allow it:
brio::read_lines("file")
(brio also has a read_file()
function if you want to read in the whole
file into a string.)
brio does not currently check if the file is valid in UTF-8. See below how to do that.
The xfun package has a nice function that reads UTF-8 files with base R
only: xfun::read_utf8()
. It currently reads like this, after some
simplification:
read_utf8 <- function (path) {
opts <- options(encoding = "native.enc")
on.exit(options(opts), add = TRUE)
x <- readLines(path, encoding = "UTF-8", warn = FALSE)
x
}
(The original function also checks that the returned lines are valid UTF-8.)
Explanation and notes:
path
must be a file name, or an un-opened connection, or a connection that was opened in the native encoding (i.e. withencoding = "native.enc"
). Otherwiseread_utf8()
might (silently) return lines in the wrong encoding.- If
readLines()
gets a file name, then it opens a connection to it inUTF-8
, so the file is not re-encoded and it also marks the result asUTF-8
. - For extra safety, you can add a check that
path
is a file name. - As far as I can tell, there is no R API to query the encoding of a connection.
TODO: does this work correctly with non-ASCII file names?
Call brio::write_file()
to write a character vector to a file. Notes:
brio::write_file()
converts the input character vector to UTF-8. It uses the marked encoding for this, so if that is not correct, then the conversion won’t be correct, either.- Will handle UTF-8 file names correctly.
brio::write_file()
cannot write to connections currently, because with already opened connections there is no way to tell their encoding.
Kevin Ushey’s excellent blog post has an excellent detailed derivation of this base R function:
write_utf8 <- function(text, path) {
# step 1: ensure our text is utf8 encoded
utf8 <- enc2utf8(text)
upath <- enc2utf8(path)
# step 2: create a connection with 'native' encoding
# this signals to R that translation before writing
# to the connection should be skipped
con <- file(upath, open = "w+", encoding = "native.enc")
on.exit(close(con), add = TRUE)
# step 3: write to the connection with 'useBytes = TRUE',
# telling R to skip translation to the native encoding
writeLines(utf8, con = con, useBytes = TRUE)
}
(This is a slightly more robust version than the one in the blog post.)
TODO: does this work correctly with non-ASCII file names?
The iconv()
base R function can convert between encodings. It ignores
the Encoding()
markings of the input completely. It uses ""
as the
synonym for the native encoding.
The stringi::stri_enc_detect()
function tries to detect the encoding
of a string. This is mostly guessing of course.
Not all bytes and bytes sequences are valid in UTF-8. There are a few alternatives to check if a string is valid:
-
stringi::stri_enc_isutf8()
-
utf8::utf8_valid()
-
The following trick using only base R:
! is.na(iconv(x, "UTF-8", "UTF-8"))
iconv()
returnsNA
for the elements that are invalid in the input encoding.
TODO
No, in general they are not, but the situation is quite good. If you save an RDS file that contains text in some encoding, it is safe to read that file on a different computer (or the same computer with different settings), if at least one these conditions hold:
- all text is either UTF-8 and latin1 encoded, and they are also marked as such, or
- both computers (or settings) have the same native encoding,
- the RDS file has version 3, and the loading platform can represent all characters in the RDS file. This usually holds if the loading platform is UTF-8.
Note that from RDS version 3 the strings in the native encoding are re-encoded to the current native encoding when the RDS file is loaded.
On R 3.6 and before we need to create a connection to create code with UTF-8 strings, from an UTF-8 character vector. It goes like this:
safe_parse <- function(text) {
text <- enc2utf8(text)
Encoding(text) <- "unknown"
con <- textConnection(text)
on.exit(close(con), add = TRUE)
eval(parse(con, encoding = "UTF-8"))
}
- By marking the text as
unknown
we make sure thattextConnection()
will not convert it. - The
encoding
argument ofparse()
marks the output asUTF-8
. - Note that when
parse()
parses from a connection it will lose the source references. You can use thesrcfile
argument ofparse()
to add them back.
TODO
Some fields in DESCRIPTION
may contain non-ASCII characters. Set the
Encoding
field to UTF-8
, to use UTF-8 characters in these:
Encoding: UTF-8
The desc package always converts DESCRIPTION
files to UTF-8
.
You can use functions or a package that already handles UTF-8 file names, e.g. brio or processx. If you need to open a file from C code, you need to do the following:
- Convert the file name to UTF-8 just before passing it to C. In generic
code
enc2utf8()
will do. - In the C code, on Windows, convert the UTF-8 path to UTF-16 using the
MultiByteToWideChar()
Windows API function. - Use the
_wfopen()
Windows API function to open the file.
Here is an example from the brio package: https://github.com/r-lib/brio/blob/2cf72bb77ad55c758b4a140112916ddc23f00b59/src/brio.c#L9
This is only possible on UTF-8 systems, and there is no portable way
that works on Windows or other non-UTF-8 platforms. If the native
encoding is UTF-8, then sink()
and capture.output()
will create text
in UTF-8.
Capturing UTF-8 output on Windows (or a Unix system that does not support a UTF-8 locale) is not currently possible. (Well, on Windows there is a very dirty trick, but it involves changing an internal R flag, and running the code in a subprocess.)
Somewhat. testthat 3 by default turn off non-ASCII output from cli (and thus tibble, etc.). So your snapshots that use cli facilities will be in ASCII, which is safe to use anywhere.
If you non-ASCII output from other sources, then hopefully there is a switch to turn them off. As far as I can tell, rlang uses cli to produce non-ASCII output, so it should be fine.
If you need to test non-ASCII output produced by the cli package, then
- use snapshot tests,
- call
testthat::local_reproducible_output(unicode = TRUE)
at the beginning of your test, - record your snapshots on a UTF-8 platform.
Your tests will not run on non-UTF-8 platforms. It is not currently possible to record non-ASCII snapshot tests on non-UTF-8 platforms.
If you accidentally included some non-ASCII characters in the manual,
e.g. because you included some UTF-8 output from cli or tibble/pillar,
then you can set options(cli.unicode = FALSE)
at the right place to
avoid them.
The tools::showNonASCIIfile()
function helps finding non-ASCII
characters in a package.
TODO
-
base::charToRaw()
-
base::Encoding()
andbase::`Encoding<-`
-
base::iconv()
-
base::iconvlist()
-
base::nchar()
-
base::l10n_info()
yaml::write_yaml
crashes on Windows on latin1 encoded strings: vubiostat/r-yaml#90
-
charToRaw()
is your best friend. -
Don’t forget, if they print the same, if they are
identical()
, they can still be in a different encoding.charToRaw()
is your best friend. -
testthat::CheckReporter
saves atestthat-problems.rds
file, if there were any test failures. You can get this file from win-builder, R-hub, etc. The file is a version 2 RDS file, so no encoding conversion will be done byreadRDS()
. -
Don’t trust any function that processes text. Some functions keep the encoding of the input, some convert to the native encoding, some convert to UTF-8. Some convert to UTF-8, without marking the output as UTF-8. Different R versions do different re-encodings.
-
Typing in a string on the console is not the same as
parse()
-ing the same code from a file. The console is always assumed to provide text in the native encoding, package files are typically assumed UTF-8. (This is why it is important to use\uxxxx
escape sequences for UTF-8 text.)
-
print()
-
format()
-
normalizePath()
-
basename()
-
encodeString()
TODO
TODO
Good general introductions to text encodings:
R documentation:
-
Encodings for CHARSXPs section in R internals
-
Writing portable packages / Encoding issues in Writing R Extensions
-
Encodings and R – Old and slightly outdated, it gives a good summary of what it took to change R’s internal encoding.
Intros to various encodings:
About Unicode: