`write_dta()` can error if labelled values are large enough #739

jacob-gg · 2023-11-01T22:59:47Z

Hi, all:

Currently, if labelled values are large enough (beyond the range of a 32-bit integer), write_dta() can error because
an if statement in validate_dta() (called as part of write_dta()) can fail to resolve to TRUE or FALSE.

First, a reprex:

d <- data.frame(x = c(9999999999, 9999999998), y = 1:2)

d$x <- haven::labelled(x = d$x,
                       labels = c('lab1' = 9999999999,
                                  'lab2' = 9999999998))

haven::print_labels(d$x)
#>
#> Labels:
#>  value label
#>  1e+10  lab1
#>  1e+10  lab2

haven::write_dta(d, path = 'out.dta')
#> Warning in is_integerish(attr(x, "labels")): NAs introduced by coercion to
#> integer range
#> Error in if (any(bad_labels)) {: missing value where TRUE/FALSE needed

^{Created on 2023-11-01 with reprex v2.0.2}

When write_dta(data) is executed, validate_dta() is called on data as the first step of the write process. validate_dta() contains the following lines (starting at 158 here):

bad_labels <- vapply(data, has_non_integer_labels, logical(1))
if (any(bad_labels)) {
  cli_abort(c("Stata only supports labelling with integer variables.",
  x = "Problems: {.var {var_names(data, bad_labels)}}"),
  call = call)
}

has_non_integer_labels() (starting at 185 here) is run on each column of data, with a logical expectation. Ideally, all elements of bad_labels end up as TRUE or FALSE. If so, if (any(bad_labels)) can be resolved to TRUE or FALSE and execute (or not) accordingly.

If a column has a non-null labels attribute, is labelled, and is double, then has_non_integer_labels() returns the result of running:

is_integerish(attr(x, "labels"))

on the column in question.

(Sidebar: There may be another issue that arises from passing attr(x, "labels") into is_integerish(); more on that below,
but I'll continue on for the moment.)

The trouble arises if what's passed to is_integerish() are values like 9999999999 and 9999999998, in which case is_integerish() returns NA instead of TRUE or FALSE. is_integerish() coerces its finite, non-missing input values with as.integer(), but since (at least in my understanding) R uses 32-bit integers, it can't successfully coerce a value like 9999999999 to integer.

So: If a value passed into is_integerish() is large enough, at least one element of bad_labels (in validate_dta()) winds up as NA.

If at least one element of bad_labels other than the NA(s) is TRUE, if (any(bad_labels)) will still successfully resolve (to TRUE), at which point write_dta() will throw an expected error with cli::cli_abort(). However, if all of the other elements of bad_labels other than the NA(s) are FALSE, if (any(bad_labels)) can't resolve, the function errors, and no .dta file is written out. E.g., the following if statement can't resolve to TRUE or FALSE:

bad_labels <- c(NA, FALSE, FALSE)
if (any(bad_labels)) {
  1 + 1
}
#> Error in if (any(bad_labels)) {: missing value where TRUE/FALSE needed

^{Created on 2023-11-01 with reprex v2.0.2}

This can cause issues in some "real-world" data sets. For example, trying to convert this ICPSR SPSS .por file to .dta throws an error. (Fair warning: An ICPSR account is needed to download that data set.)

Finally, picking up the sidebar from above: As written, is it possible that the is_integerish(attr(x, 'labels')) call within has_non_integer_labels() operates on column values and not labels? And if so, is that intended? I may be off-base, but here's a reprex showing what I'm referring to:

d <- data.frame(x = 1:2, y = 1:2)
d$x <- haven::labelled(x = d$x,
                       labels = c('lab1' = 1,
                                  'lab2' = 2))

# print_labels() thinks (as I would expect)
# that the labels are "lab1" and "lab2":
haven::print_labels(d$x)
#>
#> Labels:
#>  value label
#>      1  lab1
#>      2  lab2

# But the is_integerish() call within has_non_integer_labels() takes
# attr(x, "labels") as its argument, and attr(d$x, "labels") returns
# a named vector with values 1 and 2 and names "lab1" and "lab2":
attr(d$x, "labels")
#> lab1 lab2
#>    1    2
names(attr(d$x, "labels"))
#> [1] "lab1" "lab2"
sum(attr(d$x, "labels"))
#> [1] 3

# Accordingly, is_integerish() returns TRUE, even when
# given attr(d$x, "labels") as input: 
haven:::is_integerish(attr(d$x, "labels"))
#> [1] TRUE

^{Created on 2023-11-01 with reprex v2.0.2}

It's always possible that I'm misinterpreting something, but it seems to me like is_integerish() currently checks the integer status of variable values, not labels (and it errors if the values are sufficiently large). Since is_integerish() is called inside has_non_integer_labels(), I was expecting the check to be of the integer status of labels.

In summary: I think that write_dta() currently winds up checking the integer status of variable values, not labels, and errors if the values are outside the range of a 32-bit integer.

One possible fix might entail:

Passing names(attr(x, "labels")) into is_integerish() instead of passing in attr(x, "labels")
Checking in is_integerish() before the final all(x_finite == as.integer(x_finite)) line for x_finite values outside the range of a 32-bit integer, and returning FALSE if present (that'd result in has_non_integer_labels() then returning TRUE; bad_labels would then contain >=1 TRUE, and write_dta() would produce an expected error)

If I've identified a real bug that seems worth addressing, I'd be happy to submit a PR with those updates (or alternative tweaks if there's an different, preferable strategy).

haven has been a lifesaver many times. Thanks!

The text was updated successfully, but these errors were encountered:

gorcha added the bug an unexpected problem or unintended behavior label Dec 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`write_dta()` can error if labelled values are large enough #739

`write_dta()` can error if labelled values are large enough #739

jacob-gg commented Nov 1, 2023

write_dta() can error if labelled values are large enough #739

write_dta() can error if labelled values are large enough #739

Comments

jacob-gg commented Nov 1, 2023

`write_dta()` can error if labelled values are large enough #739

`write_dta()` can error if labelled values are large enough #739