`write_dta()` can error if labelled values are large enough #739
Labels: bug
Hi, all:
Currently, if labelled values are large enough (beyond the range of a 32-bit integer), `write_dta()` can error because an `if` statement in `validate_dta()` (called as part of `write_dta()`) can fail to resolve to `TRUE` or `FALSE`.

First, a reprex:
Created on 2023-11-01 with reprex v2.0.2
When `write_dta(data)` is executed, `validate_dta()` is called on `data` as the first step of the write process. `validate_dta()` contains the following lines (starting at 158 here):

`has_non_integer_labels()` (starting at 185 here) is run on each column of `data`, and each call is expected to return a logical value. Ideally, all elements of `bad_labels` end up as `TRUE` or `FALSE`. If so, `if (any(bad_labels))` can be resolved to `TRUE` or `FALSE` and execute (or not) accordingly.

If a column has a non-null `labels` attribute, is labelled, and is double, then `has_non_integer_labels()` returns the result of running `!is_integerish(attr(x, "labels"))` on the column in question.
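To make the flow concrete, here is a base-R approximation of the check. It is a simplified sketch rather than haven's source: the `_sketch` function names are made up for illustration, the "is labelled" test is left out so the example runs without haven, and plain double vectors carrying a `labels` attribute stand in for labelled columns.

```r
# Approximation of the integer check; only the final all(...) line is
# taken from the issue text, the rest is assumed.
is_integerish_sketch <- function(x) {
  x_finite <- x[!is.na(x) & is.finite(x)]
  all(x_finite == as.integer(x_finite))
}

# Approximation of the per-column check (hypothetical name).
has_non_integer_labels_sketch <- function(x) {
  labels <- attr(x, "labels")
  # No labels attribute, or not a double column: nothing to flag.
  if (is.null(labels) || !is.double(x)) {
    return(FALSE)
  }
  !is_integerish_sketch(labels)
}

# Two stand-in columns with small label values.
data <- list(
  ok  = structure(c(1, 2), labels = c(Yes = 1, No = 2)),
  bad = structure(c(1, 2), labels = c(Yes = 1.5, No = 2))
)

# One logical per column, as the validation step expects.
bad_labels <- vapply(data, has_non_integer_labels_sketch, logical(1))
print(bad_labels)

# With only TRUE/FALSE elements, the condition resolves cleanly.
if (any(bad_labels)) {
  message("At least one column has non-integer label values.")
}
```

With small values like these, every element of `bad_labels` is `TRUE` or `FALSE`, so the `if` condition is well defined.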
(Sidebar: There may be another issue that arises from passing `attr(x, "labels")` into `is_integerish()`; more on that below, but I'll continue on for the moment.)
The trouble arises if what's passed to `is_integerish()` are values like 9999999999 and 9999999998, in which case `is_integerish()` returns `NA` instead of `TRUE` or `FALSE`. `is_integerish()` coerces its finite, non-missing input values with `as.integer()`, but since (at least in my understanding) R uses 32-bit integers, it can't successfully coerce a value like 9999999999 to integer, and `as.integer()` returns `NA` instead.

So: If a value passed into `is_integerish()` is large enough, at least one element of `bad_labels` (in `validate_dta()`) winds up as `NA`.

If at least one element of `bad_labels` other than the `NA`(s) is `TRUE`, `if (any(bad_labels))` will still successfully resolve (to `TRUE`), at which point `write_dta()` will throw an expected error with `cli::cli_abort()`. However, if all of the elements of `bad_labels` other than the `NA`(s) are `FALSE`, `if (any(bad_labels))` can't resolve, the function errors, and no .dta file is written out. E.g., the following `if` statement can't resolve to `TRUE` or `FALSE`:

Created on 2023-11-01 with reprex v2.0.2
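The `NA` propagation can be traced step by step with base R alone. This is a sketch of the mechanism, not haven's code: the element names `small_col` and `big_col` are hypothetical, and the comparison line simply mirrors the `all(x_finite == as.integer(x_finite))` expression.

```r
big <- c(9999999998, 9999999999)

# These exceed the 32-bit integer range, so as.integer() yields NA
# (normally with a warning, suppressed here).
coerced <- suppressWarnings(as.integer(big))

# Comparing against NA gives NA, and all() over only NA is NA.
integerish <- all(big == coerced)

# Negating NA keeps it NA; any() over FALSE and NA is also NA.
bad_labels <- c(small_col = FALSE, big_col = !integerish)
any_bad <- any(bad_labels)

# if (NA) is an error in R; capture the message instead of stopping.
msg <- tryCatch(
  if (any_bad) "abort" else "write",
  error = function(e) conditionMessage(e)
)
print(msg)

# A TRUE elsewhere masks the NA, so the condition resolves in that case.
masked <- any(c(TRUE, NA))
```

The last line shows why the failure only appears when every other column passes the check: a single `TRUE` is enough for `any()` to return `TRUE` regardless of the `NA`.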
This can cause issues in some "real-world" data sets. For example, trying to convert this ICPSR SPSS .por file to .dta throws an error. (Fair warning: An ICPSR account is needed to download that data set.)
Finally, picking up the sidebar from above: As written, is it possible that the `is_integerish(attr(x, 'labels'))` call within `has_non_integer_labels()` operates on column values and not labels? And if so, is that intended? I may be off-base, but here's a reprex showing what I'm referring to:

Created on 2023-11-01 with reprex v2.0.2
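For readers less familiar with how a `labels` attribute is laid out, the small base-R illustration below shows what each accessor returns. It assumes the attribute is a named numeric vector (names hold the label text, values hold the codes being labelled); the label text and values are made up.

```r
# Hypothetical labels attribute on a double column.
labels <- c("Don't know" = 9999999998, "Refused" = 9999999999)
x <- structure(c(1, 9999999999, 2), labels = labels)

# The values of the attribute are the numeric codes...
label_values <- unname(attr(x, "labels"))

# ...while the names of the attribute are the label text.
label_text <- names(attr(x, "labels"))

print(label_values)
print(label_text)
```

Passing `attr(x, "labels")` to a numeric check therefore tests the codes, whereas `names(attr(x, "labels"))` is a character vector.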
It's always possible that I'm misinterpreting something, but it seems to me like `is_integerish()` currently checks the integer status of variable values, not labels (and it errors if the values are sufficiently large). Since `is_integerish()` is called inside `has_non_integer_labels()`, I was expecting the check to be of the integer status of labels.

In summary: I think that `write_dta()` currently winds up checking the integer status of variable values, not labels, and errors if the values are outside the range of a 32-bit integer.

One possible fix might entail:

- Passing `names(attr(x, "labels"))` into `is_integerish()` instead of passing in `attr(x, "labels")`
- Checking in `is_integerish()`, before the final `all(x_finite == as.integer(x_finite))` line, for `x_finite` values outside the range of a 32-bit integer, and returning `FALSE` if present (that'd result in `has_non_integer_labels()` then returning `TRUE`; `bad_labels` would then contain >=1 `TRUE`, and `write_dta()` would produce an expected error)

If I've identified a real bug that seems worth addressing, I'd be happy to submit a PR with those updates (or alternative tweaks if there's a different, preferable strategy).
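As a rough illustration of the range-check idea, the sketch below adds a guard before the final comparison. The function name `is_integerish_guarded` is hypothetical and the body is an assumption about how such a check could look, not a patch against haven's source.

```r
is_integerish_guarded <- function(x) {
  x_finite <- x[!is.na(x) & is.finite(x)]

  # Values beyond R's 32-bit integer range can never be stored as integer,
  # so report them as non-integerish instead of letting NA leak out.
  if (any(abs(x_finite) > .Machine$integer.max)) {
    return(FALSE)
  }

  all(x_finite == as.integer(x_finite))
}

# Large values now give a definite FALSE rather than NA.
print(is_integerish_guarded(c(9999999998, 9999999999)))

# Ordinary inputs behave as before.
print(is_integerish_guarded(c(1, 2, NA)))
print(is_integerish_guarded(1.5))
```

Because the guard returns before `as.integer()` is ever called on out-of-range values, no coercion warning is raised and the result is always `TRUE` or `FALSE`.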
`haven` has been a lifesaver many times. Thanks!