-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add column specification as in readr #198
Comments
A specific request that appears in #81, at the very least, is to make it easy to set one column type for all columns. Done in readr via, e.g., Updated: now implemented within the existing readxl style, i.e. if |
It would be nice to specify col type for certain columns and allow others to be guessed. Update: now implemented within the existing readxl style. |
It would be nice to know whether a col type was guessed or specified by user. Conceivably we would warn less (and maybe even coerce more aggressively?) if the type was requested vs. guessed. |
Relates to a "nice to have" mentioned in #198
I've edited the title to reflect my current thinking: it would be good to align with readr re: explicit column specification (including skipping and guessing, which is already supported). But I don't ever see doing readr-style column guessing. readr guesses based on data, which it must. readxl guesses based on Excel types, which -- I argue -- it must. |
Setting aside the guessing of formats, it would be good to at least align the syntax with readr. At the moment read_excel uses col_types "text" for example, whereas readr functions refer to "character". |
I asked in #571 to be able to supply a long list of named col_type and use it when the name of the column matches a name on the col_type argument. If I understood correctly readr col_type argument if a column name is not present it doesn't try to guess the type of the column. library("readr")
mtcars_csv <- system.file("extdata", "mtcars.csv", package="readr")
col_types_list <- cols_only(
mpg = col_double(),
cyl = col_integer(),
disp = col_double()
)
df1 <- read_csv(mtcars_csv, col_types=col_types_list)
ncol(df1) == 3
df2 <- read_csv(mtcars_csv)
ncol(df2) == 11 I think that readxl should try to guess the column types if the type is not provided. Thanks! |
The ability to specify column types by name is absolutely key for large data sets with many columns. In my case, I'm importing internal enterprise data reports. Read_excel is currently mangling the date-time objects in it, and the ability to go back and tweak just that column would be absolutely invaluable. Otherwise, I'm having to go back and hand-define hundreds of columns just to redefine one. |
This is on the roadmap for my next several months. The plan is to do a major upgrade of the col spec here at the same time as adding similar functionality to googlesheets4. |
This depends on whether you specify the column types list (or
|
Thanks for the clarification @dpprdan |
Any updates? |
As of version 1.4.2, the docs say |
Until this is not implemented, a workaround I am adopting is using the functions below: define_readxl_types <- function(col_nms, vec_names, vec_types, default_type = "guess") {
#get names from earlier readxl without types
cols <- dplyr::tibble(
name = col_nms,
index = seq(1:length(col_nms)),
type = default_type
)
#build names and types
col_def <- dplyr::tibble(
name = vec_names,
type = vec_types
)
#get indexes
col_def$index <- purrr::map_vec(col_def$name, ~match(.x, cols$name))
#iterate defined types
for (i in 1:length(col_def)) {
cols$type[col_def$index[i]] <- col_def$type[i]
}
#return types vector
return(cols$type)
} Probably this is not the best solution, best implementation. It works, though. Example using > original <- read_excel(readxl_example("deaths.xlsx"), skip = 4)
> str(original)
tibble [14 × 6] (S3: tbl_df/tbl/data.frame)
$ Name : chr [1:14] "David Bowie" "Carrie Fisher" "Chuck Berry" "Bill Paxton" ...
$ Profession : chr [1:14] "musician" "actor" "musician" "actor" ...
$ Age : chr [1:14] "69" "60" "90" "61" ...
$ Has kids : chr [1:14] "TRUE" "TRUE" "TRUE" "TRUE" ...
$ Date of birth: POSIXct[1:14], format: "1947-01-08" "1956-10-21" "1926-10-18" "1955-05-17" ...
$ Date of death: chr [1:14] "42379" "42731" "42812" "42791" ...
> solution <- read_excel_with_some_types(
+ readxl_example("deaths.xlsx"), skip = 4,
+ vec_names = c("Age", "Has kids"), vec_types = c("numeric", "logical") )
Warning messages:
1: Expecting numeric in C18 / R18C3: got 'at the'
2: Expecting logical in D18 / R18C4: got 'bottom,'
> str(solution)
tibble [14 × 6] (S3: tbl_df/tbl/data.frame)
$ Name : chr [1:14] "David Bowie" "Carrie Fisher" "Chuck Berry" "Bill Paxton" ...
$ Profession : chr [1:14] "musician" "actor" "musician" "actor" ...
$ Age : num [1:14] 69 60 90 61 57 69 82 89 99 53 ...
$ Has kids : logi [1:14] TRUE TRUE TRUE TRUE TRUE FALSE ...
$ Date of birth: POSIXct[1:14], format: "1947-01-08" "1956-10-21" "1926-10-18" "1955-05-17" ...
$ Date of death: chr [1:14] "42379" "42731" "42812" "42791" ... Didn't tested it extensively, give it a try. |
@diegomsg if you want the interface to be consistent with readr, the function should take a named list as col guesses |
Echoing @nickbond, I agree on having the syntax match readr. I don't have opinions on which way, but given the same argument name |
It would be nice to have the same column guessing with the same syntax as in readr.
https://github.com/hadley/readr/releases/tag/v1.0.0
The text was updated successfully, but these errors were encountered: