Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add column specification as in readr #198

Open
dpprdan opened this issue Aug 25, 2016 · 16 comments
Open

Add column specification as in readr #198

dpprdan opened this issue Aug 25, 2016 · 16 comments
Labels
col_types feature a feature request or enhancement

Comments

@dpprdan
Copy link

dpprdan commented Aug 25, 2016

It would be nice to have the same column guessing with the same syntax as in readr.
https://github.com/hadley/readr/releases/tag/v1.0.0

@jennybc jennybc changed the title Feature request: add column guessing as in readr Add column guessing as in readr Jan 7, 2017
@jennybc jennybc added col_types feature a feature request or enhancement labels Jan 7, 2017
@jennybc jennybc changed the title Add column guessing as in readr Add column specification/guessing as in readr Jan 7, 2017
@jennybc
Copy link
Member

jennybc commented Feb 5, 2017

A specific request that appears in #81, at the very least, is to make it easy to set one column type for all columns. Done in readr via, e.g., cols(.default = "c").

Updated: now implemented within the existing readxl style, i.e. if col_types has length one it is recycled for all columns.

@jennybc
Copy link
Member

jennybc commented Feb 17, 2017

It would be nice to specify col type for certain columns and allow others to be guessed.

Update: now implemented within the existing readxl style.

@jennybc
Copy link
Member

jennybc commented Mar 4, 2017

It would be nice to know whether a col type was guessed or specified by user. Conceivably we would warn less (and maybe even coerce more aggressively?) if the type was requested vs. guessed.

jennybc added a commit that referenced this issue Mar 6, 2017
@jennybc jennybc added future and removed feature a feature request or enhancement labels Mar 29, 2017
@jennybc jennybc changed the title Add column specification/guessing as in readr Add column specification as in readr Oct 27, 2017
@jennybc
Copy link
Member

jennybc commented Oct 27, 2017

I've edited the title to reflect my current thinking: it would be good to align with readr re: explicit column specification (including skipping and guessing, which is already supported). But I don't ever see doing readr-style column guessing. readr guesses based on data, which it must. readxl guesses based on Excel types, which -- I argue -- it must.

@nickbond
Copy link

Setting aside the guessing of formats, it would be good to at least align the syntax with readr. At the moment read_excel uses col_types "text" for example, whereas readr functions refer to "character".

@llrs
Copy link

llrs commented May 25, 2019

I asked in #571 to be able to supply a long list of named col_type and use it when the name of the column matches a name on the col_type argument. If I understood correctly readr col_type argument if a column name is not present it doesn't try to guess the type of the column.

library("readr")
mtcars_csv <- system.file("extdata", "mtcars.csv", package="readr")
col_types_list <- cols_only(
    mpg = col_double(),
    cyl = col_integer(),
    disp = col_double()
)
df1 <- read_csv(mtcars_csv, col_types=col_types_list)
ncol(df1) == 3
df2 <- read_csv(mtcars_csv)
ncol(df2) == 11

I think that readxl should try to guess the column types if the type is not provided. Thanks!

@TreyRoady
Copy link

The ability to specify column types by name is absolutely key for large data sets with many columns.

In my case, I'm importing internal enterprise data reports. Read_excel is currently mangling the date-time objects in it, and the ability to go back and tweak just that column would be absolutely invaluable. Otherwise, I'm having to go back and hand-define hundreds of columns just to redefine one.

@jennybc
Copy link
Member

jennybc commented Nov 5, 2019

This is on the roadmap for my next several months. The plan is to do a major upgrade of the col spec here at the same time as adding similar functionality to googlesheets4.

@dpprdan
Copy link
Author

dpprdan commented Nov 5, 2019

@llrs

If I understood correctly readr col_type argument if a column name is not present it doesn't try to guess the type of the column.

This depends on whether you specify the column types list (or col_spec) with cols_only() or cols().

  1. If you specify it with cols_only(), like you did in your example, columns that are not present in your specification are skipped (cols_only() implicitly sets .default = col_skip()).
  2. If you specify the col_spec with cols(), columns that are not present in your specification are guessed (cols() implicitly sets .default = col_guess()).

See also https://readr.tidyverse.org/reference/cols.html

@llrs
Copy link

llrs commented Nov 5, 2019

Thanks for the clarification @dpprdan

@kristjan-kure
Copy link

Any updates?

@jennybc
Copy link
Member

jennybc commented Mar 29, 2022

tidyverse/tidyverse.org#569

What's coming next?
I won't go so far as to promise that 2022 is the year of readxl 😉.
But I can say that top priorities include equipping readxl with better problem reporting and column specification, making its interface feel more similar to that of readr and vroom.

@jxu
Copy link

jxu commented Apr 28, 2023

As of version 1.4.2, the docs say col_types takes only a character vector. Is there a possibility to take a named list, guessing for the unspecified columns?

@diegomsg
Copy link

Until this is not implemented, a workaround I am adopting is using the functions below:

define_readxl_types <- function(col_nms, vec_names, vec_types, default_type = "guess") {
  #get names from earlier readxl without types
  cols <- dplyr::tibble(
    name = col_nms,
    index = seq(1:length(col_nms)),
    type = default_type
  )
  
  #build names and types
  col_def <- dplyr::tibble(
    name = vec_names,
    type = vec_types
  )
  
  #get indexes
  col_def$index <- purrr::map_vec(col_def$name, ~match(.x, cols$name))
  
  #iterate defined types
  for (i in 1:length(col_def)) {
    cols$type[col_def$index[i]] <- col_def$type[i]
  }
  
  #return types vector
  return(cols$type)
}

Probably this is not the best solution, best implementation. It works, though.

Example using readxl example:

> original <- read_excel(readxl_example("deaths.xlsx"), skip = 4)
> str(original)
tibble [14 × 6] (S3: tbl_df/tbl/data.frame)
 $ Name         : chr [1:14] "David Bowie" "Carrie Fisher" "Chuck Berry" "Bill Paxton" ...
 $ Profession   : chr [1:14] "musician" "actor" "musician" "actor" ...
 $ Age          : chr [1:14] "69" "60" "90" "61" ...
 $ Has kids     : chr [1:14] "TRUE" "TRUE" "TRUE" "TRUE" ...
 $ Date of birth: POSIXct[1:14], format: "1947-01-08" "1956-10-21" "1926-10-18" "1955-05-17" ...
 $ Date of death: chr [1:14] "42379" "42731" "42812" "42791" ...

> solution <- read_excel_with_some_types(
+   readxl_example("deaths.xlsx"), skip = 4,
+   vec_names =  c("Age", "Has kids"), vec_types = c("numeric", "logical") )
Warning messages:
1: Expecting numeric in C18 / R18C3: got 'at the' 
2: Expecting logical in D18 / R18C4: got 'bottom,' 
> str(solution)
tibble [14 × 6] (S3: tbl_df/tbl/data.frame)
 $ Name         : chr [1:14] "David Bowie" "Carrie Fisher" "Chuck Berry" "Bill Paxton" ...
 $ Profession   : chr [1:14] "musician" "actor" "musician" "actor" ...
 $ Age          : num [1:14] 69 60 90 61 57 69 82 89 99 53 ...
 $ Has kids     : logi [1:14] TRUE TRUE TRUE TRUE TRUE FALSE ...
 $ Date of birth: POSIXct[1:14], format: "1947-01-08" "1956-10-21" "1926-10-18" "1955-05-17" ...
 $ Date of death: chr [1:14] "42379" "42731" "42812" "42791" ...

Didn't tested it extensively, give it a try.

@jxu
Copy link

jxu commented Aug 11, 2023

@diegomsg if you want the interface to be consistent with readr, the function should take a named list as col guesses

@asadow
Copy link

asadow commented Oct 5, 2023

Echoing @nickbond, I agree on having the syntax match readr. I don't have opinions on which way, but given the same argument name col_types (and Jenny's affiliation), I expected allowable values for col_type to be the same.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
col_types feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

9 participants