Add column specification as in readr #198

dpprdan · 2016-08-25T12:31:48Z

It would be nice to have the same column guessing with the same syntax as in readr.
https://github.com/hadley/readr/releases/tag/v1.0.0

jennybc · 2017-02-05T05:48:18Z

A specific request that appears in #81, at the very least, is to make it easy to set one column type for all columns. Done in readr via, e.g., cols(.default = "c").

Updated: now implemented within the existing readxl style, i.e. if col_types has length one it is recycled for all columns.

jennybc · 2017-02-17T14:48:39Z

It would be nice to specify col type for certain columns and allow others to be guessed.

Update: now implemented within the existing readxl style.
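
As a rough sketch of what the two updates above look like with the current character-vector interface, assuming the deaths.xlsx example file that ships with readxl (six columns once the first four rows are skipped):

```r
library(readxl)

path <- readxl_example("deaths.xlsx")

# A length-one col_types is recycled for every column.
all_text <- read_excel(path, skip = 4, col_types = "text")

# Mix explicit types with "guess": one entry per column, in sheet order.
# The footer rows of this sheet are not numbers or logicals, so coercion
# warnings are expected here.
mixed <- read_excel(
  path, skip = 4,
  col_types = c("text", "text", "numeric", "logical", "guess", "guess")
)
```

Note that the second form is positional: it still needs one entry per column, which is the limitation discussed further down in this thread.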

jennybc · 2017-03-04T17:34:00Z

It would be nice to know whether a col type was guessed or specified by user. Conceivably we would warn less (and maybe even coerce more aggressively?) if the type was requested vs. guessed.

Relates to a "nice to have" mentioned in #198

jennybc · 2017-10-27T20:09:32Z

I've edited the title to reflect my current thinking: it would be good to align with readr re: explicit column specification (including skipping and guessing, which is already supported). But I don't ever see doing readr-style column guessing. readr guesses based on data, which it must. readxl guesses based on Excel types, which -- I argue -- it must.

nickbond · 2018-09-10T01:27:02Z

Setting aside the guessing of formats, it would be good to at least align the syntax with readr. At the moment read_excel uses col_types "text" for example, whereas readr functions refer to "character".
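
Until the syntax is aligned, the naming gap can be bridged on the user side. A minimal sketch, assuming a hypothetical helper as_readxl_types() (not part of readxl or readr) that translates readr-style type names into the strings read_excel accepts for col_types:

```r
# Hypothetical lookup table: readr-style names -> readxl col_types strings.
readr_to_readxl <- c(
  character = "text",
  double    = "numeric",
  integer   = "numeric",  # readxl has no separate integer type
  logical   = "logical",
  date      = "date",
  datetime  = "date",     # readxl's "date" yields date-times (POSIXct)
  skip      = "skip",
  guess     = "guess"
)

# Hypothetical helper: translate a vector of readr-style names,
# failing loudly on anything without an obvious readxl equivalent.
as_readxl_types <- function(x) {
  out <- unname(readr_to_readxl[x])
  unknown <- is.na(out)
  if (any(unknown)) {
    stop("No readxl equivalent for: ",
         paste(unique(x[unknown]), collapse = ", "))
  }
  out
}

as_readxl_types(c("character", "double", "guess"))
```

The result can be passed straight to read_excel(col_types = ...). The mapping is lossy (integer and double both become "numeric"), which is one reason a real alignment would need more than a rename.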

llrs · 2019-05-25T07:53:02Z

In #571 I asked to be able to supply a long named list of column types and have each one applied when a column name matches a name in the col_types argument. If I understood correctly, with readr's col_types argument, a column whose name is not present is not guessed.

library("readr")
mtcars_csv <- system.file("extdata", "mtcars.csv", package="readr")
col_types_list <- cols_only(
    mpg = col_double(),
    cyl = col_integer(),
    disp = col_double()
)
df1 <- read_csv(mtcars_csv, col_types=col_types_list)
ncol(df1) == 3
df2 <- read_csv(mtcars_csv)
ncol(df2) == 11

I think that readxl should try to guess the column types if the type is not provided. Thanks!

TreyRoady · 2019-11-04T23:03:34Z

The ability to specify column types by name is absolutely key for large data sets with many columns.

In my case, I'm importing internal enterprise data reports. read_excel is currently mangling the date-time values in them, and the ability to go back and tweak just that column would be invaluable. Otherwise, I have to go back and hand-define hundreds of columns just to redefine one.

jennybc · 2019-11-05T01:10:06Z

This is on the roadmap for my next several months. The plan is to do a major upgrade of the col spec here at the same time as adding similar functionality to googlesheets4.

dpprdan · 2019-11-05T10:29:54Z

@llrs

If I understood correctly, with readr's col_types argument, a column whose name is not present is not guessed.

This depends on whether you specify the column types list (or col_spec) with cols_only() or cols().

If you specify it with cols_only(), like you did in your example, columns that are not present in your specification are skipped (cols_only() implicitly sets .default = col_skip()).
If you specify the col_spec with cols(), columns that are not present in your specification are guessed (cols() implicitly sets .default = col_guess()).

See also https://readr.tidyverse.org/reference/cols.html
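
To make the difference concrete, here is a small sketch using the same mtcars.csv example file as above (located here via readr_example()), specifying only the mpg column in both cases:

```r
library(readr)

mtcars_csv <- readr_example("mtcars.csv")

# cols(): columns missing from the spec are guessed, so every column is kept.
guessed <- read_csv(mtcars_csv, col_types = cols(mpg = col_double()))

# cols_only(): columns missing from the spec are skipped, so only mpg is kept.
only_mpg <- read_csv(mtcars_csv, col_types = cols_only(mpg = col_double()))

ncol(guessed)
ncol(only_mpg)
```

The first call should return all 11 columns and the second just one.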

llrs · 2019-11-05T10:33:40Z

Thanks for the clarification @dpprdan

kristjan-kure · 2022-03-29T20:22:33Z

Any updates?

jennybc · 2022-03-29T20:37:51Z

tidyverse/tidyverse.org#569

What's coming next?
I won't go so far as to promise that 2022 is the year of readxl 😉.
But I can say that top priorities include equipping readxl with better problem reporting and column specification, making its interface feel more similar to that of readr and vroom.

jxu · 2023-04-28T02:04:05Z

As of version 1.4.2, the docs say col_types takes only a character vector. Is there a possibility to take a named list, guessing for the unspecified columns?

diegomsg · 2023-08-11T14:07:00Z

Until this is implemented, a workaround I am adopting is to use the function below:

define_readxl_types <- function(col_nms, vec_names, vec_types, default_type = "guess") {
  # start with the default type for every column
  # (col_nms comes from an earlier readxl read without types)
  types <- rep(default_type, length(col_nms))

  # positions of the requested columns; NA if a name is not in the sheet
  idx <- match(vec_names, col_nms)
  found <- !is.na(idx)

  # override the default for every requested column that was found
  # (the earlier loop ran over the columns of the lookup table rather than
  # its rows, so at most three overrides were ever applied)
  types[idx[found]] <- vec_types[found]

  # return the types vector, one entry per column
  types
}

This is probably not the best solution or implementation, but it works.

Example using readxl example:

> original <- read_excel(readxl_example("deaths.xlsx"), skip = 4)
> str(original)
tibble [14 × 6] (S3: tbl_df/tbl/data.frame)
 $ Name         : chr [1:14] "David Bowie" "Carrie Fisher" "Chuck Berry" "Bill Paxton" ...
 $ Profession   : chr [1:14] "musician" "actor" "musician" "actor" ...
 $ Age          : chr [1:14] "69" "60" "90" "61" ...
 $ Has kids     : chr [1:14] "TRUE" "TRUE" "TRUE" "TRUE" ...
 $ Date of birth: POSIXct[1:14], format: "1947-01-08" "1956-10-21" "1926-10-18" "1955-05-17" ...
 $ Date of death: chr [1:14] "42379" "42731" "42812" "42791" ...

> solution <- read_excel_with_some_types(
+   readxl_example("deaths.xlsx"), skip = 4,
+   vec_names =  c("Age", "Has kids"), vec_types = c("numeric", "logical") )
Warning messages:
1: Expecting numeric in C18 / R18C3: got 'at the' 
2: Expecting logical in D18 / R18C4: got 'bottom,' 
> str(solution)
tibble [14 × 6] (S3: tbl_df/tbl/data.frame)
 $ Name         : chr [1:14] "David Bowie" "Carrie Fisher" "Chuck Berry" "Bill Paxton" ...
 $ Profession   : chr [1:14] "musician" "actor" "musician" "actor" ...
 $ Age          : num [1:14] 69 60 90 61 57 69 82 89 99 53 ...
 $ Has kids     : logi [1:14] TRUE TRUE TRUE TRUE TRUE FALSE ...
 $ Date of birth: POSIXct[1:14], format: "1947-01-08" "1956-10-21" "1926-10-18" "1955-05-17" ...
 $ Date of death: chr [1:14] "42379" "42731" "42812" "42791" ...

I haven't tested it extensively, so give it a try.
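
The example above calls read_excel_with_some_types(), which is not shown in this comment. A minimal sketch of what such a wrapper might look like follows. The function name comes from the example, but the body is an assumption: it reads the header once with n_max = 0 to get the column names, fills in a full-length col_types vector, and then reads the sheet again.

```r
library(readxl)

# Hypothetical wrapper: the body is a guess at the approach, not the
# original code. Do not pass col_types through `...`.
read_excel_with_some_types <- function(path, vec_names, vec_types,
                                       default_type = "guess", ...) {
  # First pass: read only the header row to learn the column names.
  col_nms <- names(read_excel(path, n_max = 0, ...))

  # Build a full-length col_types vector: the default everywhere,
  # overridden at the positions of the named columns.
  idx <- match(vec_names, col_nms)
  if (anyNA(idx)) {
    stop("Columns not found: ", paste(vec_names[is.na(idx)], collapse = ", "))
  }
  types <- rep(default_type, length(col_nms))
  types[idx] <- vec_types

  # Second pass: read the data with the completed col_types.
  read_excel(path, col_types = types, ...)
}
```

Reading the file twice costs a little time, but it keeps the user-facing call down to the handful of columns that actually need a fixed type.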

jxu · 2023-08-11T21:57:08Z

@diegomsg if you want the interface to be consistent with readr, the function should take a named list as col guesses
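
A sketch of what that named interface could look like on top of the current character-vector col_types, assuming a hypothetical helper expand_col_types() (not part of readxl) that takes the sheet's column names and a named list or vector of types:

```r
# Hypothetical helper: expand a named spec to one col_types entry per column.
expand_col_types <- function(col_nms, spec, default = "guess") {
  spec <- unlist(spec)

  # Refuse names that do not exist in the sheet rather than ignoring them.
  unknown <- setdiff(names(spec), col_nms)
  if (length(unknown) > 0) {
    stop("Columns not found: ", paste(unknown, collapse = ", "))
  }

  # Default everywhere, overridden where the spec names a column.
  types <- rep(default, length(col_nms))
  types[match(names(spec), col_nms)] <- as.character(spec)
  types
}

expand_col_types(
  c("Name", "Profession", "Age", "Has kids"),
  list(Age = "numeric", `Has kids` = "logical")
)
```

Unspecified columns fall back to "guess", which mirrors what cols() does in readr; passing default = "skip" would mimic cols_only().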

asadow · 2023-10-05T18:05:09Z

Echoing @nickbond, I agree on having the syntax match readr. I don't have opinions on which way, but given the same argument name col_types (and Jenny's affiliation), I expected the allowable values for col_types to be the same.

jennybc changed the title from "Feature request: add column guessing as in readr" to "Add column guessing as in readr" Jan 7, 2017

jennybc added the col_types and feature (a feature request or enhancement) labels Jan 7, 2017

jennybc changed the title from "Add column guessing as in readr" to "Add column specification/guessing as in readr" Jan 7, 2017

jennybc mentioned this issue Feb 5, 2017

Empty columns: to drop or not to drop? #157

Closed

This was referenced Feb 7, 2017

Feature Request: Read in all columns as the same type #249

Closed

Refactor xlsx col type handling #261

Merged

This was referenced Mar 5, 2017

Add logical cell & col types; refactor cell typing and coercion #277

Merged

Selective col type guessing; add CELL_UNKNOWN, COL_UNKNOWN #286

Merged

jennybc added a commit that referenced this issue Mar 6, 2017

Selective col type guessing; add CELL_UNKNOWN, COL_UNKNOWN (#286)

a7aaa2c

Relates to a "nice to have" mentioned in #198

jennybc added the future label and removed the feature (a feature request or enhancement) label Mar 29, 2017

jennybc mentioned this issue Apr 18, 2017

need timezone specification on import or read timestamps as string not number #347

Closed

jennybc mentioned this issue May 15, 2017

predict coltypes before reading. #364

Closed

jennybc mentioned this issue Jul 31, 2017

Support reading from more general inputs #278

Open

jennybc changed the title from "Add column specification/guessing as in readr" to "Add column specification as in readr" Oct 27, 2017

jennybc mentioned this issue Dec 22, 2017

feature request: use column names in col_types #415

Closed

batpigandme mentioned this issue Jul 5, 2018

col_types = cols() argument works in read_csv but not read_excel #490

Closed

jennybc mentioned this issue Nov 1, 2018

Compact string representation for col_types (feature request) #515

Closed

jennybc added the feature (a feature request or enhancement) label and removed the future label Dec 11, 2018

jennybc mentioned this issue Dec 14, 2018

Allow user to specify col type is just date or time (vs full datetime) #504

Open

jennybc mentioned this issue May 25, 2019

Use a named col_type #571

Closed

ldecicco-USGS mentioned this issue Nov 15, 2019

Force primary key columns to have same type DOI-USGS/toxEval#348

Closed

jennybc mentioned this issue May 21, 2020

Add cols() specification tidyverse/googlesheets4#156

Closed

jennybc mentioned this issue Aug 16, 2020

col_types = cols() argument works in read_csv but not read_excel #627

Closed

jennybc mentioned this issue Aug 17, 2022

show_col_types argument #698

Closed

diegomsg mentioned this issue Aug 11, 2023

Force col_types for vector of known columns, default guess for other columns #735

Closed
