readr no longer reproduces the problems from challenge.csv #1398

ganong123 · 2022-04-18T02:47:38Z

challenge.csv is designed to teach some key challenges of parsing and features of readr. However, the example is broken.

challenge <- read_csv(readr_example("challenge.csv"))
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   x = col_double(),
#>   y = col_logical()
#> )
#> Warning: 1000 parsing failures.
#>  row col           expected     actual                                                           file
#> 1001   y 1/0/T/F/TRUE/FALSE 2015-01-16 '/Users/runner/work/_temp/Library/readr/extdata/challenge.csv'
#> 1002   y 1/0/T/F/TRUE/FALSE 2018-05-18 '/Users/runner/work/_temp/Library/readr/extdata/challenge.csv'
#> 1003   y 1/0/T/F/TRUE/FALSE 2015-09-05 '/Users/runner/work/_temp/Library/readr/extdata/challenge.csv'
#> 1004   y 1/0/T/F/TRUE/FALSE 2012-11-28 '/Users/runner/work/_temp/Library/readr/extdata/challenge.csv'
#> 1005   y 1/0/T/F/TRUE/FALSE 2020-01-13 '/Users/runner/work/_temp/Library/readr/extdata/challenge.csv'
#> .... ... .................. .......... ..............................................................
#> See problems(...) for more details.

source

Here's what I get when I run this code. This is a screen capture from the vignette, but I see the same result on my computer.


sbearrows · 2022-08-30T22:56:39Z

@jennybc It seems like this is due to vroom, since I can reproduce the parsing failures with edition 1:

library(readr)
with_edition(
  1,
  read_csv(readr_example("challenge.csv"), show_col_types = FALSE)
)
#> Warning: 1000 parsing failures.
#>  row col           expected     actual                                                                                               file
#> 1001   y 1/0/T/F/TRUE/FALSE 2015-01-16 '/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/readr/extdata/challenge.csv'
#> 1002   y 1/0/T/F/TRUE/FALSE 2018-05-18 '/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/readr/extdata/challenge.csv'
#> 1003   y 1/0/T/F/TRUE/FALSE 2015-09-05 '/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/readr/extdata/challenge.csv'
#> 1004   y 1/0/T/F/TRUE/FALSE 2012-11-28 '/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/readr/extdata/challenge.csv'
#> 1005   y 1/0/T/F/TRUE/FALSE 2020-01-13 '/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/readr/extdata/challenge.csv'
#> .... ... .................. .......... ..................................................................................................
#> See problems(...) for more details.
#> # A tibble: 2,000 × 2
#>        x y    
#>    <dbl> <lgl>
#>  1   404 NA   
#>  2  4172 NA   
#>  3  3004 NA   
#>  4   787 NA   
#>  5    37 NA   
#>  6  2332 NA   
#>  7  2489 NA   
#>  8  1449 NA   
#>  9  3665 NA   
#> 10  3863 NA   
#> # … with 1,990 more rows
#> # ℹ Use `print(n = ...)` to see more rows

^{Created on 2022-08-30 by the reprex package (v2.0.1.9000)}
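One way to check that the difference comes from column guessing is to compare the guessed specification under each edition. This is only a minimal sketch, assuming readr 2.0 or later is installed; the names `spec_1e` and `spec_2e` are illustrative.

```r
library(readr)

path <- readr_example("challenge.csv")

# Edition 2 (the default in readr >= 2.0) delegates parsing to vroom.
spec_2e <- spec(read_csv(path, show_col_types = FALSE))

# Edition 1 uses the legacy parser; the parsing-failure warning is
# suppressed because only the guessed column types matter here.
spec_1e <- with_edition(
  1,
  suppressWarnings(spec(read_csv(path, show_col_types = FALSE)))
)

# Print both so the guessed type of `y` can be compared side by side.
print(spec_2e)
print(spec_1e)
```

If the two specifications disagree on `y`, that would be consistent with the guess, not the file, being what changed.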

This is happening in a vignette that we already want/need to update, so maybe a minimal solution is best for now. We could specify the column types, like we do for the problems() tests:

library(readr)
read_csv(
  readr_example("challenge.csv"),
  show_col_types = FALSE,
  col_types = "dl"
)
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
#> # A tibble: 2,000 × 2
#>        x y    
#>    <dbl> <lgl>
#>  1   404 NA   
#>  2  4172 NA   
#>  3  3004 NA   
#>  4   787 NA   
#>  5    37 NA   
#>  6  2332 NA   
#>  7  2489 NA   
#>  8  1449 NA   
#>  9  3665 NA   
#> 10  3863 NA   
#> # … with 1,990 more rows
#> # ℹ Use `print(n = ...)` to see more rows

^{Created on 2022-08-30 by the reprex package (v2.0.1.9000)}
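For the vignette, the point of forcing the failure would be to show what problems() returns. A rough sketch of that follow-up step, assuming the same col_types = "dl" call as above (the name `challenge_problems` is just for illustration):

```r
library(readr)

challenge <- read_csv(
  readr_example("challenge.csv"),
  show_col_types = FALSE,
  col_types = "dl"
)

# problems() returns a tibble with one row per parsing issue,
# so it can be inspected or filtered like any other data frame.
challenge_problems <- problems(challenge)
print(challenge_problems)
```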

Otherwise we'd need to modify or replace challenge.csv with something that actually trips vroom. An easy candidate would be a file with a varying number of columns per row, because vroom will always warn:

library(readr)
# create a file like this
df <- glue::glue('x,y
                 1,2
                 3,4
                 5,6
                 7
                 8,9
                 10')

tf <- withr::local_tempfile(lines = df)

read_csv(tf, show_col_types = FALSE)
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
#> # A tibble: 6 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     2
#> 2     3     4
#> 3     5     6
#> 4     7    NA
#> 5     8     9
#> 6    10    NA

^{Created on 2022-08-30 by the reprex package (v2.0.1.9000)}
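The same ragged file can be used to look at the problems themselves. This sketch writes the file with base R's tempfile() and writeLines() instead of glue and withr, purely to keep it dependency-free; `ragged_path` and `ragged_problems` are illustrative names.

```r
library(readr)

# Write a small ragged CSV: two rows are missing the second field.
ragged_path <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2", "3,4", "5,6", "7", "8,9", "10"), ragged_path)

ragged <- read_csv(ragged_path, show_col_types = FALSE)

# Collect the recorded parsing issues for the short rows.
ragged_problems <- problems(ragged)
print(ragged_problems)

# Clean up the temporary file.
unlink(ragged_path)
```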

jennybc · 2022-08-30T23:25:30Z

I think you should update that section of the vignette for the modern readr 2e / vroom era.

Each of these functions first calls spec_xxx() (as described above), and then parses the file according to that column specification:

^ this is no longer true, needs rewording

The rectangular parsing functions almost always succeed; they’ll only fail if the format is severely messed up. Instead, readr will generate a data frame of problems. The first few will be printed out, and you can access them all with problems():

One option is to force parsing problems to happen with challenge.csv, which at least allows some discussion of problems(). As you say, you can force it by providing (bad) column types. But this is pretty artificial; it's a stopgap.

It would be better (but harder) to think about what sort of realistic problems we want to demonstrate and create a new small dataset that has such a problem. Issues relating to parsing problems in readr and vroom would be a good source of inspiration.

sbearrows · 2022-08-31T17:46:46Z

The most common uses of problems() that I can see are either (1) a user specifies col_types, but some data later in the file is unexpected and doesn't match col_types (#1376), or (2) the number of columns per row varies (#1328, tidyverse/vroom#439). So I don't think either of the two situations is necessarily artificial or forced.
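The first situation could be demonstrated with a very small dataset. A sketch under stated assumptions: the data and the names `csv_text`, `counts`, and `count_problems` are made up for illustration and are not taken from the linked issues.

```r
library(readr)

# Hypothetical data: `count` is declared as integer,
# but a later row holds text instead of a number.
csv_text <- "id,count\n1,10\n2,12\n3,unknown\n4,9\n"

# I() marks the string as literal data rather than a file path.
counts <- read_csv(I(csv_text), col_types = "ii")

# The unparseable value becomes NA and is recorded as a problem.
count_problems <- problems(counts)
print(count_problems)
```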

sbearrows added the documentation label Aug 25, 2022

sbearrows self-assigned this Aug 25, 2022

sbearrows linked a pull request Sep 2, 2022 that will close this issue

Rework problems example in "Getting Started" #1431

Open

hadley unassigned sbearrows Jul 31, 2023
