-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Empty columns: to drop or not to drop? #157
Comments
I now see that dropping empty rows and columns is actually mentioned as a feature in the readxl README. But I’d consider it a _mis_feature. Having columns disappear or appear depending on their content is almost a recipe for disaster, IMO. (It certainly caused some hard-to-debug problems for me, resulting in this bug report and bug report #156.) Also, the ‘feature’ of automatically dropping empty columns (arguably) doesn’t actually work, as shown by the following code:
Columns D and G are empty but not dropped. |
This seems mutually exclusive with #156 ;) The problem is that in excel there's no way to tell the difference between an empty cell and a cell that does not exist, but that difference exists in the on-disk serialisation of the workbook. I think the intent is for readxl to treat them both the same way, as excel does, but it looks like we're inconsistent wrt cols and rows. |
Hmm. I’m not sure how this is mutually exclusive with #156. The preferred solution for both bugs would be that both empty rows and columns should be included in the resulting data frame (perhaps with an option to drop them, which I agree can be useful sometimes, especially for messy Excel data). BTW, I didn’t know how the data was being stored in the Excel file. I have now taken a look. It’s not pretty. :/ |
I came here because I was trying to read a folder of excel files produced in a poorly formatted way. The given column names were paragraphs, so I did the skip=1, produced my own names and types -- and got the same problem with dropped columns, specifically that the number of names and columns were not equal. Also, got a number of weird date formatting warnings (expecting date but got '12/28/1986'), but that's another topic. |
Adding my woes here. I have a file where the row names in the Excel file are not analysis-friendly -- spaces, punctuation, overly-long names, etc. So I tried to read it by using Unfortunately a lot of the columns are empty besides the header, which caused the hard-to-detect error mentioned by huftis. Seems inconsistent to get a An easy fix might be to ignore For now I'm setting |
Hi! This is also a major problem for me. I'm importing a bunch of data, that are filled in template xlsx files that has empty columns. However, sometimes users can have entered something in a cell of some empty column, and then all columns is shifted, and there is no convenient and safe way to see how that has been done. What would be most convenient is to be able to ask for the column names to be set according to the excel standard (i.e. "A", "B", ...). One can then write expressions such as An open question: What is the purpose of naming the columns X1,...,XN when empty columns are not counted? In read.xlsx of the xlsx package, empty columns are counted and the colnames can for example be X2,X3,X4 if the first column is missing (and thus skipped). In read_excel however, they are always named X1,..., even if some columns are skipped. To me, these names seem rather redundant since one might as well access a column with 1 as with "X1", and in the current setting they are always 1-to-1!? Suggestion: Create a from_excel variable with a special class (with the same name) or other attribute, and then allow that to be sent to read_excel as in How difficult would it be to detect what actual column each read column corresponds to? Since MS Excel and read.xlsx does it, it must be feasible somehow!? |
The row skipping behaviour of xlsx is about to get straightened out (#240), so I'm editing the title of this issue to better reflect the important points of discussion. Which is really about column retention and naming. |
I'm beginning to think that readxl should NOT drop empty columns automatically. readr does not do so either: readr::read_csv(",x,,y,\n,1,,2,\n,,,,\n")
#> Warning: Missing column names filled in: 'X1' [1], 'X3' [3], 'X5' [5]
#> # A tibble: 2 × 5
#> X1 x X3 y X5
#> <chr> <int> <chr> <int> <chr>
#> 1 <NA> 1 <NA> 2 <NA>
#> 2 <NA> NA <NA> NA <NA> I am talking specifically about leading or embedded empty columns, as opposed to trailing. Trailing empty, unnamed columns are impossible to get because they are basically undefined. I think that's what @hadley means by
readr even retains trailing empty rows (see above), which, for the same reason as for columns, isn't really possible in the Excel domain. If someone does NOT want an embedded empty column, it is already possible to skip it by setting its type to path <- system.file("extdata/datasets.xlsx", package = "readxl")
readxl::read_excel(path, col_types = c("blank", "blank", "numeric", "text", 'text'))
#> # A tibble: 150 × 3
#> Petal.Length Petal.Width Species
#> <dbl> <chr> <chr>
#> 1 1.4 0.2 setosa
#> 2 1.4 0.2 setosa
#> 3 1.3 0.2 setosa
#> 4 1.5 0.2 setosa
#> 5 1.4 0.2 setosa
#> 6 1.7 0.4 setosa
#> 7 1.4 0.3 setosa
#> 8 1.5 0.2 setosa
#> 9 1.4 0.2 setosa
#> 10 1.5 0.1 setosa
#> # ... with 140 more rows @hadley Do you agree that readxl should not drop an embedded empty column, even if it is unnamed? If people want to read from a specific cell rectangle, even if there's no data there in the current sheet, that is a matter for #8. |
Agreed. I think the current behaviour is mostly due to my naïveté about how excel records empty cells. |
I think I have run into something related to this thread. Sometimes header information falls into 2 or more rows in an Excel file, akin to this helpful new vignette. In wrangling some data with a similar format, I think I have found a situation where Using clippy.xlsx from library(readxl) #using dev vesion from GitHub for clippy.xlsx with Sheet #2
library(magrittr)
clippy_shipped_colnames1 <- readxl_example("clippy.xlsx") %>%
read_excel(sheet = 2, col_names = FALSE, n_max = 1)
#> New names:
#> -> ..1
#> -> ..2
#> -> ..3
#> -> ..4
clippy_shipped_colnames2 <- readxl_example("clippy.xlsx") %>%
read_excel(sheet = 2, col_names = FALSE, n_max = 1, skip = 1)
#> New names:
#> -> ..1
#> -> ..2
#> -> ..3
#> -> ..4 Created on 2018-07-19 by the reprex If you have a file with some leading NAs in one of the target rows, they are dropped. Using this edited clippy file as an example: clippy_foo.xlsx
This might be intentional and there are other ways to pull the row w/o dropping the |
I note that you're using pretty different arguments than the vignette, i.e. But yes, by default, readxl "shrink-wraps" data. So leading or trailing blank cells could cause the initial col-name read to shrink to the contiguous non-empty cells and produce too few column names. If that is an issue, you'll need to provide more external info about your desired geometry. This vignette explains the various ways to express this: |
No worries! It is hard to know what to do re: blank cells but the current "shrink wrap by default" approach was chosen with much thought and has held up pretty well since it went info full force >1 year ago. But yeah it means you will sometimes have to take the |
When importing Excel files, empty columns seem to be deleted if one uses a non-zero
skip
argument inread_excel()
. Here’s an example file:http://huftis.org/nedlasting/excel/row-col-test.xlsx
The file contains two similar worksheets/tabs. Both have 8 columns (named A–H) and 8 rows. The diagonal is filled with the numbers 1 to 8 for the first worksheet. The second worksheet is similar, but is missing the number 4 and 7, resulting in column 4 and and 7 being empty.
Reading the first worksheet works fine:
But reading the second worksheet results in a data frame where the empty columns are missing:
Note that the (automatically generated) column names indicate that there once was an X3 and X6 column.
If one uses
skip=0
, empty columns are not dropped:(Here the values have been converted to character values, since the first row contains non-numeric data.)
The text was updated successfully, but these errors were encountered: