-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
readxl skip and column designation not working if there are blanks #101
Comments
I have experienced a similar glitch. |
Concrete examples of the above would be helpful to work against. How to provide a readxl reprexWe're in a much better position to address your issue if you can provide a reprex (reproducible example). Provide as much of this as you can:
How to provide the xls/xlsx file? In order of preference:
|
#240) * Load cells at xlsx worksheet ingest; handle skipping and/or blank rows; fixes #224 * Actually these *should* be the same * Make skipping tests more challenging (blank row btwn col names and data, plus another embedded blank row) * Simplify return of 0x0 tibble for completely empty worksheet * Mark cells to start reading from at the time of worksheet construction * Remove vestigial, internal cell printing function * Be quiet about empty worksheet * More tests of nothingness * Inform worksheet about its own name * Tighten up the NEWs bullet for this PR * Make some accessor member functions const * Use skip() to issue note-to-future-self Expedient place to park a summary of what this entire PR does. * Improved parsing of sheet geometry for xlsx. (#240, @jennybc). - Better handling of leading and embedded blank rows and explicit row skipping. (#224, #194, #178, #156, #101) - Worksheets that are completely empty or that contain only column names no longer error, but return a tibble with zero rows. (#222, #144, #65) - Location is inferred for cells that do not declare their location (e.g. xlsx written by JMP). (#163, #102)
I have an MS excel sheet containing around 16,000 rows and 16 columns. There is some un-needed text above the column headers, so I need to skip the first three rows when reading the data in (4th row contains column headers, 5th row is the start of the data).
However I noticed that using skip=3 resulted in the column headers not being read in and everything started on the 5th row.
Also, the 2nd and 3rd column of my data contain text (names) but in one file the first 1500 cells in these columns are blank because the names are missing. These columns are misidentified as numeric unless I specify col_types. The 4th and subsequent columns contain data for all rows. I tried adding an initial column with the sequence number but the result was the same.
I think both behaviors are due to the missing data; not sure if it would be possible to correct this by going to the first cell that has data in it for each column and then determining the column type from there and below (I guess if the data is mixed numeric and text, the column type would default to text and in order to do this multiple cells would have to be evaluated?)
If the automated column type assignation would work even when there is some missing data, this would be helpful as although not too much of a problem to manually specify the types for 16 columns, I often have many more than this.
Would be great if there was a fix for this - otherwise the package works great and I found it much faster than the java based ones.
The text was updated successfully, but these errors were encountered: