-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor xlsx col type handling #261
Conversation
Even though I'm not sure what it's for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall strategy looks good.
I'll review again once you've renamed a few variables along my suggestions — I think that will make it easier for me to understand the code flow.
I'd also seriously consider computing and caching the cell type when it's constructed.
src/CellType.h
Outdated
enum CellType { | ||
CELL_BLANK, | ||
CELL_DATE, | ||
CELL_NUMERIC, | ||
CELL_SKIP, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's coincidental that the names were previously in alphabetical order. I think CELL_SKIP
should go next to CELL_BLANK
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just got an education on the importance of the order here. Short version: CELL_BLANK
must be first (= default type), but CELL_SKIP
can come next.
src/CellType.h
Outdated
} else if (type == "text") { | ||
types.push_back(CELL_TEXT); | ||
} else { | ||
Rcpp::warning("Unknown type '%s' at position %i. Using text instead.", | ||
Rcpp::warning("Unknown type '%s' at position %i. Using 'text' instead.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be better to just stop here to clarify that error handling happens on the R side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
case CELL_NUMERIC: | ||
switch(type) { | ||
case CELL_BLANK: | ||
REAL(col)[i] = NA_REAL; | ||
break; | ||
case CELL_NUMERIC: | ||
case CELL_SKIP: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this go with CELL_BLANK
? (Not that it really matters)
src/XlsxWorkSheet.cpp
Outdated
std::vector<CellType> colTypes; | ||
switch(TYPEOF(col_types)) { | ||
case NILSXP: | ||
colTypes = ws.colTypes(na, guess_max, sheetHasColumnNames); | ||
break; | ||
case STRSXP: | ||
colTypes = cellTypes(as<CharacterVector>(col_types)); | ||
if ((int) colTypes.size() == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be slightly cleaner if you either use std::fill()
or Rcpp::rep()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used std::fill()
. Nice to know about that!
src/XlsxWorkSheet.cpp
Outdated
} | ||
|
||
// Rationalize column names w.r.t. types ----------------------------- | ||
size_t ncol_names = colNames.size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe pull this out into a separate helper function? Then you'll be able to re-use for xls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New helper function = reconcileNames()
.
src/XlsxWorkSheet.h
Outdated
} | ||
XlsxCell xcell = *it; | ||
if (xcell.col() < ncol_) { | ||
CellType type = xcell.type(na, wb_.stringTable(), wb_.dateStyles()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it might be worthwhile to compute and cache the type in the cell constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One reason to NOT do this: when cells are constructed in loadCells()
, we don't know about skipped columns yet. We do the bare minimum: register the cell's existence and its coordinates.
If user supplies col_types
and skips columns, we will never read type (or content) for the associated cells (and I've said as much in the docs).
If user does not supply col_types
, we learn them. But that is also limited to guess_max
rows.
If we cache cell type in the cell constructor, we might do so for a lot of cells that will never need it.
Does this change your opinion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh that's a good point and it does change my opinion
src/XlsxWorkSheet.h
Outdated
if (type > types[xcell.col()]) { | ||
types[xcell.col()] = type; | ||
} | ||
XlsxCell xcell = *it; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will copy the xcell - I think it would be better to rename it
to xcell
and then use (e.g.) xcell->type()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, here and other places.
src/XlsxWorkSheet.h
Outdated
int base = firstRow_->row() + has_col_names; | ||
// we have consulted i rows re: determining col types | ||
int i; | ||
// account for any empty rows between column headers and data start | ||
i = it->row() - base; | ||
// m is the max row number seen so far | ||
int m = it->row(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling these similar things i
and m
feels a bit confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simplified. m
is gone.
tests/testthat/test-missing-values.R
Outdated
## in a trailing empty column WHICH SHOULD BE DROPPED | ||
## in some trailing rows WHICH SHOULD BE DROPPED | ||
out <- read_excel(test_sheet("style-only-cells.xlsx")) | ||
df <- tibble::tibble(var1 = c("val1,1", "val2,1", "val3,1"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indenting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Wrt 4., I think it's ok, but we should explore whether we can do the same thing in readr. I have a vague memory that it's more complicated because you can associate names with types there. |
This is ready for re-review. |
// Needs to compare to actual cell type to give warnings | ||
switch(types[xcell.col()]) { | ||
case CELL_BLANK: | ||
CellType type = xcell->type(na, wb_.stringTable(), wb_.dateStyles()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a pity this diff looks so big, because it's really not. In this section, it's just the addition(s) of CELL_SKIP
. Which never comes up, but it seems I have to include all enum
values in all the switch()
es.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing major, just a few minor tweaks
src/XlsxWorkSheet.cpp
Outdated
for (size_t i = 0; i < colTypes.size(); i++) { | ||
if (colTypes[i] == CELL_BLANK) { | ||
colTypes[i] = CELL_NUMERIC; | ||
} | ||
if (colTypes[i] != CELL_SKIP) { | ||
ncol_noskip++; | ||
} | ||
} | ||
|
||
// Rationalize column names w.r.t. types ----------------------------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can delete this comment now (as the function name should be sufficient)
src/XlsxWorkSheet.h
Outdated
XlsxCell xcell = *it; | ||
if (types[xcell.col()] == CELL_SKIP || xcell.col() >= ncol_) { | ||
it++; | ||
if (types[xcell->col()] == CELL_SKIP || xcell->col() >= ncol_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would all be a little more compact with int j = xcell->col()
src/XlsxWorkSheet.h
Outdated
// row to write into | ||
int row = xcell.row() - base; | ||
int row = xcell->row() - base; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use i
here?
tests/testthat/test-missing-values.R
Outdated
var2 = NA_real_, | ||
var3 = c("aa", "bb", "cc"), | ||
X__1 = NA_real_, | ||
var5 = c(1, 2, 3)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trailing parens should be on its own line
If user wants to skip a column, now we say
"skip"
instead of"blank"
. Created a newCellType
,CELL_SKIP
for this. ExistingCellType
,CELL_BLANK
now really means what it says.We're using
CellType
for 2 different concepts that almost coincide but not quite: cell type and column type. I thought about separating them but it didn't seem worth it. A very recent PR (Addcol_type = "list"
option #256) from a user proposes to implement a"list"
type, where each cell can have the type declared in the Excel. And he, possibly out of necessity, had to separate the cell type from column type. Something to think about.Leading and embedded empty columns are never dropped.
If
col_types
has length one, recycle it to have length = number of columns.More effort to make sense of user-specified
col_names
, if there are also user-specifiedcol_types
. Specifically, if column(s) will be skipped, accept full-lengthcol_names
orcol_names
post-skip. I suspect you might not approve of this?AWKWARD: a few things had to be fixed or improved. But it doesn't seem to make sense to build up a flexible col spec system in readxl, since it should just work like readr (#198). But that would require pulling lots of logic out of readr into a col spec package. Which isn't imminent. 🤔