Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor xlsx col type handling #261

Merged
merged 33 commits into from
Feb 13, 2017
Merged

Refactor xlsx col type handling #261

merged 33 commits into from
Feb 13, 2017

Conversation

jennybc
Copy link
Member

@jennybc jennybc commented Feb 12, 2017

  1. If user wants to skip a column, now we say "skip" instead of "blank". Created a new CellType, CELL_SKIP for this. Existing CellType, CELL_BLANK now really means what it says.

    We're using CellType for 2 different concepts that almost coincide but not quite: cell type and column type. I thought about separating them but it didn't seem worth it. A very recent PR (Add col_type = "list" option #256) from a user proposes to implement a "list" type, where each cell can have the type declared in the Excel. And he, possibly out of necessity, had to separate the cell type from column type. Something to think about.

  2. Leading and embedded empty columns are never dropped.

  3. If col_types has length one, recycle it to have length = number of columns.

  4. More effort to make sense of user-specified col_names, if there are also user-specified col_types. Specifically, if column(s) will be skipped, accept full-length col_names or col_names post-skip. I suspect you might not approve of this?

AWKWARD: a few things had to be fixed or improved. But it doesn't seem to make sense to build up a flexible col spec system in readxl, since it should just work like readr (#198). But that would require pulling lots of logic out of readr into a col spec package. Which isn't imminent. 🤔

@jennybc jennybc requested a review from hadley February 12, 2017 06:07
@jennybc jennybc mentioned this pull request Feb 12, 2017
Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall strategy looks good.

I'll review again once you've renamed a few variables along my suggestions — I think that will make it easier for me to understand the code flow.

I'd also seriously consider computing and caching the cell type when it's constructed.

src/CellType.h Outdated
enum CellType {
CELL_BLANK,
CELL_DATE,
CELL_NUMERIC,
CELL_SKIP,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's coincidental that the names were previously in alphabetical order. I think CELL_SKIP should go next to CELL_BLANK.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just got an education on the importance of the order here. Short version: CELL_BLANK must be first (= default type), but CELL_SKIP can come next.

src/CellType.h Outdated
} else if (type == "text") {
types.push_back(CELL_TEXT);
} else {
Rcpp::warning("Unknown type '%s' at position %i. Using text instead.",
Rcpp::warning("Unknown type '%s' at position %i. Using 'text' instead.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to just stop here to clarify that error handling happens on the R side.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

case CELL_NUMERIC:
switch(type) {
case CELL_BLANK:
REAL(col)[i] = NA_REAL;
break;
case CELL_NUMERIC:
case CELL_SKIP:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this go with CELL_BLANK? (Not that it really matters)

std::vector<CellType> colTypes;
switch(TYPEOF(col_types)) {
case NILSXP:
colTypes = ws.colTypes(na, guess_max, sheetHasColumnNames);
break;
case STRSXP:
colTypes = cellTypes(as<CharacterVector>(col_types));
if ((int) colTypes.size() == 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be slightly cleaner if you either use std::fill() or Rcpp::rep()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used std::fill(). Nice to know about that!

}

// Rationalize column names w.r.t. types -----------------------------
size_t ncol_names = colNames.size();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe pull this out into a separate helper function? Then you'll be able to re-use for xls.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New helper function = reconcileNames().

}
XlsxCell xcell = *it;
if (xcell.col() < ncol_) {
CellType type = xcell.type(na, wb_.stringTable(), wb_.dateStyles());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it might be worthwhile to compute and cache the type in the cell constructor?

Copy link
Member Author

@jennybc jennybc Feb 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One reason to NOT do this: when cells are constructed in loadCells(), we don't know about skipped columns yet. We do the bare minimum: register the cell's existence and its coordinates.

If user supplies col_types and skips columns, we will never read type (or content) for the associated cells (and I've said as much in the docs).

If user does not supply col_types, we learn them. But that is also limited to guess_max rows.

If we cache cell type in the cell constructor, we might do so for a lot of cells that will never need it.

Does this change your opinion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that's a good point and it does change my opinion

if (type > types[xcell.col()]) {
types[xcell.col()] = type;
}
XlsxCell xcell = *it;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will copy the xcell - I think it would be better to rename it to xcell and then use (e.g.) xcell->type()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, here and other places.

int base = firstRow_->row() + has_col_names;
// we have consulted i rows re: determining col types
int i;
// account for any empty rows between column headers and data start
i = it->row() - base;
// m is the max row number seen so far
int m = it->row();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling these similar things i and m feels a bit confusing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified. m is gone.

## in a trailing empty column WHICH SHOULD BE DROPPED
## in some trailing rows WHICH SHOULD BE DROPPED
out <- read_excel(test_sheet("style-only-cells.xlsx"))
df <- tibble::tibble(var1 = c("val1,1", "val2,1", "val3,1"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indenting

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@hadley
Copy link
Member

hadley commented Feb 12, 2017

Wrt 4., I think it's ok, but we should explore whether we can do the same thing in readr. I have a vague memory that it's more complicated because you can associate names with types there.

@jennybc
Copy link
Member Author

jennybc commented Feb 13, 2017

This is ready for re-review.

// Needs to compare to actual cell type to give warnings
switch(types[xcell.col()]) {
case CELL_BLANK:
CellType type = xcell->type(na, wb_.stringTable(), wb_.dateStyles());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a pity this diff looks so big, because it's really not. In this section, it's just the addition(s) of CELL_SKIP. Which never comes up, but it seems I have to include all enum values in all the switch()es.

Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing major, just a few minor tweaks

for (size_t i = 0; i < colTypes.size(); i++) {
if (colTypes[i] == CELL_BLANK) {
colTypes[i] = CELL_NUMERIC;
}
if (colTypes[i] != CELL_SKIP) {
ncol_noskip++;
}
}

// Rationalize column names w.r.t. types -----------------------------
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can delete this comment now (as the function name should be sufficient)

XlsxCell xcell = *it;
if (types[xcell.col()] == CELL_SKIP || xcell.col() >= ncol_) {
it++;
if (types[xcell->col()] == CELL_SKIP || xcell->col() >= ncol_) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would all be a little more compact with int j = xcell->col()

// row to write into
int row = xcell.row() - base;
int row = xcell->row() - base;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use i here?

var2 = NA_real_,
var3 = c("aa", "bb", "cc"),
X__1 = NA_real_,
var5 = c(1, 2, 3))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing parens should be on its own line

@jennybc jennybc merged commit 4a34a17 into tidyverse:master Feb 13, 2017
@jennybc jennybc deleted the rework-col-types branch February 13, 2017 16:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants