Refactor xlsx col type handling #261

jennybc · 2017-02-12T05:11:27Z

1. If user wants to skip a column, now we say "skip" instead of "blank". Created a new CellType, CELL_SKIP for this. Existing CellType, CELL_BLANK now really means what it says.

   We're using CellType for 2 different concepts that almost coincide but not quite: cell type and column type. I thought about separating them but it didn't seem worth it. A very recent PR (Add col_type = "list" option #256) from a user proposes to implement a "list" type, where each cell can have the type declared in the Excel. And he, possibly out of necessity, had to separate the cell type from column type. Something to think about.
2. Leading and embedded empty columns are never dropped.
3. If col_types has length one, recycle it to have length = number of columns.
4. More effort to make sense of user-specified col_names, if there are also user-specified col_types. Specifically, if column(s) will be skipped, accept full-length col_names or col_names post-skip. I suspect you might not approve of this?

AWKWARD: a few things had to be fixed or improved. But it doesn't seem to make sense to build up a flexible col spec system in readxl, since it should just work like readr (#198). But that would require pulling lots of logic out of readr into a col spec package. Which isn't imminent. 🤔

hadley

Overall strategy looks good.

I'll review again once you've renamed a few variables along the lines of my suggestions; I think that will make it easier for me to understand the code flow.

I'd also seriously consider computing and caching the cell type when it's constructed.

hadley · 2017-02-12T16:04:47Z

src/CellType.h

 enum CellType {
  CELL_BLANK,
  CELL_DATE,
  CELL_NUMERIC,
+  CELL_SKIP,


I think it's coincidental that the names were previously in alphabetical order. I think CELL_SKIP should go next to CELL_BLANK.

I just got an education on the importance of the order here. Short version: CELL_BLANK must be first (= default type), but CELL_SKIP can come next.
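
A minimal sketch of why the enum order can matter, not the readxl source. It assumes that a column's type is guessed by keeping the largest enumerator seen in that column (as the `type > types[...]` comparison elsewhere in this diff suggests) and that the type vector starts out value-initialized. The full enumerator list, the `Cell` struct and the name `guessColTypes` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed ordering: CELL_BLANK is 0 (the default), CELL_SKIP sits next to it.
enum CellType { CELL_BLANK, CELL_SKIP, CELL_DATE, CELL_NUMERIC, CELL_TEXT };

// Hypothetical stand-in for a cell: just a column index and a cell type.
struct Cell {
  int col;
  CellType type;
};

// Hypothetical helper: one guessed type per column, keeping the "largest" seen.
std::vector<CellType> guessColTypes(const std::vector<Cell>& cells,
                                    std::size_t ncol) {
  // Value-initialization sets every element to 0, i.e. CELL_BLANK, which is
  // why the blank type has to be the first enumerator.
  std::vector<CellType> types(ncol);
  for (std::size_t k = 0; k < cells.size(); ++k) {
    const Cell& cell = cells[k];
    assert(cell.col >= 0 && static_cast<std::size_t>(cell.col) < ncol);
    // Promote the column type only when this cell's type ranks higher.
    if (cell.type > types[cell.col]) {
      types[cell.col] = cell.type;
    }
  }
  return types;
}
```

Under these assumptions a column with no cells at all stays CELL_BLANK without any extra code, and CELL_SKIP can sit right after it because it is never produced by an individual cell.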

hadley · 2017-02-12T16:05:21Z

src/CellType.h

    } else if (type == "text") {
      types.push_back(CELL_TEXT);
    } else {
-      Rcpp::warning("Unknown type '%s' at position %i. Using text instead.",
+      Rcpp::warning("Unknown type '%s' at position %i. Using 'text' instead.",


It might be better to just stop here to clarify that error handling happens on the R side.
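
A rough sketch of the "stop instead of warn" behaviour, written without Rcpp so it stands alone. Inside the package the error would presumably be raised with Rcpp::stop(); here a std::invalid_argument plays that role. The set of accepted strings, the 1-based position in the message and the plain std::string signature are assumptions, not the actual cellTypes() code.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

enum CellType { CELL_BLANK, CELL_SKIP, CELL_DATE, CELL_NUMERIC, CELL_TEXT };

// Sketch: translate user-supplied type names, failing on anything unknown
// instead of silently falling back to text.
std::vector<CellType> cellTypes(const std::vector<std::string>& x) {
  std::vector<CellType> types;
  types.reserve(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    const std::string& type = x[i];
    if (type == "skip") {
      types.push_back(CELL_SKIP);
    } else if (type == "date") {
      types.push_back(CELL_DATE);
    } else if (type == "numeric") {
      types.push_back(CELL_NUMERIC);
    } else if (type == "text") {
      types.push_back(CELL_TEXT);
    } else {
      // Assumed message format; the position is reported 1-based.
      throw std::invalid_argument("Unknown type '" + type + "' at position " +
                                  std::to_string(i + 1));
    }
  }
  return types;
}
```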

hadley · 2017-02-12T16:06:38Z

src/XlsWorkSheet.h

        case CELL_NUMERIC:
          switch(type) {
          case CELL_BLANK:
            REAL(col)[i] = NA_REAL;
            break;
          case CELL_NUMERIC:
+          case CELL_SKIP:


Shouldn't this go with CELL_BLANK? (Not that it really matters)

hadley · 2017-02-12T16:08:09Z

src/XlsxWorkSheet.cpp

  std::vector<CellType> colTypes;
  switch(TYPEOF(col_types)) {
  case NILSXP:
    colTypes = ws.colTypes(na, guess_max, sheetHasColumnNames);
    break;
  case STRSXP:
    colTypes = cellTypes(as<CharacterVector>(col_types));
+    if ((int) colTypes.size() == 1) {


I think this would be slightly cleaner if you either use std::fill() or Rcpp::rep()

Used std::fill(). Nice to know about that!
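
For readers who have not met std::fill(), a minimal sketch of recycling a length-one type vector. The helper name `recycleTypes` and its signature are hypothetical; only the use of std::fill() comes from this thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

enum CellType { CELL_BLANK, CELL_SKIP, CELL_DATE, CELL_NUMERIC, CELL_TEXT };

// Hypothetical helper: if exactly one type was given, repeat it for every
// column; otherwise hand the vector back unchanged.
std::vector<CellType> recycleTypes(std::vector<CellType> types,
                                   std::size_t ncol) {
  if (types.size() == 1) {
    const CellType type = types[0];  // copy before the vector is resized
    types.resize(ncol);
    std::fill(types.begin(), types.end(), type);
  }
  return types;
}
```

A vector of any other length is returned as is, so a length mismatch can still be reported by whatever check runs afterwards.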

hadley · 2017-02-12T16:10:08Z

src/XlsxWorkSheet.cpp

+  }
+
+  // Rationalize column names w.r.t. types -----------------------------
+  size_t ncol_names = colNames.size();


Maybe pull this out into a separate helper function? Then you'll be able to re-use for xls.

New helper function = reconcileNames().
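
The rule from the PR description (accept either full-length col_names or col_names post-skip) can be sketched as below. This is a guess at the logic, not the real reconcileNames(): the signature, the use of std::string vectors instead of Rcpp types, and the error text are all assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

enum CellType { CELL_BLANK, CELL_SKIP, CELL_DATE, CELL_NUMERIC, CELL_TEXT };

// Sketch: return one name per column that is NOT skipped.
std::vector<std::string> reconcileNames(const std::vector<std::string>& names,
                                        const std::vector<CellType>& types) {
  const std::size_t nskip = static_cast<std::size_t>(
      std::count(types.begin(), types.end(), CELL_SKIP));
  const std::size_t ncol_noskip = types.size() - nskip;

  // Case 1: names were already given post-skip, so use them as they are.
  if (names.size() == ncol_noskip) {
    return names;
  }
  // Anything other than full length is an error at this point.
  if (names.size() != types.size()) {
    throw std::invalid_argument(
        "col_names must match all columns or all non-skipped columns");
  }
  // Case 2: full-length names, so drop those that belong to skipped columns.
  std::vector<std::string> out;
  out.reserve(ncol_noskip);
  for (std::size_t i = 0; i < types.size(); ++i) {
    if (types[i] != CELL_SKIP) {
      out.push_back(names[i]);
    }
  }
  return out;
}
```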

hadley · 2017-02-12T16:12:05Z

src/XlsxWorkSheet.h

-          }
+      XlsxCell xcell = *it;
+      if (xcell.col() < ncol_) {
+        CellType type = xcell.type(na, wb_.stringTable(), wb_.dateStyles());


I wonder if it might be worthwhile to compute and cache the type in the cell constructor?

One reason to NOT do this: when cells are constructed in loadCells(), we don't know about skipped columns yet. We do the bare minimum: register the cell's existence and its coordinates.

If user supplies col_types and skips columns, we will never read type (or content) for the associated cells (and I've said as much in the docs).

If user does not supply col_types, we learn them. But that is also limited to guess_max rows.

If we cache cell type in the cell constructor, we might do so for a lot of cells that will never need it.

Does this change your opinion?

Oh that's a good point and it does change my opinion
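
A toy illustration of the argument above, assuming cells only record their coordinates and raw content at construction and work out their type when asked. The `Cell` struct, the `typeCalls` counter, the `readTypes` helper and the "empty means blank, otherwise text" rule are all invented for this sketch and do not reflect the real XlsxCell.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum CellType { CELL_BLANK, CELL_SKIP, CELL_DATE, CELL_NUMERIC, CELL_TEXT };

// Hypothetical cell: construction stores coordinates and raw content only.
struct Cell {
  int row;
  int col;
  std::string raw;
  static int typeCalls;  // counts how often a type was actually computed

  CellType type() const {
    ++typeCalls;
    // Toy rule for illustration only: empty is blank, anything else is text.
    return raw.empty() ? CELL_BLANK : CELL_TEXT;
  }
};
int Cell::typeCalls = 0;

// Hypothetical reader: asks for a type only in columns that are not skipped.
std::vector<CellType> readTypes(const std::vector<Cell>& cells,
                                const std::vector<CellType>& colTypes) {
  std::vector<CellType> seen;
  for (std::vector<Cell>::const_iterator xcell = cells.begin();
       xcell != cells.end(); ++xcell) {
    const std::size_t j = static_cast<std::size_t>(xcell->col);
    if (j >= colTypes.size() || colTypes[j] == CELL_SKIP) {
      continue;  // the type of a skipped cell is never computed
    }
    seen.push_back(xcell->type());
  }
  return seen;
}
```

With two of four cells in a skipped column, the counter ends at two, whereas computing the type in the constructor would have done the work for all four.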

hadley · 2017-02-12T16:13:22Z

src/XlsxWorkSheet.h

-          if (type > types[xcell.col()]) {
-            types[xcell.col()] = type;
-          }
+      XlsxCell xcell = *it;


This will copy the xcell - I think it would be better to rename the iterator (it) to xcell and then use (e.g.) xcell->type()

Done, here and other places.
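
A small sketch of the difference between the two loop styles. The `XlsxCell` here is a stripped-down stand-in with a copy counter added purely to make the copies visible; the function names are hypothetical.

```cpp
#include <vector>

// Stand-in for the real class, with a counter that records every copy.
struct XlsxCell {
  int row_;
  int col_;
  static int copies;

  XlsxCell(int row, int col) : row_(row), col_(col) {}
  XlsxCell(const XlsxCell& other) : row_(other.row_), col_(other.col_) {
    ++copies;
  }
  int col() const { return col_; }
};
int XlsxCell::copies = 0;

// Old style: dereference the iterator into a local, which copies the cell.
int sumColsByCopy(const std::vector<XlsxCell>& cells) {
  int total = 0;
  for (std::vector<XlsxCell>::const_iterator it = cells.begin();
       it != cells.end(); ++it) {
    XlsxCell xcell = *it;  // one copy per cell
    total += xcell.col();
  }
  return total;
}

// Suggested style: name the iterator xcell and go through it with ->.
int sumColsByIterator(const std::vector<XlsxCell>& cells) {
  int total = 0;
  for (std::vector<XlsxCell>::const_iterator xcell = cells.begin();
       xcell != cells.end(); ++xcell) {
    total += xcell->col();  // no copy
  }
  return total;
}
```

Both functions return the same sum; only the first one increments the copy counter, once per cell.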

hadley · 2017-02-12T16:13:50Z

src/XlsxWorkSheet.h

    int base = firstRow_->row() + has_col_names;
    // we have consulted i rows re: determining col types
    int i;
    // account for any empty rows between column headers and data start
    i = it->row() - base;
+    // m is the max row number seen so far
+    int m = it->row();


Calling these similar things i and m feels a bit confusing.

Simplified. m is gone.

hadley · 2017-02-12T16:18:17Z

tests/testthat/test-missing-values.R

+  ## in a trailing empty column WHICH SHOULD BE DROPPED
+  ## in some trailing rows WHICH SHOULD BE DROPPED
+  out <- read_excel(test_sheet("style-only-cells.xlsx"))
+  df <- tibble::tibble(var1 = c("val1,1", "val2,1", "val3,1"),


hadley · 2017-02-12T16:20:47Z

Wrt 4., I think it's ok, but we should explore whether we can do the same thing in readr. I have a vague memory that it's more complicated because you can associate names with types there.

jennybc · 2017-02-13T08:01:05Z

This is ready for re-review.

jennybc · 2017-02-13T08:07:20Z

src/XlsxWorkSheet.h

-        // Needs to compare to actual cell type to give warnings
-        switch(types[xcell.col()]) {
-        case CELL_BLANK:
+      CellType type = xcell->type(na, wb_.stringTable(), wb_.dateStyles());


It's a pity this diff looks so big, because it's really not. In this section, it's just the addition(s) of CELL_SKIP. Which never comes up, but it seems I have to include all enum values in all the switch()es.

hadley

Nothing major, just a few minor tweaks

hadley · 2017-02-13T13:41:11Z

src/XlsxWorkSheet.cpp

  for (size_t i = 0; i < colTypes.size(); i++) {
    if (colTypes[i] == CELL_BLANK) {
      colTypes[i] = CELL_NUMERIC;
    }
-    if (colTypes[i] != CELL_SKIP) {
-      ncol_noskip++;
-    }
  }

  // Rationalize column names w.r.t. types -----------------------------


I think you can delete this comment now (as the function name should be sufficient)

hadley · 2017-02-13T13:43:45Z

src/XlsxWorkSheet.h

-      XlsxCell xcell = *it;
-      if (types[xcell.col()] == CELL_SKIP || xcell.col() >= ncol_) {
-        it++;
+      if (types[xcell->col()] == CELL_SKIP || xcell->col() >= ncol_) {


This would all be a little more compact with int j = xcell->col()

hadley · 2017-02-13T13:44:27Z

src/XlsxWorkSheet.h

      // row to write into
-      int row = xcell.row() - base;
+      int row = xcell->row() - base;


Can you use i here?

hadley · 2017-02-13T13:44:55Z

tests/testthat/test-missing-values.R

+    var2 = NA_real_,
+    var3 = c("aa", "bb", "cc"),
+    X__1 = NA_real_,
+    var5 = c(1, 2, 3))


The trailing paren should be on its own line

tklebel and others added 20 commits February 11, 2017 11:41

add new param n for guessing col_type

d8481ab

Remove all column name processing from colTypes()

2c8e6c2

Move colName and colType comparison out of readCols()

c4b12b0

Stop dropping blank columns; fixes #157

9143dcb

Move tests into more logical places

81643b6

More specific error when col_names or col_types has wrong length

3cf7dc9

Rcpp churn

bb567be

If no data, call it blank right here

0c7e90f

Simplify col_type learning loop

4657868

Sketch my grand plans

54213fd

Check and test col_types

4eeddf2

Simplify readCols() loop

d100d5f

Recycle col_types if length 1; fixes #127

624b02b

Deprecate col_types = "blank"; fixes #260

8459ff9

Rationalize joint processing of col_names + col_types; fixes #81

6ab2d36

Ignore docs for xls format

65cd24b

Fix/update xlsx_col_types()

b5c55ce

Even though I'm not sure what it's for.

Delete benchmarks.cpp, home of the shifty, vestigial parseXml()and co…

80905cf

…untRows()

Add bullets to NEWS

09562b1

README and pkgdown

4a90dec

jennybc requested a review from hadley February 12, 2017 06:07

jennybc mentioned this pull request Feb 12, 2017

Improve error messages #119

Closed

Groom error messages and tests thereof

0d57735

hadley reviewed Feb 12, 2017

jennybc added 5 commits February 12, 2017 08:49

Wording in NEWS

3c10a4b

Stop for unknown type

bba7655

Move CELL_SKIP; add comment re: enum order

3dab676

Recycle length-one col_type with std::fill()

5f7f908

Indenting

b200038

jennybc added 4 commits February 12, 2017 20:37

Helper function for reconciling col names

158bbc4

Helper function for recycling col types

7d63e93

Simplify col_type learning loop; avoid cell copy

ee18b43

Simplify and avoid cell copy in readCols() too

5189042

jennybc commented Feb 13, 2017

It's guess_max, not max_guess

642e23c

hadley approved these changes Feb 13, 2017

gergness mentioned this pull request Feb 13, 2017

Add col_type = "list" option #256

Closed

jennybc added 2 commits February 13, 2017 08:16

Delete comment, test code style

c45a0a7

Cleaner with i, j

6bb4ee3

jennybc merged commit 4a34a17 into tidyverse:master Feb 13, 2017

jennybc deleted the rework-col-types branch February 13, 2017 16:46
