Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow user to specify col type is just date or time (vs full datetime) #504

Open
billdenney opened this issue Aug 31, 2018 · 4 comments
Open
Labels

Comments

@billdenney
Copy link

readxl/src/ColSpec.h

Lines 129 to 134 in 5fbe997

// Date times: 14-22, 27-36, 45-47, 50-58, 71-81 (inclusive)
if ((id >= 14 && id <= 22) ||
(id >= 27 && id <= 36) ||
(id >= 45 && id <= 47) ||
(id >= 50 && id <= 58) ||
(id >= 71 && id <= 81))

In the above lines of code, the numFmt values, in most cases, specify dates without times. For example:

  • numFmtId 14: mm-dd-yy only has date components, and numFmtId 15 as well
  • 16 and 17 are imprecise dates without all components (16 is missing year; 17 is missing day)
  • 18-21 are time-only formats with no dates
  • 22 is a date-time

All of the above, and their equivalents in other languages, should be returned to the R user with the correct date precision.

For the formats that do not include times, they should be returned as a class Date object while formats that do include times should return as a class POSIXct object. Returning everything as a POSIXct object gives an inaccurate picture of the precision.

Returning more than what is visually presented (especially the time-only formats showing as being on the day 1899-12-31) misrepresents the available data.

The representation of more data than are provided is also related to tidyverse/lubridate#690.

@billdenney billdenney changed the title numFmts Are Not All DateTimes numFmts 12-22 Are Not All DateTimes Aug 31, 2018
@billdenney
Copy link
Author

billdenney commented Sep 1, 2018

As I've been thinking about this more, it may be simpler to use a heuristic like the following:

  • If all values in a column are >=0 and <1, assume that they are time-only values.
    • This heuristic would cause a bug if all values in the column where for the date of 1899-12-31; that seems sufficiently rare compared to a time-only value. It should probably issue a warning for this reason.
    • Time-only values seem best represented to me as a difftime or lubridate period class.
  • If all values in a column are integers, make them date columns (not date-time). This should cause no issue.
  • If any value in the column is >=1 or <0 and any value is not an integer, make it a date-time column.

@jennybc
Copy link
Member

jennybc commented Dec 14, 2018

This is tied up with other issue clusters around column typing, col spec, formats.

Returning everything as a POSIXct object gives an inaccurate picture of the precision. Returning more than what is visually presented (especially the time-only formats showing as being on the day 1899-12-31) misrepresents the available data.

I know what you're getting at, but this isn't technically true. Excel stores all of this as a floating point, serial date time. Full stop. The formats only control what is presented to the user visually. You can switch between all the formats listed above and it does not change the numeric value stored for a cell.

But yes it would be nice for a user to be able to specify they expect a date, with no time, for example. We can't guess this automatically because this would actually throw away data. From my experience in general, I will say that lots of people are deeply confused about and not intentionally managing their cell formats. So guessing col type based on format will cause new problems.

@jennybc jennybc changed the title numFmts 12-22 Are Not All DateTimes Allow user to specify col type is just date or time (vs full datetime) Dec 14, 2018
@billdenney
Copy link
Author

@jennybc, I know that Excel stores everything as a floating point, serial number, but the file specification provides an indication of intent via the format (as defined on page 1777 pdf page 1787 of the standard).

From my experience in general, I will say that lots of people are deeply confused about and not intentionally managing their cell formats. So guessing col type based on format will cause new problems.

My experience is the same as yours, but what I see is that most people expect what is read into R (or more exactly what they are sending me to read into R) is what they see in Excel. If they see a floating point value represented by number format 18, 19, 20, 21, 45, 46, or maybe 47, they think that I will receive something that is a time without a date.

A few things that occur to me for brainstorming ways to handle formats:

  • Do number to string conversion based on the format.
    • With that, number formats that are explicitly just re-representations of numbers (0-4, 9-11, 37-40, and 48) would come to R as numbers and everything else would be formatted as in Excel? Then, the R user could convert to what they want based on the text.
  • Allow the user an option for how to interpret formatting by format code (though that requires some deep knowledge by the user and some very flexible format options in readxl).
  • Provide an option not to interpret number formats and have all columns return with an attribute of the format value.

@cbrnr
Copy link

cbrnr commented Nov 30, 2022

I'm sorry for necroposting, but I wanted to add that currently, col_types=c("date") will not return a date but a datetime. Even the import dialog in RStudio explicitly lists "Date" when clicking on a column header in the preview, but it will still generate a datetime column. Sure it's no problem to convert it with e.g. as.Date() after importing, but it seems confusing (a bug?) so I thought I'd mention it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants