Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_ext is mispecified #350

Closed
chainsawriot opened this issue Sep 7, 2023 · 2 comments
Closed

get_ext is mispecified #350

chainsawriot opened this issue Sep 7, 2023 · 2 comments

Comments

@chainsawriot
Copy link
Collaborator

chainsawriot commented Sep 7, 2023

rio/R/utils.R

Lines 11 to 30 in f0198a3

get_ext <- function(file) {
if (!is.character(file)) {
stop("'file' is not a string")
}
if (!grepl("^http.*://", file)) {
fmt <- tools::file_ext(file)
} else if (grepl("^http.*://", file)) {
parsed <- strsplit(strsplit(file, "?", fixed = TRUE)[[1]][1], "/", fixed = TRUE)[[1]]
file <- parsed[length(parsed)]
fmt <- tools::file_ext(file)
get_type(fmt)
}
if (file == "clipboard") {
return("clipboard")
} else if (fmt == "") {
stop("'file' has no extension", call. = FALSE)
} else {
return(tolower(fmt))
}
}

The function actually is not actually doing what #169 and the documentation ("A characters string containing a file type recognized by rio.") said and what @billdenney requested. Actually, "file type" is confusing. Let's call this format (as in the argument of import). The main issue is that get_type() is just running inside get_ext() without assigning it to anything.

To give an example:

rio::get_ext("hello.arff")
#> [1] "arff"
rio::get_ext("hello.matlab")
#> [1] "matlab"

rio::get_ext("hello.yaml")
#> [1] "yaml"
rio::get_ext("hello.yml")
#> [1] "yml"

Created on 2023-09-07 with reprex v2.0.2

It is almost the same as tools::file_ext, just with an additional care of URL.

Also a consequence to this, get_ext("twotables.htm") actually return "htm". And therefore, this bunch of check for html wont work

rio/R/import_list.R

Lines 115 to 130 in f0198a3

if (!get_ext(file) %in% c("html", "xlsx", "xls", "zip")) {
which <- 1
whichnames <- NULL
}
## getting list of `whichnames`
if (get_ext(file) == "html") {
.check_pkg_availability("xml2")
tables <- xml2::xml_find_all(xml2::read_html(unclass(file)), ".//table")
if (missing(which)) {
which <- seq_along(tables)
}
whichnames <- vapply(xml2::xml_attrs(tables[which]),
function(x) if ("class" %in% names(x)) x["class"] else "",
FUN.VALUE = character(1))
names(which) <- whichnames
}

temp_htm <- tempfile(fileext = ".htm")
download.file("https://github.com/gesistsa/rio/blob/main/tests/testdata/twotables.html", temp_htm, quiet = TRUE)

dat <- rio::import_list(temp_htm) ## only one table is imported

## failed
testthat::expect_true(identical(dim(dat[[2]]), dim(iris)))
#> Error in dat[[2]]: subscript out of bounds

Created on 2023-09-08 with reprex v2.0.2

May I ask you @billdenney : What did you really want in the first place? You want it to be like tools::file_ext() (I get confused by the function name get_ext()) or to return format.

@billdenney
Copy link
Contributor

@chainsawriot, That's a good question, and I understand the complexity you're asking about.

My main goal was the slight extension of what is done in tools::file_ext(). But, if we had something that could detect the real file format, from your example, returning html for an htm file extension, I see a benefit of that, too. I think that your idea to normalize file extension with expected file format ("htm" becomes "html", "yml" becomes "yaml") would be more beneficial.

@chainsawriot
Copy link
Collaborator Author

chainsawriot commented Sep 8, 2023

TODO

  • Create a user-facing function get_info() that queries the internal data to get complete information (ext, format, format_name, input_function, export_function) of file input
  • Keep get_ext() by wrapping get_info(), something like get_info(file)$ext.

Need to coordinate with #313 #351

chainsawriot added a commit that referenced this issue Sep 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants