
Functionality to copy s3 file directly to local disk without table intermediate #401

Closed
LBragg opened this issue Apr 3, 2021 · 10 comments
Labels: enhancement 💡 New feature or request
Milestone: 0.2.0


LBragg commented Apr 3, 2021

Hi. Thanks for the great package. It's very comprehensive.

I am looking for a function equivalent to boto3's download file (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.download_file) , which can also be done via the AWS cli using 'cp' -- copying a file from an s3 bucket to a local file location.

I may have missed something in the docs, any help would be great. Thanks!
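(For context, boto3's download_file streams the object body to a local path in chunks rather than buffering the whole object. A minimal Python sketch of that pattern, where download_to_disk and the file names are illustrative and not part of any real API:)

```python
import io

def download_to_disk(body, dest_path, chunk_size=8192):
    """Stream a file-like response body to disk in fixed-size chunks,
    so at most one chunk is held in memory at a time."""
    with open(dest_path, "wb") as f:
        while True:
            chunk = body.read(chunk_size)
            if not chunk:
                break
            f.write(chunk)

# Usage with an in-memory stand-in for an S3 response body:
download_to_disk(io.BytesIO(b"example object contents"), "/tmp/example_obj.bin")
```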

@adambanker
Member

Thanks for submitting this issue! Unfortunately, we do not currently have a function like that. The only way we currently support downloads is by retrieving the content in binary format and then writing it to a file, as shown here: https://github.com/paws-r/paws/blob/main/examples/s3.R#L66.

I agree that it would be nice to create a function that supports this use case, so we will put this on our to-do list.
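(The linked s3.R example follows a buffer-then-write pattern: fetch the object's full raw content, then write it out. In Python terms — names illustrative — the equivalent is the following, which is why large objects are costly with this approach:)

```python
import io

def download_via_memory(body, dest_path):
    """Read the entire response body into memory, then write it to disk.
    The whole object is held in RAM at once before any byte reaches disk."""
    data = body.read()  # full object buffered in memory
    with open(dest_path, "wb") as f:
        f.write(data)

download_via_memory(io.BytesIO(b"small object"), "/tmp/small_obj.bin")
```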


hadley commented May 19, 2021

I'm interested in this functionality too, and I'd be happy to help implement it. Do you have any existing examples of paws arguments that don't correspond to API arguments?

@adambanker
Member

I added the functionality in #408. It is not yet available to use, but should be soon. That pull request adds the custom download_file function to our SDK generator package, make.paws. The next time we regenerate the paws packages, the function will be added to paws.storage and will therefore be easy to install.

If you would like to use the functionality before then, you can pull the latest version of main and regenerate the packages using the makefile (make install). However, we were planning to regenerate the packages this week anyway, so it should be available soon without that step.


hadley commented May 19, 2021

#408 doesn't really solve the problem, since the data still has to travel through R. An ideal solution would be for libcurl (via curl, via httr) to save the data directly to disk. (I have a dim recollection that this is currently hard or impossible to do correctly in httr, but I'm happy to help resolve that.)
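(A sketch of the save-directly-to-disk idea, in Python rather than R, using a local file:// URL as a stand-in for an S3 endpoint. shutil.copyfileobj moves fixed-size chunks between the response stream and the file, so the full payload never sits in application memory:)

```python
import shutil
import tempfile
import urllib.request

def fetch_to_disk(url, dest_path):
    """Copy a URL's response straight to a file in chunks."""
    with urllib.request.urlopen(url) as resp, open(dest_path, "wb") as f:
        shutil.copyfileobj(resp, f)

# Demo with a local file:// URL standing in for a remote object.
src = tempfile.NamedTemporaryFile(delete=False, suffix=".src")
src.write(b"object contents")
src.close()
fetch_to_disk("file://" + src.name, src.name + ".copy")
```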

@adambanker
Member

That's a good point. We haven't yet looked into a solution that would bypass R, but that does sound better.

I don't believe we have added an argument to a function that is not an API argument. The closest to that is the wrapper function created in #408. As for other customizations, there are examples of custom handlers here.


DyfanJones commented Sep 9, 2021

Hi all,

I took a stab at a possible solution. The approach I took was to pass httr::write_disk to httr::VERB when an output location is given.

Edit: updated the issue() function.
https://github.com/DyfanJones/paws/blob/0bb0487f34d18bcab8304c4068c08fc30eb0fbf8/paws.common/R/net.R#L97-L140

```r
# Issue an HTTP request.
issue <- function(http_request) {
  method <- http_request$method
  url <- build_url(http_request$url)
  headers <- unlist(http_request$header)
  if (http_request$close) {
    headers["Connection"] <- "close"
  }
  body <- http_request$body
  timeout <- httr::config(connecttimeout = http_request$timeout)
  if (is.null(http_request$timeout)) timeout <- NULL

  if (url == "") {
    stop("no url provided")
  }

  # Use httr to stream the response body to disk when an output file is given.
  output <- NULL
  if (!is.null(http_request$output)) {
    output <- httr::write_disk(http_request$output)
  }

  r <- httr::VERB(
    method,
    url = url,
    config = c(httr::add_headers(.headers = headers), output),
    body = body,
    timeout
  )

  response <- HttpResponse(
    status_code = httr::status_code(r),
    header = httr::headers(r),
    content_length = as.integer(httr::headers(r)$`content-length`),
    # Avoid reading the body into R when it has already been written to disk.
    body = if (is.null(http_request$output)) httr::content(r, as = "raw") else raw()
  )

  # Decode gzipped response bodies that are not automatically decompressed
  # by httr/curl.
  if (is_compressed(response)) {
    response <- decompress(response)
  }

  return(response)
}
```

This resulted in modifying s3_download_file to look fairly similar to s3_get_object:

https://github.com/DyfanJones/paws/blob/3c84100ed257fe66d6a20c00a54d15136a438f7c/make.paws/R/custom/s3.R#L114-L128

```r
s3_download_file <- function(Bucket, Key, Filename, IfMatch = NULL,
                             IfModifiedSince = NULL, IfNoneMatch = NULL,
                             IfUnmodifiedSince = NULL, Range = NULL,
                             ResponseCacheControl = NULL,
                             ResponseContentDisposition = NULL,
                             ResponseContentEncoding = NULL,
                             ResponseContentLanguage = NULL,
                             ResponseContentType = NULL, ResponseExpires = NULL,
                             VersionId = NULL, SSECustomerAlgorithm = NULL,
                             SSECustomerKey = NULL, SSECustomerKeyMD5 = NULL,
                             RequestPayer = NULL, PartNumber = NULL,
                             ExpectedBucketOwner = NULL) {
  op <- new_operation(
    name = "GetObject",
    http_method = "GET",
    http_path = "/{Bucket}/{Key+}",
    paginator = list()
  )
  input <- .s3$get_object_input(
    Bucket = Bucket, IfMatch = IfMatch, IfModifiedSince = IfModifiedSince,
    IfNoneMatch = IfNoneMatch, IfUnmodifiedSince = IfUnmodifiedSince,
    Key = Key, Range = Range, ResponseCacheControl = ResponseCacheControl,
    ResponseContentDisposition = ResponseContentDisposition,
    ResponseContentEncoding = ResponseContentEncoding,
    ResponseContentLanguage = ResponseContentLanguage,
    ResponseContentType = ResponseContentType,
    ResponseExpires = ResponseExpires, VersionId = VersionId,
    SSECustomerAlgorithm = SSECustomerAlgorithm,
    SSECustomerKey = SSECustomerKey, SSECustomerKeyMD5 = SSECustomerKeyMD5,
    RequestPayer = RequestPayer, PartNumber = PartNumber,
    ExpectedBucketOwner = ExpectedBucketOwner
  )
  output <- .s3$get_object_output()
  config <- get_config()
  svc <- .s3$service(config)
  request <- new_request(svc, op, input, output, Filename)
  response <- send_request(request)
  return(list())
}
```

I am not 100% sure whether this still returns the content to R, as I don't know the lower levels of httr. @hadley, would you be able to shed any light on this?

If this is along the right lines, I will raise a PR.


hadley commented Sep 9, 2021

@DyfanJones I've already forgotten how httr works, but the chances are it'll be much easier with httr2

@DyfanJones
Member

> @DyfanJones I've already forgotten how httr works, but the chances are it'll be much easier with httr2

@hadley no worries. From a quick look at httr, it should call curl::curl_fetch_disk (https://github.com/r-lib/httr/blob/80cb625117629105874d58a234f061b458eefbe6/R/write-function.R#L81-L85) when httr::write_disk is passed to httr::VERB.

httr2 looks really promising :) however I will let the package authors (@davidkretch, @adambanker) make the decision to upgrade from httr 😄 .

I believe this method prevents the data from going through R and leaves the processing to libcurl. I will check the unit tests and raise a PR 😄

@DyfanJones
Member

PR (#458) has been merged and will be in the next paws release.

@DyfanJones DyfanJones added this to the 0.2.0 milestone Aug 16, 2022
@DyfanJones
Member

Closing this ticket, as paws.storage 0.2.0 has been released on CRAN.
