
Functionality to copy s3 file directly to local disk without table intermediate #401

Closed
LBragg opened this issue Apr 3, 2021 · 10 comments
Labels: enhancement 💡 New feature or request
Milestone: 0.2.0


LBragg commented Apr 3, 2021

Hi. Thanks for the great package. It's very comprehensive.

I am looking for a function equivalent to boto3's download file (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.download_file) , which can also be done via the AWS cli using 'cp' -- copying a file from an s3 bucket to a local file location.

I may have missed something in the docs, any help would be great. Thanks!
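(For context, boto3's download_file streams the object body to a local path in chunks rather than buffering the whole object. A minimal Python sketch of that pattern, where download_to_disk and the file names are illustrative and not part of any real API:)

```python
import io

def download_to_disk(body, dest_path, chunk_size=8192):
    """Stream a file-like response body to disk in fixed-size chunks,
    so at most one chunk is held in memory at a time."""
    with open(dest_path, "wb") as f:
        while True:
            chunk = body.read(chunk_size)
            if not chunk:
                break
            f.write(chunk)

# Usage with an in-memory stand-in for an S3 response body:
download_to_disk(io.BytesIO(b"example object contents"), "/tmp/example_obj.bin")
```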

@adambanker
Member

Thanks for submitting this issue! Unfortunately, we do not currently have a function like that. The only way we currently support downloads is by retrieving the content in binary format and then writing it to a file, as shown here: https://github.com/paws-r/paws/blob/main/examples/s3.R#L66.

I agree that it would be nice to create a function that supports this use case, so we will put this on our to-do list.
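(The linked s3.R example follows a buffer-then-write pattern: fetch the object's full raw content, then write it out. In Python terms — names illustrative — the equivalent is the following, which is why large objects are costly with this approach:)

```python
import io

def download_via_memory(body, dest_path):
    """Read the entire response body into memory, then write it to disk.
    The whole object is held in RAM at once before any byte reaches disk."""
    data = body.read()  # full object buffered in memory
    with open(dest_path, "wb") as f:
        f.write(data)

download_via_memory(io.BytesIO(b"small object"), "/tmp/small_obj.bin")
```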


hadley commented May 19, 2021

I'm interested in this functionality too, and I'd be happy to help implement it. Do you have any existing examples of paws arguments that don't correspond to API arguments?

@adambanker
Member

I added the functionality in #408. It is not yet available to use, but should be soon. That pull request adds the custom download_file function to our SDK generator package, make.paws. The next time we regenerate the paws packages, the function will be added to paws.storage and will therefore be easy to install.

If you would like to use the functionality before then, you can pull the latest version of main and regenerate the packages using the makefile (make install). However, we were planning to regenerate the packages this week anyway, so it should be available soon without that step.


hadley commented May 19, 2021

#408 doesn't really solve the problem, since the data still has to travel through R. An ideal solution would be for libcurl (via curl, via httr) to save the data directly to disk. (I have a dim recollection that this is currently hard or impossible to do correctly in httr, but I'm happy to help resolve that.)
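(A sketch of the save-directly-to-disk idea, in Python rather than R, using a local file:// URL as a stand-in for an S3 endpoint. shutil.copyfileobj moves fixed-size chunks between the response stream and the file, so the full payload never sits in application memory:)

```python
import shutil
import tempfile
import urllib.request

def fetch_to_disk(url, dest_path):
    """Copy a URL's response straight to a file in chunks."""
    with urllib.request.urlopen(url) as resp, open(dest_path, "wb") as f:
        shutil.copyfileobj(resp, f)

# Demo with a local file:// URL standing in for a remote object.
src = tempfile.NamedTemporaryFile(delete=False, suffix=".src")
src.write(b"object contents")
src.close()
fetch_to_disk("file://" + src.name, src.name + ".copy")
```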

@adambanker
Member

That's a good point. We haven't yet looked into a solution that would bypass R, but that does sound better.

I don't believe we have added an argument to a function that is not an API argument. The closest to that is the wrapper function created in #408. As for other customizations, there are examples of custom handlers here.


DyfanJones commented Sep 9, 2021

Hi all,

I took a stab at a possible solution. The approach I took was to pass httr::write_disk to httr::VERB when an output location is given.

Edit: updated the issue() function.
https://github.com/DyfanJones/paws/blob/0bb0487f34d18bcab8304c4068c08fc30eb0fbf8/paws.common/R/net.R#L97-L140

```r
# Issue an HTTP request.
issue <- function(http_request) {
  method <- http_request$method
  url <- build_url(http_request$url)
  headers <- unlist(http_request$header)
  if (http_request$close) {
    headers["Connection"] <- "close"
  }
  body <- http_request$body
  timeout <- httr::config(connecttimeout = http_request$timeout)
  if (is.null(http_request$timeout)) timeout <- NULL

  if (url == "") {
    stop("no url provided")
  }

  # Use httr to stream the response body to disk when an output file is given.
  output <- NULL
  if (!is.null(http_request$output)) {
    output <- httr::write_disk(http_request$output)
  }

  r <- httr::VERB(
    method,
    url = url,
    config = c(httr::add_headers(.headers = headers), output),
    body = body,
    timeout
  )

  response <- HttpResponse(
    status_code = httr::status_code(r),
    header = httr::headers(r),
    content_length = as.integer(httr::headers(r)$`content-length`),
    # Avoid reading the body into R when it has already been written to disk.
    body = if (is.null(http_request$output)) httr::content(r, as = "raw") else raw()
  )

  # Decode gzipped response bodies that are not automatically decompressed
  # by httr/curl.
  if (is_compressed(response)) {
    response <- decompress(response)
  }

  return(response)
}
```

This resulted in modifying s3_download_file to look fairly similar to s3_get_object:

https://github.com/DyfanJones/paws/blob/3c84100ed257fe66d6a20c00a54d15136a438f7c/make.paws/R/custom/s3.R#L114-L128

```r
s3_download_file <- function(Bucket, Key, Filename, IfMatch = NULL,
                             IfModifiedSince = NULL, IfNoneMatch = NULL,
                             IfUnmodifiedSince = NULL, Range = NULL,
                             ResponseCacheControl = NULL,
                             ResponseContentDisposition = NULL,
                             ResponseContentEncoding = NULL,
                             ResponseContentLanguage = NULL,
                             ResponseContentType = NULL, ResponseExpires = NULL,
                             VersionId = NULL, SSECustomerAlgorithm = NULL,
                             SSECustomerKey = NULL, SSECustomerKeyMD5 = NULL,
                             RequestPayer = NULL, PartNumber = NULL,
                             ExpectedBucketOwner = NULL) {
  op <- new_operation(
    name = "GetObject",
    http_method = "GET",
    http_path = "/{Bucket}/{Key+}",
    paginator = list()
  )
  input <- .s3$get_object_input(
    Bucket = Bucket, IfMatch = IfMatch, IfModifiedSince = IfModifiedSince,
    IfNoneMatch = IfNoneMatch, IfUnmodifiedSince = IfUnmodifiedSince,
    Key = Key, Range = Range, ResponseCacheControl = ResponseCacheControl,
    ResponseContentDisposition = ResponseContentDisposition,
    ResponseContentEncoding = ResponseContentEncoding,
    ResponseContentLanguage = ResponseContentLanguage,
    ResponseContentType = ResponseContentType,
    ResponseExpires = ResponseExpires, VersionId = VersionId,
    SSECustomerAlgorithm = SSECustomerAlgorithm,
    SSECustomerKey = SSECustomerKey, SSECustomerKeyMD5 = SSECustomerKeyMD5,
    RequestPayer = RequestPayer, PartNumber = PartNumber,
    ExpectedBucketOwner = ExpectedBucketOwner
  )
  output <- .s3$get_object_output()
  config <- get_config()
  svc <- .s3$service(config)
  request <- new_request(svc, op, input, output, Filename)
  response <- send_request(request)
  return(list())
}
```

I am not 100% sure whether this still returns the content to R, as I don't know the lower levels of httr. @hadley, would you be able to shed any light on this?

If this is along the right lines, I will raise a PR.


hadley commented Sep 9, 2021

@DyfanJones I've already forgotten how httr works, but the chances are it'll be much easier with httr2

@DyfanJones
Member

> @DyfanJones I've already forgotten how httr works, but the chances are it'll be much easier with httr2

@hadley no worries. From a quick look at httr, it should call curl::curl_fetch_disk (https://github.com/r-lib/httr/blob/80cb625117629105874d58a234f061b458eefbe6/R/write-function.R#L81-L85) when httr::write_disk is passed to httr::VERB.

httr2 looks really promising :) however I will let the package authors (@davidkretch, @adambanker) make the decision to upgrade from httr 😄 .

I believe this method prevents the data from going through R and leaves the processing to libcurl. I will check the unit tests and raise a PR 😄

@DyfanJones
Member

PR (#458) has been merged and will be in the next paws release.

@DyfanJones DyfanJones added this to the 0.2.0 milestone Aug 16, 2022
@DyfanJones
Member

Closing this ticket, as paws.storage 0.2.0 has been released on CRAN.
