Functionality to copy s3 file directly to local disk without table intermediate #401

Hi. Thanks for the great package. It's very comprehensive.

I am looking for a function equivalent to boto3's `download_file` (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.download_file), which can also be done via the AWS CLI using `cp` -- copying a file from an S3 bucket to a local file location.

I may have missed something in the docs; any help would be great. Thanks!
Thanks for submitting this issue! Unfortunately, we do not currently have a function like that. The only way we currently support downloads is by retrieving the content in binary format and then writing it to a file, as shown here: https://github.com/paws-r/paws/blob/main/examples/s3.R#L66. I agree that it would be nice to create a function that supports this use case, so we will put it on our to-do list.
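For reference, the pattern in that example looks roughly like this (a minimal sketch; the bucket and key names are placeholders):

```r
library(paws)

# Fetch the object into memory; the body comes back as a raw vector in R.
s3 <- paws::s3()
obj <- s3$get_object(Bucket = "my-bucket", Key = "path/to/file.csv")

# Write those bytes out to a local file.
writeBin(obj$Body, "file.csv")
```

The drawback is that the whole object is materialized in R's memory before anything reaches the disk.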
I'm interested in this functionality too, and I'd be happy to help implement it. Do you have any existing examples of paws arguments that don't correspond to API arguments?
I added the functionality in #408, but it is not available to use yet - it should be soon. That pull request adds the custom `s3_download_file()` wrapper function. If you would like to use the functionality before then, you can pull the latest version of paws from GitHub.
#408 doesn't really solve the problem, since the data still has to travel through R — an ideal solution would be for libcurl (via curl, via httr) to save the data directly to disk. (I have a dim recollection that this is currently hard/impossible to do correctly in httr, but I'm happy to help resolve that.)
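In httr's public API, that streaming behaviour is what `write_disk()` provides. A minimal sketch of the mechanics (the URL is a placeholder, and a real paws request would also need to be signed):

```r
library(httr)

# write_disk() tells libcurl (via curl) to stream the response body
# straight to a file, so the payload is never held in R as an object.
resp <- GET(
  "https://my-bucket.s3.amazonaws.com/path/to/file.csv",
  write_disk("file.csv", overwrite = TRUE)
)
stop_for_status(resp)
```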
That's a good point. We haven't yet looked into a solution that would bypass R, but that does sound better. I don't believe we have added an argument to a function that is not an API argument; the closest is the wrapper function created in #408. As for other examples of customizations, there are examples of custom handlers here.
Hi all, I took a stab at a possible solution. The approach I took was to pass the output file path through to the `issue()` function and let `httr::write_disk()` stream the response body to disk. Edit: updated.

```r
# Issue an HTTP request.
issue <- function(http_request) {
  method <- http_request$method
  url <- build_url(http_request$url)
  headers <- unlist(http_request$header)
  if (http_request$close) {
    headers["Connection"] <- "close"
  }
  body <- http_request$body
  timeout <- httr::config(connecttimeout = http_request$timeout)
  if (is.null(http_request$timeout)) timeout <- NULL
  if (url == "") {
    stop("no url provided")
  }

  # Utilize httr to write the response body straight to disk when an
  # output file has been supplied on the request.
  output <- NULL
  if (!is.null(http_request$output)) {
    output <- httr::write_disk(http_request$output)
  }

  r <- httr::VERB(
    method,
    url = url,
    config = c(httr::add_headers(.headers = headers), output),
    body = body,
    timeout
  )

  response <- HttpResponse(
    status_code = httr::status_code(r),
    header = httr::headers(r),
    content_length = as.integer(httr::headers(r)$`content-length`),
    # Prevent reading the data back into R when it was written to disk.
    body = if (is.null(http_request$output)) httr::content(r, as = "raw") else raw()
  )

  # Decode gzipped response bodies that are not automatically decompressed
  # by httr/curl.
  if (is_compressed(response)) {
    response <- decompress(response)
  }
  return(response)
}
```
This resulted in modifying `s3_download_file`:

```r
s3_download_file <- function(Bucket, Key, Filename, IfMatch = NULL,
                             IfModifiedSince = NULL, IfNoneMatch = NULL,
                             IfUnmodifiedSince = NULL, Range = NULL,
                             ResponseCacheControl = NULL,
                             ResponseContentDisposition = NULL,
                             ResponseContentEncoding = NULL,
                             ResponseContentLanguage = NULL,
                             ResponseContentType = NULL,
                             ResponseExpires = NULL, VersionId = NULL,
                             SSECustomerAlgorithm = NULL,
                             SSECustomerKey = NULL, SSECustomerKeyMD5 = NULL,
                             RequestPayer = NULL, PartNumber = NULL,
                             ExpectedBucketOwner = NULL) {
  op <- new_operation(
    name = "GetObject",
    http_method = "GET",
    http_path = "/{Bucket}/{Key+}",
    paginator = list()
  )
  input <- .s3$get_object_input(
    Bucket = Bucket, IfMatch = IfMatch, IfModifiedSince = IfModifiedSince,
    IfNoneMatch = IfNoneMatch, IfUnmodifiedSince = IfUnmodifiedSince,
    Key = Key, Range = Range, ResponseCacheControl = ResponseCacheControl,
    ResponseContentDisposition = ResponseContentDisposition,
    ResponseContentEncoding = ResponseContentEncoding,
    ResponseContentLanguage = ResponseContentLanguage,
    ResponseContentType = ResponseContentType,
    ResponseExpires = ResponseExpires, VersionId = VersionId,
    SSECustomerAlgorithm = SSECustomerAlgorithm,
    SSECustomerKey = SSECustomerKey, SSECustomerKeyMD5 = SSECustomerKeyMD5,
    RequestPayer = RequestPayer, PartNumber = PartNumber,
    ExpectedBucketOwner = ExpectedBucketOwner
  )
  output <- .s3$get_object_output()
  config <- get_config()
  svc <- .s3$service(config)
  # Pass the destination path through so issue() can hand it to
  # httr::write_disk().
  request <- new_request(svc, op, input, output, Filename)
  response <- send_request(request)
  return(list())
}
```

I am not 100% sure if this still returns the content back to R, as I don't know the lower levels of httr/curl well enough. If this is along the right lines, I will raise a PR.
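One way to sanity-check that concern (a minimal sketch using a throwaway URL, not paws code) is to look at what httr keeps on the response object when `write_disk()` is in play:

```r
library(httr)

# With write_disk(), libcurl streams the body to the file; the response
# object holds only the path, never the payload as a raw vector.
tmp <- tempfile()
r <- GET("https://httpbin.org/bytes/1024", write_disk(tmp))

file.exists(tmp)  # TRUE: body written straight to disk
class(r$content)  # "path": a file path, not the data itself
```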
@DyfanJones I've already forgotten how httr works, but the chances are it'll be much easier with httr2.
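For comparison, a minimal httr2 sketch of the same idea (placeholder URL; a real paws request would still need signing):

```r
library(httr2)

# In httr2, streaming to disk is a single argument: req_perform(path = ...)
# hands the destination file to curl, so the body never enters R.
resp <- request("https://my-bucket.s3.amazonaws.com/path/to/file.csv") |>
  req_perform(path = "file.csv")
```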
@hadley no worries. From a quick look, I believe this method prevents the data from going through R and leaves the processing to libcurl. I will check the unit tests and raise a PR 😄
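At the lowest level, the curl package exposes this directly; a minimal sketch (placeholder URL) of the primitive that httr builds on:

```r
library(curl)

# curl_fetch_disk() has libcurl write the body to the file itself;
# the returned object carries the path, not the data.
res <- curl_fetch_disk("https://example.com/big-file.bin", path = "big-file.bin")
res$content  # the local file path
```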
PR (#458) has been merged and will be in the next paws release.
Closing this ticket, as paws.storage 0.2.0 has been released to CRAN.
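For anyone landing here later, usage should look something like this (a sketch assuming the released interface matches the wrapper shown above; bucket, key, and file names are placeholders):

```r
library(paws)

# Download an S3 object straight to a local file, without a table
# (or in-memory raw vector) intermediate.
s3 <- paws::s3()
s3$download_file(
  Bucket   = "my-bucket",
  Key      = "path/to/file.csv",
  Filename = "file.csv"
)
```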