From 58bc8b3142c7480f910f20224aacc3aa1340b5bc Mon Sep 17 00:00:00 2001 From: Jan Range Date: Mon, 26 Aug 2024 09:25:16 +0200 Subject: [PATCH 1/4] add instructions for async streaming to S3 --- .../developers/s3-direct-upload-api.rst | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/doc/sphinx-guides/source/developers/s3-direct-upload-api.rst b/doc/sphinx-guides/source/developers/s3-direct-upload-api.rst index 33b8e434e6e..3ea524c67bf 100644 --- a/doc/sphinx-guides/source/developers/s3-direct-upload-api.rst +++ b/doc/sphinx-guides/source/developers/s3-direct-upload-api.rst @@ -102,7 +102,24 @@ If the client is unable to complete the multipart upload, it should call the abo curl -X DELETE "$SERVER_URL/api/datasets/mpload?..." - + +.. _direct-async-upload: + +Asynchronous Direct Upload +-------------------------- + +When uploading files or chunks asynchronously to S3 via `PUT`, ensure that the `Content-Length` header is set to the size of the file or chunk. If `Content-Length` is not specified, the server will return a `501 Not Implemented` error, causing the upload to fail. + +Example `curl` command: + +.. code-block:: bash + + curl -X PUT "<presigned-url>" \ + -H "Content-Length: 5242880" \ + -T <file> + +Replace ``<presigned-url>`` with your actual pre-signed S3 URL and ``<file>`` with the file or chunk you are uploading. The ``Content-Length`` value should match the size of the file or chunk in bytes. Please note that other required headers mentioned above should also be included in the request. These have been omitted from the example for clarity. + ..
_direct-add-to-dataset-api: Adding the Uploaded File to the Dataset From 200d1d1ae056c531054a0c8c93553a6cfd2b597b Mon Sep 17 00:00:00 2001 From: Jan Range Date: Mon, 26 Aug 2024 09:35:02 +0200 Subject: [PATCH 2/4] use `streaming` instead of `uploading` --- doc/sphinx-guides/source/developers/s3-direct-upload-api.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/doc/sphinx-guides/source/developers/s3-direct-upload-api.rst b/doc/sphinx-guides/source/developers/s3-direct-upload-api.rst index 3ea524c67bf..efe7efe9fec 100644 --- a/doc/sphinx-guides/source/developers/s3-direct-upload-api.rst +++ b/doc/sphinx-guides/source/developers/s3-direct-upload-api.rst @@ -109,6 +109,7 @@ Asynchronous Direct Upload -------------------------- When uploading files or chunks asynchronously to S3 via `PUT`, ensure that the `Content-Length` header is set to the size of the file or chunk. If `Content-Length` is not specified, the server will return a `501 Not Implemented` error, causing the upload to fail. +When streaming files or chunks asynchronously to S3 via `PUT`, ensure that the `Content-Length` header is set to the size of the file or chunk. If `Content-Length` is not specified, the server will return a `501 Not Implemented` error, causing the upload to fail.
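The requirement described in the two patches above can also be sketched outside of curl. The following Python snippet is illustrative only (the pre-signed URL and file path are placeholders, and nothing is sent here): it builds a ``PUT`` request whose ``Content-Length`` header is set explicitly from the file size, so a streaming client cannot fall back to chunked transfer encoding, which S3 rejects with ``501 Not Implemented``.

```python
import os
import urllib.request

def build_presigned_put(presigned_url, path):
    """Build (but do not send) a PUT request for a pre-signed S3 URL.

    Content-Length is computed from the file size; without it, a
    file-like body would be sent with chunked transfer encoding,
    which S3 answers with 501 Not Implemented.
    """
    size = os.path.getsize(path)
    body = open(path, "rb")  # file-like body is streamed, not buffered
    return urllib.request.Request(
        presigned_url,
        data=body,
        headers={"Content-Length": str(size)},
        method="PUT",
    )

# To actually send: urllib.request.urlopen(build_presigned_put(url, "chunk1.bin"))
```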
Example `curl` command: From a0db6c03c03c97ea98180503ce1cbbeab2ac5568 Mon Sep 17 00:00:00 2001 From: Jan Range Date: Mon, 26 Aug 2024 16:08:19 +0200 Subject: [PATCH 3/4] convert section into notes - Added section "Upload to S3" that was missing - Removed trailing whitespaces --- .../developers/s3-direct-upload-api.rst | 49 +++++++------------ 1 file changed, 19 insertions(+), 30 deletions(-) diff --git a/doc/sphinx-guides/source/developers/s3-direct-upload-api.rst b/doc/sphinx-guides/source/developers/s3-direct-upload-api.rst index efe7efe9fec..8afe5b39ca4 100644 --- a/doc/sphinx-guides/source/developers/s3-direct-upload-api.rst +++ b/doc/sphinx-guides/source/developers/s3-direct-upload-api.rst @@ -18,7 +18,7 @@ Direct upload involves a series of three activities, each involving interacting This API is only enabled when a Dataset is configured with a data store supporting direct S3 upload. Administrators should be aware that partial transfers, where a client starts uploading the file/parts of the file and does not contact the server to complete/cancel the transfer, will result in data stored in S3 that is not referenced in the Dataverse installation (e.g. should be considered temporary and deleted.) - + Requesting Direct Upload of a DataFile -------------------------------------- To initiate a transfer of a file to S3, make a call to the Dataverse installation indicating the size of the file to upload. The response will include a pre-signed URL(s) that allow the client to transfer the file. Pre-signed URLs include a short-lived token authorizing the action represented by the URL. 
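Client code typically branches on which of the two response forms it received from the ``uploadurls`` call. A hedged Python sketch follows; the field names ``url``, ``urls``, and ``partSize`` are assumptions based on the (omitted) example responses and should be verified against an actual response from your Dataverse installation.

```python
def plan_upload(response_json, file_size):
    """Decide between single-part and multipart upload from the
    uploadurls response. Field names ('url', 'urls', 'partSize')
    are assumptions, not a confirmed schema."""
    if "url" in response_json:
        # Single pre-signed URL: one PUT of the whole file.
        return {"mode": "single", "urls": [response_json["url"]]}
    part_size = response_json["partSize"]
    n_parts = -(-file_size // part_size)  # ceiling division
    # Part URLs are keyed by part number, starting at "1".
    return {
        "mode": "multipart",
        "partSize": part_size,
        "urls": [response_json["urls"][str(i)] for i in range(1, n_parts + 1)],
    }
```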
@@ -29,7 +29,7 @@ To initiate a transfer of a file to S3, make a call to the Dataverse installatio export SERVER_URL=https://demo.dataverse.org export PERSISTENT_IDENTIFIER=doi:10.5072/FK27U7YBV export SIZE=1000000000 - + curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/datasets/:persistentId/uploadurls?persistentId=$PERSISTENT_IDENTIFIER&size=$SIZE" The response to this call, assuming direct uploads are enabled, will be one of two forms: @@ -71,7 +71,12 @@ The call will return a 400 (BAD REQUEST) response if the file is larger than wha In the example responses above, the URLs, which are very long, have been omitted. These URLs reference the S3 server and the specific object identifier that will be used, starting with, for example, https://demo-dataverse-bucket.s3.amazonaws.com/10.5072/FK2FOQPJS/177883b000e-49cedef268ac?... -The client must then use the URL(s) to PUT the file, or if the file is larger than the specified partSize, parts of the file. +.. _direct-upload-to-s3: + +Upload files to S3 +------------------ + +The client must then use the URL(s) to PUT the file, or if the file is larger than the specified partSize, parts of the file. In the single part case, only one call to the supplied URL is required: @@ -88,38 +93,22 @@ Or, if you have disabled S3 tagging (see :ref:`s3-tagging`), you should omit the Note that without the ``-i`` flag, you should not expect any output from the command above. With the ``-i`` flag, you should expect to see a "200 OK" response. In the multipart case, the client must send each part and collect the 'eTag' responses from the server. The calls for this are the same as the one for the single part case except that each call should send a slice of the total file, with the last part containing the remaining bytes. -The responses from the S3 server for these calls will include the 'eTag' for the uploaded part. +The responses from the S3 server for these calls will include the 'eTag' for the uploaded part. 
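The multipart bookkeeping just described — fixed-size slices with the remaining bytes in the last part, and one eTag collected per part for the later 'complete' call — can be sketched in Python (illustrative only; the eTag values stand in for what the S3 server returns):

```python
import json

def part_ranges(file_size, part_size):
    """Yield (part_number, offset, length) slices of the file;
    the last part carries the remaining bytes."""
    number, offset = 1, 0
    while offset < file_size:
        length = min(part_size, file_size - offset)
        yield number, offset, length
        number += 1
        offset += length

def completion_payload(etags):
    """Build the JSON body for the 'complete' call from the collected
    eTags, keyed by part number as in the documented example."""
    return json.dumps({str(i + 1): tag for i, tag in enumerate(etags)})
```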
To successfully conclude the multipart upload, the client must call the 'complete' URI, sending a json object including the part eTags: .. code-block:: bash curl -X PUT "$SERVER_URL/api/datasets/mpload?..." -d '{"1":"","2":"","3":"","4":"","5":""}' - -If the client is unable to complete the multipart upload, it should call the abort URL: - -.. code-block:: bash - - curl -X DELETE "$SERVER_URL/api/datasets/mpload?..." - - -.. _direct-async-upload: -Asynchronous Direct Upload --------------------------- - -When uploading files or chunks asynchronously to S3 via `PUT`, ensure that the `Content-Length` header is set to the size of the file or chunk. If `Content-Length` is not specified, the server will return a `501 Not Implemented` error, causing the upload to fail. -When streaming files or chunks asynchronously to S3 via `PUT`, ensure that the `Content-Length` header is set to the size of the file or chunk. If `Content-Length` is not specified, the server will return a `501 Not Implemented` error, causing the upload to fail. - -Example `curl` command: +If the client is unable to complete the multipart upload, it should call the abort URL: .. code-block:: bash - curl -X PUT "" \ - -H "Content-Length: 5242880" \ - -T + curl -X DELETE "$SERVER_URL/api/datasets/mpload?..." -Replace ``""`` with your actual pre-signed S3 URL and ```` with the file or chunk you are uploading. The``Content-Length`` value should match the size of the file or chunk in bytes. Please note that other required headers mentioned above should also be included in the request. These have been omitted from the example for clarity. +.. note:: + If you encounter an ``HTTP 501 Not Implemented`` error, ensure the ``Content-Length`` header is correctly set to the file or chunk size. This issue may arise when streaming files or chunks asynchronously to S3 via ``PUT`` requests, particularly if the library or tool you're using doesn't set the ``Content-Length`` header automatically. .. 
_direct-add-to-dataset-api: @@ -132,10 +121,10 @@ jsonData normally includes information such as a file description, tags, provena * "storageIdentifier" - String, as specified in prior calls * "fileName" - String * "mimeType" - String -* fixity/checksum: either: +* fixity/checksum: either: * "md5Hash" - String with MD5 hash value, or - * "checksum" - Json Object with "@type" field specifying the algorithm used and "@value" field with the value from that algorithm, both Strings + * "checksum" - Json Object with "@type" field specifying the algorithm used and "@value" field with the value from that algorithm, both Strings The allowed checksum algorithms are defined by the edu.harvard.iq.dataverse.DataFile.CheckSumType class and currently include MD5, SHA-1, SHA-256, and SHA-512 @@ -147,7 +136,7 @@ The allowed checksum algorithms are defined by the edu.harvard.iq.dataverse.Data export JSON_DATA="{'description':'My description.','directoryLabel':'data/subdir1','categories':['Data'], 'restrict':'false', 'storageIdentifier':'s3://demo-dataverse-bucket:176e28068b0-1c3f80357c42', 'fileName':'file1.txt', 'mimeType':'text/plain', 'checksum': {'@type': 'SHA-1', '@value': '123456'}}" curl -X POST -H "X-Dataverse-key: $API_TOKEN" "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_IDENTIFIER" -F "jsonData=$JSON_DATA" - + Note that this API call can be used independently of the others, e.g. supporting use cases in which the file already exists in S3/has been uploaded via some out-of-band method. Enabling out-of-band uploads is described at :ref:`file-storage` in the Configuration Guide. With current S3 stores the object identifier must be in the correct bucket for the store, include the PID authority/identifier of the parent dataset, and be guaranteed unique, and the supplied storage identifier must be prefaced with the store identifier used in the Dataverse installation, as with the internally generated examples above. 
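Assembling ``jsonData`` with a computed fixity value can be sketched in Python (illustrative only; SHA-1 is one of the allowed algorithms listed above, and only the required fields are shown):

```python
import hashlib
import json

def make_json_data(content, storage_identifier, file_name, mime_type):
    """Build a minimal jsonData payload for the add-file call, with a
    SHA-1 checksum in the "@type"/"@value" form shown in the guide."""
    sha1 = hashlib.sha1(content).hexdigest()
    return json.dumps({
        "storageIdentifier": storage_identifier,
        "fileName": file_name,
        "mimeType": mime_type,
        "checksum": {"@type": "SHA-1", "@value": sha1},
    })
```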
@@ -191,10 +180,10 @@ jsonData normally includes information such as a file description, tags, provena * "storageIdentifier" - String, as specified in prior calls * "fileName" - String * "mimeType" - String -* fixity/checksum: either: +* fixity/checksum: either: * "md5Hash" - String with MD5 hash value, or - * "checksum" - Json Object with "@type" field specifying the algorithm used and "@value" field with the value from that algorithm, both Strings + * "checksum" - Json Object with "@type" field specifying the algorithm used and "@value" field with the value from that algorithm, both Strings The allowed checksum algorithms are defined by the edu.harvard.iq.dataverse.DataFile.CheckSumType class and currently include MD5, SHA-1, SHA-256, and SHA-512. Note that the API call does not validate that the file matches the hash value supplied. If a Dataverse instance is configured to validate file fixity hashes at publication time, a mismatch would be caught at that time and cause publication to fail. @@ -207,7 +196,7 @@ Note that the API call does not validate that the file matches the hash value su export JSON_DATA='{"description":"My description.","directoryLabel":"data/subdir1","categories":["Data"], "restrict":"false", "forceReplace":"true", "storageIdentifier":"s3://demo-dataverse-bucket:176e28068b0-1c3f80357c42", "fileName":"file1.txt", "mimeType":"text/plain", "checksum": {"@type": "SHA-1", "@value": "123456"}}' curl -X POST -H "X-Dataverse-key: $API_TOKEN" "$SERVER_URL/api/files/$FILE_IDENTIFIER/replace" -F "jsonData=$JSON_DATA" - + Note that this API call can be used independently of the others, e.g. supporting use cases in which the file already exists in S3/has been uploaded via some out-of-band method. Enabling out-of-band uploads is described at :ref:`file-storage` in the Configuration Guide. 
With current S3 stores the object identifier must be in the correct bucket for the store, include the PID authority/identifier of the parent dataset, and be guaranteed unique, and the supplied storage identifier must be prefaced with the store identifier used in the Dataverse installation, as with the internally generated examples above. From a476adae76bc55fa22f622aec781e00e7c0eff96 Mon Sep 17 00:00:00 2001 From: Philip Durbin Date: Mon, 26 Aug 2024 15:04:04 -0400 Subject: [PATCH 4/4] use Title Case for headings #10798 --- doc/sphinx-guides/source/developers/s3-direct-upload-api.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/sphinx-guides/source/developers/s3-direct-upload-api.rst b/doc/sphinx-guides/source/developers/s3-direct-upload-api.rst index 8afe5b39ca4..a8f87f13375 100644 --- a/doc/sphinx-guides/source/developers/s3-direct-upload-api.rst +++ b/doc/sphinx-guides/source/developers/s3-direct-upload-api.rst @@ -73,7 +73,7 @@ In the example responses above, the URLs, which are very long, have been omitted .. _direct-upload-to-s3: -Upload files to S3 +Upload Files to S3 ------------------ The client must then use the URL(s) to PUT the file, or if the file is larger than the specified partSize, parts of the file.
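The storage-identifier constraints above lend themselves to a client-side sanity check before calling the add/replace APIs. A hedged Python sketch follows — the precise rules are store-specific, so this only validates the ``<store id>://<bucket>:<object id>`` shape seen in the examples, not bucket correctness or uniqueness:

```python
import re

# Shape from the examples: s3://demo-dataverse-bucket:176e28068b0-1c3f80357c42
_STORAGE_ID = re.compile(r"^(?P<store>[a-z0-9]+)://(?P<bucket>[^:]+):(?P<object>.+)$")

def parse_storage_identifier(identifier):
    """Split a supplied storage identifier into store, bucket, and object
    id; raise ValueError if it does not match the documented shape."""
    m = _STORAGE_ID.match(identifier)
    if not m:
        raise ValueError(f"unexpected storage identifier: {identifier!r}")
    return m.group("store"), m.group("bucket"), m.group("object")
```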