# 10977 Globus Upload optimization - batch file size lookups #11040
```diff
@@ -0,0 +1,6 @@
+## A new Globus optimization setting
+
+An optimization has been added for the Globus upload workflow, with a corresponding new database setting: `:GlobusBatchLookupSize`
+
+
+See the [Database Settings](https://guides.dataverse.org/en/6.5/installation/config.html#GlobusBatchLookupSize) section of the Guides for more information.
```
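As a rough illustration of what such a batch-size setting buys (everything below is a hypothetical sketch, not code from this PR: the class, method, and batch size are invented), slicing a task's file list into fixed-size batches lets one remote lookup cover many files instead of one call per file:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchLookupSketch {

    // Splits a list of file identifiers into fixed-size batches, so that a
    // single remote size lookup can cover a whole batch instead of making
    // one network call per file.
    static List<List<String>> toBatches(List<String> fileIds, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < fileIds.size(); start += batchSize) {
            batches.add(fileIds.subList(start, Math.min(start + batchSize, fileIds.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        // With a (hypothetical) batch size of 2, five files need three
        // lookup calls instead of five.
        List<String> ids = List.of("f1", "f2", "f3", "f4", "f5");
        System.out.println(toBatches(ids, 2)); // [[f1, f2], [f3, f4], [f5]]
    }
}
```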
```diff
@@ -344,10 +344,20 @@ public List<DataFile> saveAndAddFilesToDataset(DatasetVersion version,
             try {
                 StorageIO<DvObject> dataAccess = DataAccess.getStorageIO(dataFile);
                 //Populate metadata
-                dataAccess.open(DataAccessOption.READ_ACCESS);
-                // (the .open() above makes a remote call to check if
-                // the file exists and obtains its size)
-                confirmedFileSize = dataAccess.getSize();
+
+                // There are direct upload sub-cases where the file size
+                // is already known at this point. For example, direct uploads
+                // to S3 that go through the jsf dataset page. Or the Globus
+                // uploads, where the file sizes are looked up in bulk on
+                // the completion of the remote upload task.
+                if (dataFile.getFilesize() > 0) {
```
> Can we have 0 length files?

> Oh, so that line wasn't really breaking support for 0-size files - it was only potentially forcing Dataverse to make extra lookups of the size of 0-size files unnecessarily...
```diff
+                    confirmedFileSize = dataFile.getFilesize();
+                } else {
+                    dataAccess.open(DataAccessOption.READ_ACCESS);
+                    // (the .open() above makes a remote call to check if
+                    // the file exists and obtains its size)
+                    confirmedFileSize = dataAccess.getSize();
+                }
 
                 // For directly-uploaded files, we will perform the file size
                 // limit and quota checks here. Perform them *again*, in
```
```diff
@@ -362,13 +372,16 @@ public List<DataFile> saveAndAddFilesToDataset(DatasetVersion version,
                 if (fileSizeLimit == null || confirmedFileSize < fileSizeLimit) {
 
-                    //set file size
-                    logger.fine("Setting file size: " + confirmedFileSize);
-                    dataFile.setFilesize(confirmedFileSize);
+                    if (dataFile.getFilesize() < 1) {
```
> < 0 ?

> I did stare at this line last night. If …
```diff
+                        logger.fine("Setting file size: " + confirmedFileSize);
+                        dataFile.setFilesize(confirmedFileSize);
+                    }
 
                     if (dataAccess instanceof S3AccessIO) {
                         ((S3AccessIO<DvObject>) dataAccess).removeTempTag();
                     }
                     savedSuccess = true;
                     logger.info("directly uploaded file successfully saved. file size: "+dataFile.getFilesize());
                 }
             } catch (IOException ioex) {
                 logger.warning("Failed to get file size, storage id, or failed to remove the temp tag on the saved S3 object" + dataFile.getStorageIdentifier() + " ("
```
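Stripped of Dataverse's types, the size-confirmation logic in the first hunk reduces to the sketch below (`confirmSize` and its arguments are invented here for illustration). It also shows why the `> 0` check discussed in the inline comments is harmless for 0-byte files: a cached size of 0 is indistinguishable from "unknown", so such files simply take the redundant but correct remote-lookup path:

```java
import java.util.function.LongSupplier;

public class ConfirmSizeSketch {

    // Returns the cached size when it is already known (> 0); otherwise
    // falls back to the remote lookup. A legitimate 0-byte file also takes
    // the remote path -- extra work, but not incorrect.
    static long confirmSize(long cachedSize, LongSupplier remoteLookup) {
        return (cachedSize > 0) ? cachedSize : remoteLookup.getAsLong();
    }

    public static void main(String[] args) {
        System.out.println(confirmSize(1024, () -> -1)); // 1024: cached size trusted, no lookup
        System.out.println(confirmSize(0, () -> 0));     // 0: remote lookup confirms the empty file
    }
}
```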
**Review discussion**

> This is going to make fileSize a valid param for direct uploads, etc. now too? I've been hesitant to trust the client about this (can I say my files are 10 bytes and get a lot of free storage?). Would it make sense to allow the Globus code to set this, e.g. passing a boolean through the addFiles call on #1092 that determines whether file size is read here? Not sure we need this or that this is the best way, but thought I'd raise the question.
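A minimal sketch of the gating idea raised in this comment, under the assumption that only internal callers (such as the Globus task-completion code) could set the flag; the class, method, and signature here are invented, not code from this PR:

```java
import java.util.OptionalLong;

public class ClientSizeGateSketch {

    // Hypothetical gate: a size reported by the client is honored only when
    // the caller is trusted (e.g. Dataverse's own Globus code); sizes in
    // user-supplied /addFiles JSON would be ignored and looked up instead.
    static OptionalLong acceptedSize(long reportedSize, boolean trustReportedSize) {
        if (trustReportedSize && reportedSize > 0) {
            return OptionalLong.of(reportedSize);
        }
        return OptionalLong.empty(); // caller must fall back to a storage-layer lookup
    }

    public static void main(String[] args) {
        System.out.println(acceptedSize(10, false)); // empty: client claim ignored
        System.out.println(acceptedSize(10, true));  // OptionalLong[10]: trusted internal caller
    }
}
```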
> That's fine with me. To confirm: we'll be allowing Globus to do this, internally, but not accepting these entries in the json passed to `/addFiles`.
>
> Also (somewhat unrelated): it would be fairly easy to implement a very similar mass lookup on an s3 folder inside an `/addFiles` call finalizing a large batch of direct s3 uploads, if that is ever identified as a bottleneck. I never got around to measuring just how much these individual s3 lookups cost, but it has always bothered me a little that we make separate network/http calls for them. (They definitely do not cost anything remotely approaching what I got with Globus and NESE, though.)
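The "mass lookup on an s3 folder" mentioned above could look roughly like the sketch below, using the AWS SDK for Java v2 and assuming a batch's objects share one key prefix (neither of which is established by this PR): a single paginated listing returns every object's size, replacing one HEAD request per file:

```java
import java.util.HashMap;
import java.util.Map;

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.S3Object;

public class S3BulkSizeLookupSketch {

    // Collects the sizes of all objects under one key prefix with a single
    // paginated ListObjectsV2 call, instead of one HEAD request per object.
    static Map<String, Long> sizesUnderPrefix(S3Client s3, String bucket, String prefix) {
        Map<String, Long> sizes = new HashMap<>();
        ListObjectsV2Request request = ListObjectsV2Request.builder()
                .bucket(bucket)
                .prefix(prefix)
                .build();
        for (S3Object obj : s3.listObjectsV2Paginator(request).contents()) {
            sizes.put(obj.key(), obj.size());
        }
        return sizes;
    }
}
```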
> Yes.
>
> Re: S3: makes sense. It is nice to have the size lookup in the S3AccessIO class as long as it is working. (I think we have switched to having one shared AWS client across S3AccessIO instances, which hopefully keeps an HTTP connection open, so hopefully it's not too bad. That said, it wouldn't hurt to see how long it takes when it's thousands of files.)