10977 Globus Upload optimization - batch file size lookups #11040

Merged: 12 commits, Dec 2, 2024
6 changes: 6 additions & 0 deletions doc/release-notes/10977-globus-filesize-lookup.md
@@ -0,0 +1,6 @@
## A new Globus optimization setting

An optimization has been added for the Globus upload workflow, with a corresponding new database setting: `:GlobusBatchLookupSize`


See the [Database Settings](https://guides.dataverse.org/en/6.5/installation/config.html#GlobusBatchLookupSize) section of the Guides for more information.
7 changes: 7 additions & 0 deletions doc/sphinx-guides/source/installation/config.rst
@@ -4849,6 +4849,13 @@ The URL where the `dataverse-globus <https://github.com/scholarsportal/dataverse

The interval in seconds between Dataverse calls to Globus to check on upload progress. Defaults to 50 seconds (or to 10 minutes, when the ``globus-use-experimental-async-framework`` feature flag is enabled). See :ref:`globus-support` for details.

.. _:GlobusBatchLookupSize:

:GlobusBatchLookupSize
++++++++++++++++++++++

In the initial implementation, when files were added to the dataset upon completion of a Globus upload task, Dataverse made a separate Globus API call to look up the size of every new file. This proved to be a significant bottleneck at Harvard Dataverse once users started transferring batches of many thousands of files (something made possible by the Globus improvements in v6.4). An optimized lookup mechanism was added in response: the Globus Service makes a single listing API call on the entire remote folder, then populates the file sizes for all the new file entries before passing them to the Ingest service. However, this approach may actually slow things down when the Globus folder for the dataset already contains thousands of files and only a small number of new files are being added. To address this, the minimum number of files in a batch for which the folder-listing method is used is configurable via this setting. If not set, it will default to 50 (a completely arbitrary number). Setting it to 0 means the batch lookup is always used for Globus uploads; setting it to some very large number disables it completely. This was made a database setting, as opposed to a JVM option, in order to make it configurable in real time.
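To make the threshold semantics concrete, here is a minimal, self-contained sketch. This is not the actual Dataverse code; the class and method names are invented for illustration, and only the default of 50 and the "at least the threshold" comparison are taken from the implementation in this PR.

public class GlobusBatchLookupThresholdDemo {

    // Default used when the :GlobusBatchLookupSize setting is not set
    private static final int DEFAULT_GLOBUS_BATCH_LOOKUP_SIZE = 50;

    // A single folder-listing call is used only when the number of new files
    // in the upload batch is at least the configured threshold; otherwise the
    // file sizes are looked up one by one.
    static boolean useBatchLookup(int newFilesInBatch, Integer configuredSetting) {
        int threshold = (configuredSetting != null) ? configuredSetting : DEFAULT_GLOBUS_BATCH_LOOKUP_SIZE;
        return newFilesInBatch >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(useBatchLookup(10, null));                // false: small batch, per-file lookups
        System.out.println(useBatchLookup(5000, null));              // true:  large batch, one listing call
        System.out.println(useBatchLookup(1, 0));                    // true:  0 always uses the batch lookup
        System.out.println(useBatchLookup(5000, Integer.MAX_VALUE)); // false: a very large value disables it
    }
}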

:GlobusSingleFileTransfer
+++++++++++++++++++++++++

@@ -215,7 +215,7 @@ public long retrieveSizeFromMedia() {
JsonArray dataArray = responseJson.getJsonArray("DATA");
if (dataArray != null && dataArray.size() != 0) {
//File found
return (long) responseJson.getJsonArray("DATA").getJsonObject(0).getJsonNumber("size").longValueExact();
return (long) dataArray.getJsonObject(0).getJsonNumber("size").longValueExact();
}
} else {
logger.warning("Response from " + get.getURI().toString() + " was "
@@ -138,7 +138,8 @@ public class AddReplaceFileHelper{
private String newStorageIdentifier; // step 30
private String newCheckSum; // step 30
private ChecksumType newCheckSumType; //step 30

private Long suppliedFileSize = null;

// -- Optional
private DataFile fileToReplace; // step 25

@@ -610,11 +611,14 @@ private boolean runAddReplacePhase1(Dataset owner,
return false;

}
if(optionalFileParams != null) {
if(optionalFileParams.hasCheckSum()) {
newCheckSum = optionalFileParams.getCheckSum();
newCheckSumType = optionalFileParams.getCheckSumType();
}
if (optionalFileParams != null) {
if (optionalFileParams.hasCheckSum()) {
newCheckSum = optionalFileParams.getCheckSum();
newCheckSumType = optionalFileParams.getCheckSumType();
}
if (optionalFileParams.hasFileSize()) {
Member (review comment):
This is going to make fileSize a valid param for direct uploads, etc. now too? I've been hesitant to trust the client about this (can I say my files are 10 bytes and get a lot of free storage?). Would it make sense to allow the Globus code to set this, e.g. passing a boolean through the addFiles call on #1092 that determines whether file size is read here? Not sure we need this or that this is the best way, but thought I'd raise the question.

Contributor Author (reply):
That's fine with me. To confirm: we'll be allowing Globus to do this, internally; but not accepting these entries in the json passed to /addFiles.

Also (somewhat unrelated): it would be fairly easy to implement a very similar mass lookup on an S3 folder inside an /addFiles call finalizing a large batch of direct S3 uploads, if that is ever identified as a bottleneck. I never got around to measuring just how much these individual S3 lookups cost, but it always bothered me a little bit that we make separate network/HTTP calls for them. (They definitely do not cost anything remotely approaching what I got with Globus and NESE, though.)

Member (reply):
Yes.

re: S3: Makes sense. It is nice to have the size lookup in the S3AccessIO class as long as it is working. (I think we have switched to having one shared AWS client across S3AccessIO instances, which hopefully keeps an HTTP connection open, so hopefully it's not too bad. That said, it wouldn't hurt to see how long it takes when it's thousands of files.)

suppliedFileSize = optionalFileParams.getFileSize();
}
}

msgt("step_030_createNewFilesViaIngest");
@@ -1204,20 +1208,11 @@ private boolean step_030_createNewFilesViaIngest(){
clone = workingVersion.cloneDatasetVersion();
}
try {
/*CreateDataFileResult result = FileUtil.createDataFiles(workingVersion,
this.newFileInputStream,
this.newFileName,
this.newFileContentType,
this.newStorageIdentifier,
this.newCheckSum,
this.newCheckSumType,
this.systemConfig);*/

UploadSessionQuotaLimit quota = null;
if (systemConfig.isStorageQuotasEnforced()) {
quota = fileService.getUploadSessionQuotaLimit(dataset);
}
Command<CreateDataFileResult> cmd = new CreateNewDataFilesCommand(dvRequest, workingVersion, newFileInputStream, newFileName, newFileContentType, newStorageIdentifier, quota, newCheckSum, newCheckSumType);
Command<CreateDataFileResult> cmd = new CreateNewDataFilesCommand(dvRequest, workingVersion, newFileInputStream, newFileName, newFileContentType, newStorageIdentifier, quota, newCheckSum, newCheckSumType, suppliedFileSize);
CreateDataFileResult createDataFilesResult = commandEngine.submit(cmd);
initialFileList = createDataFilesResult.getDataFiles();

@@ -39,6 +39,12 @@
* - Provenance related information
*
* @author rmp553
* @todo (?) We may want to consider renaming this class to DataFileParams or
* DataFileInfo... it was originally created to encode some bits of info -
* the file "tags" specifically, that didn't fit in elsewhere in the normal
* workflow; but it's been expanded to cover pretty much everything else associated
* with DataFiles and it's not really "optional" anymore when, for example, used
* in the direct upload workflow. (?)
*/
public class OptionalFileParams {

@@ -76,6 +82,8 @@ public class OptionalFileParams {
public static final String MIME_TYPE_ATTR_NAME = "mimeType";
private String checkSumValue;
private ChecksumType checkSumType;
public static final String FILE_SIZE_ATTR_NAME = "fileSize";
private Long fileSize;
public static final String LEGACY_CHECKSUM_ATTR_NAME = "md5Hash";
public static final String CHECKSUM_OBJECT_NAME = "checksum";
public static final String CHECKSUM_OBJECT_TYPE = "@type";
@@ -268,6 +276,18 @@ public String getCheckSum() {
public ChecksumType getCheckSumType() {
return checkSumType;
}

public boolean hasFileSize() {
return fileSize != null;
}

public Long getFileSize() {
return fileSize;
}

public void setFileSize(long fileSize) {
this.fileSize = fileSize;
}

/**
* Set tags
@@ -416,7 +436,13 @@ else if ((jsonObj.has(CHECKSUM_OBJECT_NAME)) && (!jsonObj.get(CHECKSUM_OBJECT_NA
this.checkSumType = ChecksumType.fromString(((JsonObject) jsonObj.get(CHECKSUM_OBJECT_NAME)).get(CHECKSUM_OBJECT_TYPE).getAsString());

}

// -------------------------------
// get file size as a Long, if supplied
// -------------------------------
if ((jsonObj.has(FILE_SIZE_ATTR_NAME)) && (!jsonObj.get(FILE_SIZE_ATTR_NAME).isJsonNull())){

this.fileSize = jsonObj.get(FILE_SIZE_ATTR_NAME).getAsLong();
}
// -------------------------------
// get tags
// -------------------------------
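For context, here is a small, hypothetical sketch of the kind of per-file JSON entry that the Globus upload workflow assembles internally and hands to AddReplaceFileHelper.addFiles(), now including the new fileSize attribute. The attribute names md5Hash, mimeType and fileSize match the constants parsed by OptionalFileParams above; every value, and the remaining field names, are placeholders made up for illustration. (Per the review discussion earlier, the intent is for fileSize to be honored only when supplied internally by the Globus code, not when sent by external API clients.)

import jakarta.json.Json;
import jakarta.json.JsonObject;

public class GlobusFileEntrySketch {
    public static void main(String[] args) {
        // Hypothetical entry for one uploaded file; all values are placeholders.
        JsonObject fileEntry = Json.createObjectBuilder()
                .add("storageIdentifier", "globus://placeholder-store-identifier") // assumed field name
                .add("fileName", "example_0001.dat")
                .add("md5Hash", "d41d8cd98f00b204e9800998ecf8427e")
                .add("mimeType", "application/octet-stream")
                .add("fileSize", 123456789L) // the new attribute added in this PR
                .build();
        System.out.println(fileEntry);
    }
}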
@@ -93,6 +93,10 @@ public CreateNewDataFilesCommand(DataverseRequest aRequest, DatasetVersion versi
this(aRequest, version, inputStream, fileName, suppliedContentType, newStorageIdentifier, quota, newCheckSum, newCheckSumType, null, null);
}

public CreateNewDataFilesCommand(DataverseRequest aRequest, DatasetVersion version, InputStream inputStream, String fileName, String suppliedContentType, String newStorageIdentifier, UploadSessionQuotaLimit quota, String newCheckSum, DataFile.ChecksumType newCheckSumType, Long newFileSize) {
this(aRequest, version, inputStream, fileName, suppliedContentType, newStorageIdentifier, quota, newCheckSum, newCheckSumType, newFileSize, null);
}

// This version of the command must be used when files are created in the
// context of creating a brand new dataset (from the Add Dataset page):

@@ -74,6 +74,7 @@
import edu.harvard.iq.dataverse.util.URLTokenUtil;
import edu.harvard.iq.dataverse.util.UrlSignerUtil;
import edu.harvard.iq.dataverse.util.json.JsonUtil;
import jakarta.json.JsonNumber;
import jakarta.json.JsonReader;
import jakarta.persistence.EntityManager;
import jakarta.persistence.PersistenceContext;
@@ -284,6 +285,52 @@ private int makeDir(GlobusEndpoint endpoint, String dir) {
return result.status;
}

private Map<String, Long> lookupFileSizes(GlobusEndpoint endpoint, String dir) {
MakeRequestResponse result;

try {
logger.fine("Attempting to look up the contents of the Globus folder "+dir);
URL url = new URL(
"https://transfer.api.globusonline.org/v0.10/operation/endpoint/" + endpoint.getId()
+ "/ls?path=" + dir);
result = makeRequest(url, "Bearer", endpoint.getClientToken(), "GET", null);

switch (result.status) {
case 200:
logger.fine("Looked up directory " + dir + " successfully.");
break;
default:
logger.warning("Status " + result.status + " received when looking up dir " + dir);
logger.fine("Response: " + result.jsonResponse);
return null;
}
} catch (MalformedURLException ex) {
// Misconfiguration
logger.warning("Failed to list the contents of the directory "+ dir + " on endpoint " + endpoint.getId());
return null;
}

Map<String, Long> ret = new HashMap<>();

JsonObject listObject = JsonUtil.getJsonObject(result.jsonResponse);
JsonArray dataArray = listObject.getJsonArray("DATA");

if (dataArray != null && !dataArray.isEmpty()) {
for (int i = 0; i < dataArray.size(); i++) {
String dataType = dataArray.getJsonObject(i).getString("DATA_TYPE", null);
if (dataType != null && dataType.equals("file")) {
// is it safe to assume that any entry with a valid "DATA_TYPE": "file"
// will also have valid "name" and "size" entries?
String fileName = dataArray.getJsonObject(i).getString("name");
long fileSize = dataArray.getJsonObject(i).getJsonNumber("size").longValueExact();
ret.put(fileName, fileSize);
}
}
}

return ret;
}
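The parsing loop above assumes the listing response carries a DATA array of entries with DATA_TYPE, name and size fields. Below is a minimal, self-contained sketch of just that parsing step, fed a made-up response body of the shape the code expects (this is not an actual Globus API response, and the class name is invented for illustration):

import jakarta.json.Json;
import jakarta.json.JsonArray;
import jakarta.json.JsonObject;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

public class GlobusListingParseSketch {
    public static void main(String[] args) {
        // Made-up listing body, shaped the way lookupFileSizes() expects it.
        String body = "{\"DATA\":[{\"DATA_TYPE\":\"file\",\"name\":\"a.dat\",\"size\":1024},"
                + "{\"DATA_TYPE\":\"file\",\"name\":\"b.dat\",\"size\":2048}]}";
        JsonObject listObject = Json.createReader(new StringReader(body)).readObject();
        JsonArray dataArray = listObject.getJsonArray("DATA");
        Map<String, Long> sizes = new HashMap<>();
        for (int i = 0; i < dataArray.size(); i++) {
            JsonObject entry = dataArray.getJsonObject(i);
            // Only entries marked as files carry a size we can use.
            if ("file".equals(entry.getString("DATA_TYPE", null))) {
                sizes.put(entry.getString("name"), entry.getJsonNumber("size").longValueExact());
            }
        }
        System.out.println(sizes.get("a.dat") + " " + sizes.get("b.dat")); // 1024 2048
    }
}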

private int requestPermission(GlobusEndpoint endpoint, Dataset dataset, Permissions permissions) {
Gson gson = new GsonBuilder().create();
MakeRequestResponse result = null;
@@ -938,9 +985,20 @@ private void processUploadedFiles(JsonArray filesJsonArray, Dataset dataset, Aut

inputList.add(fileId + "IDsplit" + fullPath + "IDsplit" + fileName);
}

Map<String, Long> fileSizeMap = null;

if (filesJsonArray.size() >= systemConfig.getGlobusBatchLookupSize()) {
// Look up the sizes of all the files in the dataset folder, to avoid
// looking them up one by one later:
// @todo: we should only be doing this if this is a managed store, probably (?)
GlobusEndpoint endpoint = getGlobusEndpoint(dataset);
fileSizeMap = lookupFileSizes(endpoint, endpoint.getBasePath());
}

// calculateMissingMetadataFields: checksum, mimetype
JsonObject newfilesJsonObject = calculateMissingMetadataFields(inputList, myLogger);

JsonArray newfilesJsonArray = newfilesJsonObject.getJsonArray("files");
logger.fine("Size: " + newfilesJsonArray.size());
logger.fine("Val: " + JsonUtil.prettyPrint(newfilesJsonArray.getJsonObject(0)));
@@ -964,20 +1022,26 @@ if (newfileJsonObject != null) {
if (newfileJsonObject != null) {
logger.fine("List Size: " + newfileJsonObject.size());
// if (!newfileJsonObject.get(0).getString("hash").equalsIgnoreCase("null")) {
JsonPatch path = Json.createPatchBuilder()
JsonPatch patch = Json.createPatchBuilder()
.add("/md5Hash", newfileJsonObject.get(0).getString("hash")).build();
fileJsonObject = path.apply(fileJsonObject);
path = Json.createPatchBuilder()
fileJsonObject = patch.apply(fileJsonObject);
patch = Json.createPatchBuilder()
.add("/mimeType", newfileJsonObject.get(0).getString("mime")).build();
fileJsonObject = path.apply(fileJsonObject);
fileJsonObject = patch.apply(fileJsonObject);
// If we already know the size of this file on the Globus end,
// we'll pass it to /addFiles, to avoid looking up file sizes
// one by one:
if (fileSizeMap != null && fileSizeMap.get(fileId) != null) {
Long uploadedFileSize = fileSizeMap.get(fileId);
myLogger.info("Found size for file " + fileId + ": " + uploadedFileSize + " bytes");
patch = Json.createPatchBuilder()
.add("/fileSize", Json.createValue(uploadedFileSize)).build();
fileJsonObject = patch.apply(fileJsonObject);
} else {
logger.fine("No file size entry found for file "+fileId);
}
addFilesJsonData.add(fileJsonObject);
countSuccess++;
// } else {
// globusLogger.info(fileName
// + " will be skipped from adding to dataset by second API due to missing
// values ");
// countError++;
// }
} else {
myLogger.info(fileName
+ " will be skipped from adding to dataset in the final AddReplaceFileHelper.addFiles() call. ");
@@ -1211,7 +1275,7 @@ private GlobusTaskState globusStatusCheck(GlobusEndpoint endpoint, String taskId
return task;
}

public JsonObject calculateMissingMetadataFields(List<String> inputList, Logger globusLogger)
private JsonObject calculateMissingMetadataFields(List<String> inputList, Logger globusLogger)
throws InterruptedException, ExecutionException, IOException {

List<CompletableFuture<FileDetailsHolder>> hashvalueCompletableFutures = inputList.stream()
Expand All @@ -1230,7 +1294,7 @@ public JsonObject calculateMissingMetadataFields(List<String> inputList, Logger
});

JsonArrayBuilder filesObject = (JsonArrayBuilder) completableFuture.get();

JsonObject output = Json.createObjectBuilder().add("files", filesObject).build();

return output;
@@ -344,10 +344,20 @@ public List<DataFile> saveAndAddFilesToDataset(DatasetVersion version,
try {
StorageIO<DvObject> dataAccess = DataAccess.getStorageIO(dataFile);
//Populate metadata
dataAccess.open(DataAccessOption.READ_ACCESS);
// (the .open() above makes a remote call to check if
// the file exists and obtains its size)
confirmedFileSize = dataAccess.getSize();

// There are direct upload sub-cases where the file size
// is already known at this point. For example, direct uploads
// to S3 that go through the jsf dataset page. Or the Globus
// uploads, where the file sizes are looked up in bulk on
// the completion of the remote upload task.
if (dataFile.getFilesize() > 0) {
Member (review comment):
Can we have 0 length files?

Contributor Author (reply):
Oh, so that line wasn't really breaking support for 0-size files - it was only potentially forcing Dataverse to make extra lookups of the size of 0-size files unnecessarily...

confirmedFileSize = dataFile.getFilesize();
} else {
dataAccess.open(DataAccessOption.READ_ACCESS);
// (the .open() above makes a remote call to check if
// the file exists and obtains its size)
confirmedFileSize = dataAccess.getSize();
}

// For directly-uploaded files, we will perform the file size
// limit and quota checks here. Perform them *again*, in
@@ -362,13 +372,16 @@ public List<DataFile> saveAndAddFilesToDataset(DatasetVersion version,
if (fileSizeLimit == null || confirmedFileSize < fileSizeLimit) {

//set file size
logger.fine("Setting file size: " + confirmedFileSize);
dataFile.setFilesize(confirmedFileSize);
if (dataFile.getFilesize() < 1) {
Member (review comment):
< 0 ?

Contributor Author (reply):
I did stare at this line last night. If getFilesize() == 0 at this point, so is confirmedFileSize, so this shouldn't really matter. But, why not.

logger.fine("Setting file size: " + confirmedFileSize);
dataFile.setFilesize(confirmedFileSize);
}

if (dataAccess instanceof S3AccessIO) {
((S3AccessIO<DvObject>) dataAccess).removeTempTag();
}
savedSuccess = true;
logger.info("directly uploaded file successfully saved. file size: "+dataFile.getFilesize());
}
} catch (IOException ioex) {
logger.warning("Failed to get file size, storage id, or failed to remove the temp tag on the saved S3 object" + dataFile.getStorageIdentifier() + " ("
@@ -539,6 +539,12 @@ Whether Harvesting (OAI) service is enabled
*
*/
GlobusSingleFileTransfer,
/** Lower limit of the number of files in a Globus upload task where
* the batch mode should be utilized in looking up the file information
* on the remote end node (file sizes, primarily), instead of individual
* lookups.
*/
GlobusBatchLookupSize,
/**
* Optional external executables to run on the metadata for dataverses
* and datasets being published; as an extra validation step, to
6 changes: 6 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/util/SystemConfig.java
@@ -78,6 +78,7 @@ public class SystemConfig {
public static final long defaultZipDownloadLimit = 104857600L; // 100MB
private static final int defaultMultipleUploadFilesLimit = 1000;
private static final int defaultLoginSessionTimeout = 480; // = 8 hours
private static final int defaultGlobusBatchLookupSize = 50;

private String buildNumber = null;

@@ -954,6 +955,11 @@ public boolean isGlobusFileDownload() {
return (isGlobusDownload() && settingsService.isTrueForKey(SettingsServiceBean.Key.GlobusSingleFileTransfer, false));
}

public int getGlobusBatchLookupSize() {
String batchSizeOption = settingsService.getValueForKey(SettingsServiceBean.Key.GlobusBatchLookupSize);
return getIntLimitFromStringOrDefault(batchSizeOption, defaultGlobusBatchLookupSize);
}

private Boolean getMethodAvailable(String method, boolean upload) {
String methods = settingsService.getValueForKey(
upload ? SettingsServiceBean.Key.UploadMethods : SettingsServiceBean.Key.DownloadMethods);