Empty validation error when submitting a processing job, related to requested volume size #4939

Open
btlorch opened this issue Nov 21, 2024 · 2 comments

btlorch commented Nov 21, 2024

Dear SageMaker team,

I am experiencing issues when trying to submit a SageMaker processing job. Job submission fails with the following error:

botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the CreateProcessingJob operation: 

Unfortunately, the error message is empty.

After lots of trial and error, I believe the error is related to the requested volume size. When I request a volume size of 600 GB or below, everything runs smoothly. The issue appears when I request 700 GB or above. When I request more than 1024 GB, I receive a different error message:

botocore.errorfactory.ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateProcessingJob operation: The request delta of 1025 GBs for 'Size of EBS volume for a processing job instance' is greater than the account-level service-limit of 1024 GBs. Please use AWS Service Quotas to request an increase for this quota. If AWS Service Quotas is not available, contact AWS support to request an increase for this quota.

If 1024 GB is my account's quota, then 700 GB should be well within it, so this does not appear to be a quota issue.
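
For reference, here is a minimal sketch for confirming the account-level quota via the AWS Service Quotas API in boto3. The quota name string is copied verbatim from the error message above; treat it as an assumption rather than a documented constant.

import boto3

# Look up the account-level SageMaker quota that the error message refers to.
# The quota name below is copied from the error above (assumption: it matches
# the name used by Service Quotas in your region).
client = boto3.client("service-quotas")
paginator = client.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        if quota["QuotaName"] == "Size of EBS volume for a processing job instance":
            print(f"{quota['QuotaName']}: {quota['Value']} GB")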

Is there a limit that I am not aware of? In any case, I would expect a non-empty error message.

Code to reproduce

Here is a toy example (with some placeholders).

This is the job to be executed, process.py: a simple Python script that counts the number of JPEG files in a given directory.

import argparse
import json
import os
import sys


def find_files_recursively(directory, file_extensions=(".jpeg", ".jpg")):
    """
    Recursively find all the files with the given file extensions
    :param directory: input directory to search recursively
    :param file_extensions: list of file extensions
    :return: iterator that yields absolute paths
    """
    file_extensions = set(file_extensions)
    for root, dirs, files in os.walk(directory):
        for basename in files:
            ext = os.path.splitext(basename)[1].lower()
            if ext in file_extensions:
                filepath = os.path.join(root, basename)
                yield filepath


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_dir", type=str, required=True, help="A local directory")
    parser.add_argument("--output_dir", type=str, help="Output directory", default="/opt/ml/processing/output")
    args = vars(parser.parse_args())

    if not os.path.exists(args["input_dir"]):
        print(f"Given input directory {args['input_dir']} does not exist")
        sys.exit(1)

    all_filepaths = list(find_files_recursively(args["input_dir"]))

    print(f"Number of images in {args['input_dir']}: {len(all_filepaths)}")

    # Store result to a JSON file
    result = {
        "input_dir": args["input_dir"],
        "number_of_files": len(all_filepaths),
    }

    output_file = os.path.join(args["output_dir"], "file_counts.json")
    with open(output_file, "w") as f:
        json.dump(result, f, indent=4)
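
Run locally, the script works as expected (the paths below are placeholders):

mkdir -p /tmp/out
python process.py --input_dir /path/to/images --output_dir /tmp/out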

This is how the job is submitted:

from sagemaker.pytorch import PyTorchProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import Session
import os


# Some constants
BASE_DIR = os.path.dirname(__file__)
ROLE = ... # sourced from an environment variable
IMAGE_URI = ... # a custom image

# Bucket for input and output data
bucket = "my-custom-bucket"
sagemaker_session = Session(default_bucket=bucket)


processor = PyTorchProcessor(
    image_uri=IMAGE_URI,
    framework_version=None,
    role=ROLE,
    instance_count=1,
    instance_type="ml.g5.4xlarge",
    volume_size_in_gb=700,
    sagemaker_session=sagemaker_session,
    code_location=f"s3://{bucket}/tmp", # S3 prefix where the code is uploaded to
    base_job_name="debug-sagemaker-processing-700",
    command=["python"],
    env={
        "PYTHONPATH": "/opt/ml/processing/input/code"
    },
)

# Remote paths must begin with "/opt/ml/processing/".
data_root = "/opt/ml/processing/data"
models_dir = "/opt/ml/processing/models"
output_dir = "/opt/ml/processing/output"

processor.run(
    inputs=[
        ProcessingInput(source=f"s3://{bucket}", destination=data_root),
    ],
    outputs=[ProcessingOutput(
        output_name="output",
        source=output_dir,
        destination=f"s3://{bucket}/tmp")
    ],
    code="process.py",
    source_dir=BASE_DIR,
    arguments=[
        "--input_dir", f"{data_root}/data",
        "--output_dir", output_dir,
    ],
    wait=False,
)

print(f"Processing job submitted. Check output in s3://{bucket}/tmp/")

Expected behavior
The submit script should produce the following output:

INFO:sagemaker:Creating processing-job with name debug-sagemaker-processing-700-2024-11-21-14-45-34-336
Processing job submitted. Check output in s3://my-custom-bucket/tmp/

Instead, I am getting the following error message and a stack trace:

botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the CreateProcessingJob operation: 

The error message is empty.
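
To verify that the message is genuinely empty rather than lost in traceback formatting, the parsed error body can be inspected directly. A sketch wrapping the run() call from above (arguments abbreviated here):

from botocore.exceptions import ClientError

try:
    # Same call as in the submission script above, abbreviated.
    processor.run(code="process.py", source_dir=BASE_DIR, wait=False)
except ClientError as e:
    # botocore parses the HTTP error body into e.response["Error"];
    # in this case "Message" is an empty string.
    print(e.response["Error"]["Code"])                   # ValidationException
    print(repr(e.response["Error"].get("Message", "")))  # ''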

System information

  • Python 3.11
  • boto3 version 1.34.153
@btlorch btlorch added the bug label Nov 21, 2024
@leo4ever

@btlorch - In my case, I get the validation exception when I pass multiple inputs (a list of ProcessingInput objects).


btlorch commented Nov 22, 2024

Thankfully, a colleague of mine found the problem: ml.g5.4xlarge instances only have 600 GB of local disk space, so a 700 GB volume cannot be allocated. One solution is to switch to an instance type with more disk space.
I'll keep this issue open as a reminder to improve the error message.
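
For anyone hitting the same wall, a minimal pre-submission guard might look like the sketch below; the ml.g5.4xlarge figure comes from this thread, and any further entries are assumptions to check against the official instance documentation.

# Hypothetical lookup table of local storage per instance type, in GB.
# Only the ml.g5.4xlarge entry is confirmed by this issue.
MAX_VOLUME_GB = {
    "ml.g5.4xlarge": 600,
}


def check_volume_size(instance_type: str, volume_size_in_gb: int) -> None:
    """Raise a descriptive error instead of an empty ValidationException."""
    limit = MAX_VOLUME_GB.get(instance_type)
    if limit is not None and volume_size_in_gb > limit:
        raise ValueError(
            f"{instance_type} supports at most {limit} GB of local storage, "
            f"but {volume_size_in_gb} GB was requested."
        )


check_volume_size("ml.g5.4xlarge", 700)  # raises ValueError with a clear message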
