Scale Job Inputs and Outputs

Jonathan Meyer edited this page Oct 18, 2019 · 6 revisions

Quick Start

Scale provides a number of ways to feed data to your job and to capture its outputs for you. Our goal is to keep integration with Scale as simple as possible, which is reflected in the conventions we have established for interfacing with it. The most common use case is to define one or more required input files and produced output files via your Seed interface. Take the following Seed interface as an example:

{
  "seedVersion": "1.0.0",
  "job": {
    "name": "image-watermark",
    "jobVersion": "0.1.0",
    "packageVersion": "0.1.0",
    "title": "Image Watermarker",
    "description": "Processes an input PNG and outputs watermarked PNG.",
    "maintainer": {
      "name": "John Doe",
      "email": "[email protected]"
    },
    "timeout": 30,
    "interface": {
      "command": "${INPUT_IMAGE} ${OUTPUT_DIR}",
      "inputs": {
        "files": [
          {
            "name": "INPUT_IMAGE"
          }
        ]
      },
      "outputs": {
        "files": [
          {
            "name": "OUTPUT_IMAGE",
            "pattern": "*_watermark.png"
          }
        ]
      }
    },
    "resources": {
      "scalar": [
        {
          "name": "cpus",
          "value": 1
        },
        {
          "name": "mem",
          "value": 64
        }
      ]
    },
    "errors": [
      {
        "code": 1,
        "name": "image-Corrupt-1",
        "description": "Image input is not recognized as a valid PNG.",
        "category": "data"
      },
      {
        "code": 2,
        "name": "algorithm-failure"
      }
    ]
  }
}

Inputs

When Scale launches this job it will provide the following at run-time:

  • Environment variables (most required by Seed: https://ngageoint.github.io/seed/seed.html#variable-injection)
    • ALLOCATED_CPUS: The number of CPU cores (possibly fractional) dedicated to the job
    • ALLOCATED_MEM: The memory (in mebibytes) allocated to the job
    • INPUT_IMAGE: The absolute path to the input file
    • INPUT_METADATA: The absolute path to Scale metadata in JSON format. This provides detailed information on all inputs, whether provided directly to the job or by the recipe that contains it.
    • OUTPUT_DIR: The absolute path to where you must write output files and metadata.
  • Files
    • /scale/input_data/INPUT_IMAGE/unmarked.png: The file name is a sample, but this is the path where the input will reside.

Outputs

Scale will capture the output file for you. Your algorithm must write its output image to a name that matches $OUTPUT_DIR/*_watermark.png. That's all there is to it! More advanced options are documented below.
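To make the input/output contract concrete, here is a minimal sketch of a job entry point, assuming the job is written in Python. The function names are illustrative, not part of Seed or Scale, and a file copy stands in for the real watermarking step:

```python
import os
import shutil


def output_path_for(input_image: str, output_dir: str) -> str:
    """Build an output name that matches the Seed pattern '*_watermark.png'."""
    base = os.path.splitext(os.path.basename(input_image))[0]
    return os.path.join(output_dir, base + "_watermark.png")


def run() -> None:
    # INPUT_IMAGE and OUTPUT_DIR are injected by Scale at run-time
    input_image = os.environ["INPUT_IMAGE"]
    output_dir = os.environ["OUTPUT_DIR"]
    # Real watermarking is omitted; a copy stands in for the processing step
    shutil.copyfile(input_image, output_path_for(input_image, output_dir))


if __name__ == "__main__":
    run()
```

Because the output name ends in `_watermark.png`, Scale's pattern capture picks it up without any further action from the job.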

Reference

For advanced cases, Scale provides a wealth of metadata to help you understand the context in which your job is running. Scale provides inputs and context via both files and environment variables. At a minimum, the environment variables are those defined by Seed, supplemented by Scale:

  • INPUT_METADATA: The absolute path to Scale contextual data in JSON format. This provides detailed information on all inputs, whether provided directly to the job or by the recipe that contains it.

Scale Context

It is a common pattern for Scale jobs to push data to an external data store as a final step within a processing workflow. Doing this often requires a publicly addressable reference back to the Scale product, which can be found within the Scale context file under the url member for an input file. The context file will contain both a JOB and a RECIPE member that include all the input names from the Seed interface. An example of this for the above watermark job would be:

{
  "JOB": {
    "INPUT_IMAGE": [
      {
        "id": 24,
        "file_name": "sample_image.png",
        "workspace": {
          "id": 1,
          "name": "main-dev-input"
        },
        "data_type_tags": [],
        "media_type": "image/png",
        "file_type": "SOURCE",
        "file_size": 104302,
        "file_path": "ingesting/sample_image.png",
        "is_deleted": false,
        "url": "https://mydomain.com/ingesting/sample_image.png",
        "created": "2019-09-10T20:27:53.031110Z",
        "deleted": null,
        "data_started": null,
        "data_ended": null,
        "source_started": null,
        "source_ended": null,
        "source_sensor_class": null,
        "source_sensor": null,
        "source_collection": null,
        "source_task": null,
        "last_modified": "2019-09-10T20:27:53.104362Z",
        "geometry": null,
        "center_point": null,
        "countries": [],
        "job_type": null,
        "job": null,
        "job_exe": null,
        "job_output": null,
        "recipe_type": null,
        "recipe": null,
        "recipe_node": null,
        "batch": null,
        "is_superseded": false,
        "superseded": null,
        "meta_data": {}
      }
    ]
  }
}

The above sample is merely a serialized form of the data that lives within the scale_file table that tracks all files known by the Scale system. The following metadata sections explain how jobs can populate this data.
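A job that needs the publicly addressable url can load the context file named by INPUT_METADATA and walk the JOB member. The following is a minimal sketch, assuming a Python job; the helper names are illustrative, not part of the Scale API:

```python
import json
import os


def load_context() -> dict:
    """Read the Scale context file; INPUT_METADATA holds its absolute path."""
    with open(os.environ["INPUT_METADATA"]) as handle:
        return json.load(handle)


def input_urls(context: dict, input_name: str) -> list:
    """Collect the publicly addressable 'url' of each file bound to a JOB input."""
    return [entry["url"] for entry in context.get("JOB", {}).get(input_name, [])]
```

For the sample context above, `input_urls(context, "INPUT_IMAGE")` would yield the single URL `https://mydomain.com/ingesting/sample_image.png`.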

Output Metadata

Files are the most basic resource consumed and produced by Seed jobs, and Scale takes responsibility for offering and capturing them for you. It is very common for geo-temporal data to be associated with a product, yet few data types natively support it. Seed provides a mechanism to make this data available to Scale by way of the metadata sidecar interface. All that is necessary is to write a GeoJSON-formatted metadata file (file name with suffix .metadata.json) alongside any output file, and Scale will capture the geometry as well as any associated properties. With our image watermarker example:

  • $OUTPUT_DIR/watermarked.png
  • $OUTPUT_DIR/watermarked.png.metadata.json

Within our watermarked.png.metadata.json, we can set a geometry along with start and end dates:

{
  "type": "Feature",
  "geometry": {
    "type": "Point",
    "coordinates": [125.6, 10.1]
  },
  "properties": {
    "dataStarted": "2019-10-14T00:00:00Z",
    "dataEnded": "2019-10-14T00:01:00Z"
  }
}

Source Input Metadata

Scale must ingest files before it can offer them to jobs. Ingest provides only a very limited set of information (file size, storage location, name, etc.); more in-depth knowledge requires reading the actual data contents. This is often performed by a special-purpose job that sits at the top of a recipe to extract data-type-specific information. This job accepts an input file just like any other job, but has the unique task of mutating the metadata of an input, as opposed to an output product. Let's take our image watermarker example again, but this time assume we want to augment the metadata of the INPUT_IMAGE file. To do so we write a metadata sidecar in a similar fashion, but it must live in the OUTPUT_DIR. Consider the following files within the container environment:

  • /scale/input_data/INPUT_IMAGE/sample_image.png: Runtime injected input file. Created by Scale.
  • $OUTPUT_DIR/INPUT_IMAGE.metadata.json: Uses the input interface key name as its prefix, not the runtime-injected file_name. Created by the algorithm developer to pass metadata back to Scale.

Our INPUT_IMAGE.metadata.json is of the form:

{
  "type": "Feature",
  "geometry": {
    "type": "Point",
    "coordinates": [125.6, 10.1]
  },
  "properties": {
    "dataStarted": "2019-10-14T00:00:00Z",
    "dataEnded": "2019-10-14T00:01:00Z",
    "dataTypes": [ "one", "two", "three" ]
  }
}

Once captured by Scale on job completion, this will be provided in the Scale Context to any downstream Jobs or conditional evaluations within the parent Recipe.

JSON Output
