
Write a Rodan job package

Ling-Xiao Yang edited this page Mar 24, 2016 · 24 revisions

Jobs are modules that do a specific task in a workflow. This can be as simple as converting an arbitrary image into PNG format, or as complex as performing shape analysis and recognition on an image. This section will serve as an introduction on how to write Jobs for Rodan so that they can be used in a workflow.

All job code should be contained in the rodan/jobs directory. Its sub-directories, such as gamera and neon, each hold a different job package. A job package provides its own directory under this folder, in which multiple Rodan jobs can be defined. A job package can also define the resource types that its jobs require.

1. Describe a Rodan Job

A Rodan job is defined by a class that inherits from rodan.jobs.base.RodanTask. The class should define the following attributes to describe itself:

| attribute | type | description |
| --- | --- | --- |
| name | string | A unique name within all the jobs provided by the vendor. |
| author | string | The author of the job. |
| description | string | |
| settings | [JSON Schema](http://json-schema.org/) [1] | The validation schema that describes the requirements of the job settings. |
| enabled | boolean | |
| category | string | |
| interactive | boolean | Indicates whether the job will pause at some point and wait for manual input. [2] |
| input_port_types | list of Python dictionaries | |
| output_port_types | list of Python dictionaries | |

[1] At present, Rodan only supports a JSON object as the topmost structure of settings.

[2] This attribute is only informative for the users. It does not affect whether the job will pause. The behaviour of the job is determined by the return value of its execution code.

For input_port_types and output_port_types, the following keys should be defined:

| key | type | description |
| --- | --- | --- |
| name | string | |
| resource_types | list of strings OR lambda: string -> boolean | Describes all possible resource MIME-types. If provided with a lambda function, Rodan will automatically filter the matched resource types in its registry. |
| minimum | number | Minimum requirement of the job. 0 indicates no minimum requirement. |
| maximum | number | Maximum requirement of the job. 0 indicates no maximum requirement. |
| is_list | boolean | Whether it should take a Resource or a ResourceList. |
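
As an illustration, a job description might look like the sketch below. The job name, author, category, settings schema, and MIME-types are invented for this example, and a stand-in base class replaces rodan.jobs.base.RodanTask so that the snippet runs outside Rodan. In a real job package the class would inherit from the Rodan base class instead.

```python
class RodanTask(object):
    """Stand-in for rodan.jobs.base.RodanTask so the sketch runs on its own."""


class ConvertToPng(RodanTask):
    # Every value below is a hypothetical example.
    name = 'Convert image to PNG'
    author = 'Jane Doe'
    description = 'Converts an input image into PNG format.'
    settings = {
        # The topmost structure of the schema is a JSON object.
        'type': 'object',
        'properties': {
            'compression': {'type': 'integer', 'minimum': 0, 'maximum': 9}
        }
    }
    enabled = True
    category = 'Conversion'
    interactive = False
    input_port_types = [{
        'name': 'image',
        # A lambda takes a MIME-type string and returns a boolean.
        'resource_types': lambda mime: mime.startswith('image/'),
        'minimum': 1,
        'maximum': 1,
        'is_list': False
    }]
    output_port_types = [{
        'name': 'png image',
        # An explicit list of MIME-type strings.
        'resource_types': ['image/png'],
        'minimum': 1,
        'maximum': 1,
        'is_list': False
    }]
```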

2. Implement the Job

The execution of a job can have two kinds of phases: automatic phases and manual phases. In an automatic phase, the job is sent to background workers that are distributed on the network; in a manual phase, the job communicates with a human user through a web interface over HTTP.

A job always starts and ends with an automatic phase. In between, it may go back and forth between automatic phases and manual phases.

The automatic phases are implemented in the method run_my_task (and my_error_information). The manual phases are implemented in the methods get_my_interface and validate_my_user_input.

2.1 Implement Automatic Phases

The signature of method run_my_task should be:

run_my_task(self, inputs, settings, outputs)

This method is expected to read the resource files as described in inputs, process them according to the configuration in settings, and produce the result files at the paths as described in outputs.

The parameter inputs is a Python dictionary. Each key-value pair maps an input port type to a list of details of its input resources. Each detail is a Python dictionary that includes:

| key | type | value |
| --- | --- | --- |
| resource_path | string | The path to the input resource file. |
| resource_type | string | The MIME-type of the input resource. |

If the input port is list-typed (is_list==True), each provided resource list is represented by a list of such dictionaries.

For example, if a job is executed with 2 inputs typed "image" (not list-typed) and 2 inputs typed "mask" (list-typed), the inputs will be structured like:

{
    "image": [{
        "resource_path": "/some/path/file1",
        "resource_type": "image/jpeg"
    }, {
        "resource_path": "/some/path/file2",
        "resource_type": "image/png"
    }],
    "mask": [
        [{
            "resource_path": "/some/path/file3",
            "resource_type": "image/bmp"
        }, {
            "resource_path": "/some/path/file4",
            "resource_type": "image/bmp"
        }, {
            "resource_path": "/some/path/file5",
            "resource_type": "image/bmp"
        }], [{
            "resource_path": "/some/path/file6",
            "resource_type": "image/bmp"
        }, {
            "resource_path": "/some/path/file7",
            "resource_type": "image/bmp"
        }, {
            "resource_path": "/some/path/file8",
            "resource_type": "image/bmp"
        }]
    ]
}
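
The nesting above can be walked with a small helper. The sketch below is only an illustration: the function name resource_paths is hypothetical and not part of Rodan, and it assumes the caller knows whether the port type is list-typed.

```python
def resource_paths(inputs, port_name, is_list=False):
    """Return the file paths for one input port type.

    For a non-list port type the result is a flat list of paths. For a
    list-typed port type it is a list of lists, one inner list per
    resource list.
    """
    entries = inputs.get(port_name, [])
    if is_list:
        return [[resource['resource_path'] for resource in resource_list]
                for resource_list in entries]
    return [resource['resource_path'] for resource in entries]


inputs = {
    'image': [
        {'resource_path': '/some/path/file1', 'resource_type': 'image/jpeg'},
        {'resource_path': '/some/path/file2', 'resource_type': 'image/png'},
    ],
    'mask': [
        [{'resource_path': '/some/path/file3', 'resource_type': 'image/bmp'},
         {'resource_path': '/some/path/file4', 'resource_type': 'image/bmp'}],
        [{'resource_path': '/some/path/file6', 'resource_type': 'image/bmp'}],
    ],
}

image_paths = resource_paths(inputs, 'image')
mask_paths = resource_paths(inputs, 'mask', is_list=True)
```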

The parameter outputs is structured like the parameter inputs, but the details of each resource are slightly different:

| key | type | value |
| --- | --- | --- |
| resource_path | string | The path that the output file is supposed to be written to (only for non-list-typed ports). |
| resource_folder | string | The folder that all files of the resource list are supposed to be written into (only for list-typed ports). |
| resource_type | string | The MIME-type of the output resource. |

For example, if a job is executed with 2 outputs typed "result" (not list-typed) and 2 outputs typed "aux files" (list-typed), the outputs will be structured like:

{
    "result": [{
        "resource_path": "/some/path/file1",
        "resource_type": "image/jpeg"
    }, {
        "resource_path": "/some/path/file2",
        "resource_type": "image/png"
    }],
    "aux files": [{
        "resource_folder": "/some/path/folder1",
        "resource_type": "image/jpeg"
    }, {
        "resource_folder": "/some/path/folder2",
        "resource_type": "image/png"
    }]
}

Again, in the output object, the resource_path points to a file that does NOT exist, and the resource_folder is empty. The job code should fill them in with output files.
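
A minimal sketch of an automatic phase that follows these rules is shown below. The class name CopyJob, the port names, and the copies setting are hypothetical, and the class does not inherit from the real Rodan base class so that the snippet stays self-contained.

```python
import os
import shutil


class CopyJob(object):
    """Hypothetical job body; a real job would inherit rodan.jobs.base.RodanTask."""

    def run_my_task(self, inputs, settings, outputs):
        src = inputs['image'][0]['resource_path']

        # Non-list-typed output: create the file at resource_path,
        # which does not exist before the job runs.
        dst = outputs['result'][0]['resource_path']
        shutil.copyfile(src, dst)

        # List-typed output: write the files into the empty resource_folder.
        folder = outputs['aux files'][0]['resource_folder']
        for index in range(settings.get('copies', 1)):
            shutil.copyfile(src, os.path.join(folder, 'copy_%d' % index))

        # Returning None is treated as job completion.
        return None
```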

The parameter settings is a Python dictionary that is validated against the JSON schema that the job has defined.

The job can raise any exception in automatic phases. By default, the exception message and the traceback are used as the error summary and the error details, respectively. This behaviour can be changed by defining the method my_error_information(self, exc, traceback), where exc is the exception object and traceback is a traceback object. The method should return a Python dictionary that includes error_summary and error_details.
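
One possible override is sketched below. The class name MyJob and the wording of the summary are hypothetical; only the method signature and the two returned keys come from the description above. The standard library function traceback.format_tb turns the traceback object into text.

```python
import traceback as traceback_module


class MyJob(object):
    """Hypothetical job; a real job would inherit rodan.jobs.base.RodanTask."""

    def my_error_information(self, exc, traceback):
        return {
            # A short line naming the exception type and its message.
            'error_summary': '%s: %s' % (type(exc).__name__, exc),
            # The formatted stack entries of the traceback object.
            'error_details': ''.join(traceback_module.format_tb(traceback)),
        }
```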

If the job needs a temporary directory to work with, the recommended way is:

with self.tempdir() as tempdir:
    # do things inside tempdir

This approach avoids leaving filesystem garbage behind when any exception occurs (including exceptions raised by the Celery environment).

The run_my_task method can return an instance of self.WAITING_FOR_INPUT to indicate that it requires a manual phase (see section 2.3). Any other return value is ignored and treated as a signal of job completion.

2.2 Implement Manual Phases

In manual phases, the job receives and responds to HTTP requests. Upon a GET request, the job needs to provide its web interface; upon a POST request, the job validates the input data and updates its settings accordingly.

2.2.1 get_my_interface

get_my_interface method returns the web interface. Its signature is:

get_my_interface(self, inputs, settings)

The data structure of the argument inputs is similar to its counterpart in automatic phases. In manual phases, however, inputs provides more details so that the interface can locate resource URLs remotely:

| key | type | value |
| --- | --- | --- |
| resource_path | string | The path to the input resource file. |
| resource_type | string | The MIME-type of the input resource. |
| resource_url | string | The URL to the original resource file. |
| small_thumb_url | string | The URL to the small thumbnail. |
| medium_thumb_url | string | The URL to the medium thumbnail. |
| large_thumb_url | string | The URL to the large thumbnail. |

The argument settings is structured the same as its automatic counterpart.

The get_my_interface method is expected to return a tuple (t, c), where t is the relative path to the template HTML file. The path should be relative to the vendor's package, and the template HTML file should be written in the Django template language.

c is a Python dictionary that defines the variables and their values to be rendered in the HTML template.
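
A minimal sketch of such a method follows. The class name CropJob, the template path templates/crop.html, the port name image, and the @crop_box setting are all hypothetical names chosen for illustration.

```python
class CropJob(object):
    """Hypothetical interactive job; a real job would inherit RodanTask."""

    def get_my_interface(self, inputs, settings):
        image = inputs['image'][0]
        # Hypothetical template path, relative to the vendor's package.
        template = 'templates/crop.html'
        # Variables made available to the Django template.
        context = {
            'image_url': image['resource_url'],
            'preview_url': image['medium_thumb_url'],
            'previous_box': settings.get('@crop_box'),
        }
        return (template, context)
```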

The interface can reference resource files, like CSS, JS, and images. Resource files need to be placed in the static folder inside the job vendor directory. For example, if a CSS file is placed at static/css/mystyle.css, the HTML template can provide the following link to it:

<link href="static/css/mystyle.css" rel="stylesheet">

Note: If you link to external stylesheets and JavaScript files provided by a CDN, be sure that the CDN can serve these resources via HTTPS. Otherwise, if Rodan is served via HTTPS, the user's browser will refuse to load the HTTP resources ("Mixed Content" error). A good practice is to link external resources without a protocol type:

<script src="//cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js"></script>

2.2.2 validate_my_user_input

Signature:

validate_my_user_input(self, inputs, settings, user_input)

This method validates the user input received through an HTTP POST request. The user input is provided as JSON data in user_input. If validation fails, the method is expected to raise an instance of self.ManualPhaseException, which results in an HTTP 400 response (with the error message) being sent back to the interface.

If validation passes, the method should return a Python dictionary containing the updates to the settings. All updated keys should start with '@' or they will be discarded (see section 2.3 for the reason). The dictionary can be wrapped in an instance of self.WAITING_FOR_INPUT to let the job stay in the manual phase.

The inputs and settings arguments are structured in the same way as in automatic phases.
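
The validation step can be sketched as follows. ManualPhaseException is defined here as a stand-in for the class that Rodan exposes as self.ManualPhaseException, and passing the error message as the only constructor argument is an assumption. The class name CropJob, the box field, and the @crop_box key are hypothetical.

```python
class ManualPhaseException(Exception):
    """Stand-in for the exception class that Rodan provides on the job."""


class CropJob(object):
    """Hypothetical interactive job; a real job would inherit RodanTask."""

    ManualPhaseException = ManualPhaseException

    def validate_my_user_input(self, inputs, settings, user_input):
        box = user_input.get('box')
        valid = (isinstance(box, list) and len(box) == 4
                 and all(isinstance(v, int) and v >= 0 for v in box))
        if not valid:
            # Assumed to end up as the message of the HTTP 400 response.
            raise self.ManualPhaseException(
                'box must be a list of four non-negative integers')
        # Keys must start with '@', otherwise they are discarded.
        return {'@crop_box': box}
```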

2.3 State Transitions between Automatic Phases and Manual Phases

A job can have multiple automatic phases and manual phases, but there is only one method run_my_task for all automatic phases and one set of methods get_my_interface and validate_my_user_input for all manual phases. However, run_my_task and validate_my_user_input can modify the settings of the job, and thus provide a clue to determine the exact phase according to the value of settings.

As stated above, run_my_task can return an instance of self.WAITING_FOR_INPUT to launch a manual phase. The update of settings can be performed at this point, like:

# in run_my_task
return self.WAITING_FOR_INPUT({'@field1': newVal1, '@field2': newVal2})

Notice that the fields of updated settings must be prefixed with @, in order not to overwrite the original settings. Fields not starting with @ will be removed.

The job methods should read the @-prefixed settings to determine which exact phase to perform.

Similarly, validate_my_user_input can return an unwrapped Python dictionary of setting updates, like:

# in validate_my_user_input
return {'@field1': newVal1, '@field2': newVal2}

If validate_my_user_input needs to let the job stay in a manual phase, it can also return an instance of self.WAITING_FOR_INPUT. Additionally, it can provide an HTTP response to the interface, like:

# in validate_my_user_input
return self.WAITING_FOR_INPUT({'@field1': newVal1, '@field2': newVal2}, response="Please continue working on this manual phase.")
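
Putting these pieces together, a job with two automatic phases and one manual phase could be sketched as below. WAITING_FOR_INPUT is defined here as a stand-in whose constructor mirrors the calls shown above; the real class is provided by Rodan. The class name TwoPhaseJob and the @threshold, @draft_threshold, and @phase keys are hypothetical.

```python
class WAITING_FOR_INPUT(object):
    """Stand-in for the class that Rodan exposes as self.WAITING_FOR_INPUT."""

    def __init__(self, settings_update=None, response=None):
        self.settings_update = settings_update or {}
        self.response = response


class TwoPhaseJob(object):
    """Hypothetical job; a real job would inherit rodan.jobs.base.RodanTask."""

    WAITING_FOR_INPUT = WAITING_FOR_INPUT

    def run_my_task(self, inputs, settings, outputs):
        if '@threshold' not in settings:
            # First automatic phase: no user choice yet, request a manual phase.
            return self.WAITING_FOR_INPUT({'@phase': 'choose_threshold'})
        # Second automatic phase: '@threshold' was stored by the manual phase.
        # The output files would be produced here.
        return None

    def validate_my_user_input(self, inputs, settings, user_input):
        if user_input.get('confirmed'):
            # An unwrapped dictionary ends the manual phase.
            return {'@threshold': user_input['threshold']}
        # A wrapped dictionary keeps the job in the manual phase.
        return self.WAITING_FOR_INPUT(
            {'@draft_threshold': user_input.get('threshold')},
            response='Draft saved.')
```

Each method inspects only the @-prefixed keys to decide which phase it is in, so the original settings are never overwritten.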

3. Test the Job

test_my_task(self, testcase)

This method is called during the unit test of Rodan.

This method should call run_my_task() and/or get_my_interface() and/or validate_my_user_input(). Before calling the job code, this method needs to construct the inputs, settings, and outputs objects that are passed to those methods as parameters.

Its own parameter testcase refers to the Python TestCase object. Aside from assertion methods like assertEqual() and assertRaises(), it provides new_available_path(), which returns a path to a nonexistent file in the temporary filesystem. The test_my_task method can thus create input files at these paths and feed them into the job methods.
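
A sketch of such a test method is shown below. UppercaseJob and its port names are hypothetical, and StandInTestCase only imitates the TestCase object that Rodan passes in: its new_available_path() is a local approximation built on tempfile, not Rodan's implementation.

```python
import os
import tempfile
import unittest


class UppercaseJob(object):
    """Hypothetical job that writes an uppercased copy of a text file."""

    def run_my_task(self, inputs, settings, outputs):
        with open(inputs['text'][0]['resource_path']) as src:
            content = src.read()
        with open(outputs['result'][0]['resource_path'], 'w') as dst:
            dst.write(content.upper())

    def test_my_task(self, testcase):
        # Build the inputs, settings, and outputs objects by hand.
        in_path = testcase.new_available_path()
        out_path = testcase.new_available_path()
        with open(in_path, 'w') as f:
            f.write('rodan')
        inputs = {'text': [{'resource_path': in_path,
                            'resource_type': 'text/plain'}]}
        outputs = {'result': [{'resource_path': out_path,
                               'resource_type': 'text/plain'}]}
        self.run_my_task(inputs, {}, outputs)
        with open(out_path) as f:
            testcase.assertEqual(f.read(), 'RODAN')


class StandInTestCase(unittest.TestCase):
    """Imitates the TestCase object that Rodan hands to test_my_task."""

    def new_available_path(self):
        handle, path = tempfile.mkstemp()
        os.close(handle)
        os.remove(path)  # the returned path must not exist yet
        return path

    def runTest(self):
        UppercaseJob().test_my_task(self)
```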

4. Describe Resource Types

The resource MIME-types should be defined for Rodan to recognize them. A vendor can describe the required resource MIME-types through a file resource_types.yaml in the vendor directory. It is a list of mappings, which include:

| name | type | description |
| --- | --- | --- |
| mimetype | string | |
| description (optional) | string | |
| extension (optional) | string | The suggested extension of this resource type. |
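
To make the structure concrete, the sketch below shows the kind of data that resource_types.yaml might contain, written as the equivalent Python list of mappings. The MIME-types, descriptions, and extensions are hypothetical examples, and check_resource_types is an illustrative helper, not a Rodan function.

```python
# Hypothetical entries, shown as the Python equivalent of the YAML list.
resource_types = [
    {'mimetype': 'image/png', 'description': 'PNG image', 'extension': 'png'},
    {'mimetype': 'application/x-example+xml', 'description': 'Example XML'},
    {'mimetype': 'text/plain'},
]


def check_resource_types(entries):
    """Return a list of problems; only mimetype is treated as required."""
    problems = []
    for index, entry in enumerate(entries):
        if not isinstance(entry.get('mimetype'), str):
            problems.append('entry %d: mimetype is required' % index)
    return problems
```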

5. Import the Job

Rodan imports the vendor module according to RODAN_JOB_PACKAGES in settings_production.py. Therefore, it is the vendor's responsibility to import the jobs in the outermost __init__.py of the package. It is not necessary to import every class, though: import the Python file that contains the job classes, and Rodan will find the job classes and register them.

It is safer to use the rodan.jobs.module_loader function to import the job modules. module_loader will catch the ImportError and write it into the log file instead of throwing an exception that terminates Rodan.
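
The behaviour described for module_loader can be pictured with the stand-in below. It is not Rodan's implementation and makes no claim about the real function's signature. The function name load_job_module and the logger name are hypothetical; the sketch only shows the catch-and-log idea using the standard library.

```python
import importlib
import logging

logger = logging.getLogger('rodan')


def load_job_module(name):
    """Illustrative stand-in: import a module, logging an ImportError."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        # Log the failure instead of letting it terminate the application.
        logger.warning('Could not import job module %s: %s', name, exc)
        return None
```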
