jhu-idc/derivative-ms

Derivative Microservices

This repository contains a rewrite of the Islandora microservices houdini, homarus, and hypercube; FITS is still a TODO. It should be considered prototype-level quality. The microservices use STOMP to communicate with ActiveMQ. AMQ-4710 is a long-standing bug that affects the reliability of STOMP clients, so these microservices require a patched version of ActiveMQ.

Usage

$ ./derivative-ms -h
Usage of /Users/esm/go/bin/derivative-ms:
  -ack string
        STOMP acknowledgment mode, e.g. 'client' or 'auto' (default "client")
  -config string
        Path to handler configuration file
  -host string
        STOMP broker host name, e.g. 'islandora-idc.traefik.me' (default "localhost")
  -pass string
        STOMP broker password
  -port int
        STOMP broker port (default 61613)
  -queue string
        Queue to read messages from, e.g. 'islandora-connector-homarus' or 'ActiveMQ.DLQ'
  -user string
        STOMP broker user name

Argument | Required | Default         | Description
---------|----------|-----------------|------------------------------------------------
ack      | yes      | client          | STOMP message acknowledgement mode
config   | no       | embedded config | path to microservice handler configuration file
host     | yes      | localhost       | STOMP broker host name
port     | yes      | 61613           | STOMP broker port
user     | no       | ""              | STOMP broker user name
pass     | no       | ""              | STOMP broker password
queue    | yes      | ""              | STOMP queue to listen to
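
As an illustration of the options above, here is a minimal sketch of how such flags could be declared with Go's standard flag package. The options struct and parseOptions function are hypothetical names, and rejecting an empty -queue is an assumption based on the table, not a description of the repository's actual code.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

// options mirrors the flags in the usage output. The type is illustrative only.
type options struct {
	Ack, Config, Host, User, Pass, Queue string
	Port                                  int
}

// parseOptions declares the flags with the documented defaults and parses args.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("derivative-ms", flag.ContinueOnError)
	fs.StringVar(&o.Ack, "ack", "client", "STOMP acknowledgment mode, e.g. 'client' or 'auto'")
	fs.StringVar(&o.Config, "config", "", "Path to handler configuration file")
	fs.StringVar(&o.Host, "host", "localhost", "STOMP broker host name")
	fs.StringVar(&o.Pass, "pass", "", "STOMP broker password")
	fs.IntVar(&o.Port, "port", 61613, "STOMP broker port")
	fs.StringVar(&o.Queue, "queue", "", "Queue to read messages from")
	fs.StringVar(&o.User, "user", "", "STOMP broker user name")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	// Assumption: the queue has no usable default, so an empty value is rejected.
	if o.Queue == "" {
		return o, errors.New("a -queue must be provided")
	}
	return o, nil
}

func main() {
	o, err := parseOptions(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("broker %s:%d, queue %q, ack mode %q\n", o.Host, o.Port, o.Queue, o.Ack)
}
```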

Environment Variables

Environment Variable | Required | Default | Description
---------------------|----------|---------|------------
DERIVATIVE_HANDLER_CONFIG | no | "" (the empty string) | Absolute path to the application configuration file. See the Handler Configuration below for how this env var is used.
DERIVATIVE_DIAL_TIMEOUT_SECONDS | no | 30 seconds | Attempts to connect to the message broker will fail after DERIVATIVE_DIAL_TIMEOUT_SECONDS. If the broker starts up slowly, this timeout may need to be increased.
DRUPAL_JWT_PUBLIC_KEY | no | "" (the empty string) | The PEM-encoded RSA public key used to authenticate Drupal-issued JSON web tokens. If no value is provided, JWTs cannot be validated. This may cause the application to reject messages depending on the configuration of the JWTHandler.
DRUPAL_JWT_PRIVATE_KEY | no | "" (the empty string) | The PEM-encoded RSA private key used by Drupal to sign JSON web tokens. Currently this variable is unused, because Drupal signs with RS256, an asymmetric algorithm, and verification needs only the public key. DRUPAL_JWT_PRIVATE_KEY would only be used with a symmetric signing algorithm such as HS256.
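
To make the DERIVATIVE_DIAL_TIMEOUT_SECONDS behavior concrete, the following is a small sketch of how the variable might be interpreted. The dialTimeout function is a hypothetical helper, and falling back to the 30 second default on an empty or invalid value is an assumption of this sketch rather than documented behavior.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// defaultDialTimeout matches the documented default of 30 seconds.
const defaultDialTimeout = 30 * time.Second

// dialTimeout converts the raw environment value into a duration.
// Assumption: empty, non-numeric, or non-positive values fall back to the default.
func dialTimeout(raw string) time.Duration {
	secs, err := strconv.Atoi(raw)
	if err != nil || secs <= 0 {
		return defaultDialTimeout
	}
	return time.Duration(secs) * time.Second
}

func main() {
	timeout := dialTimeout(os.Getenv("DERIVATIVE_DIAL_TIMEOUT_SECONDS"))
	fmt.Println("broker dial timeout:", timeout)
}
```

A duration obtained this way could then be handed to something like net.DialTimeout when opening the broker connection.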

Handler Configuration

Handlers are configured in a JSON file. The application embeds a default configuration, so no external configuration file is needed if the defaults are suitable.

Configuration is specified (in order of decreasing precedence):

  • by the -config command argument
  • by the DERIVATIVE_HANDLER_CONFIG environment variable
  • default embedded configuration
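
The precedence above can be sketched as a small function. The name configPath and its return values are hypothetical; the sketch only shows the ordering of the three sources, not how the repository loads the file.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// configPath picks the configuration source in order of decreasing precedence:
// the -config argument, then DERIVATIVE_HANDLER_CONFIG, then the embedded default.
// external is false when the embedded configuration should be used.
func configPath(flagValue, envValue string) (path string, external bool) {
	switch {
	case flagValue != "":
		return flagValue, true
	case envValue != "":
		return envValue, true
	default:
		return "", false
	}
}

func main() {
	flagValue := flag.String("config", "", "Path to handler configuration file")
	flag.Parse()

	path, external := configPath(*flagValue, os.Getenv("DERIVATIVE_HANDLER_CONFIG"))
	if external {
		fmt.Println("using external configuration:", path)
	} else {
		fmt.Println("using embedded default configuration")
	}
}
```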

The embedded default configuration is shown below (the current version is maintained in the repository):

{
  "jwt-logger": {
    "handler-type": "JWTLoggingHandler",
    "order": 10
  },
  "jwt": {
    "handler-type": "JWTHandler",
    "order": 30,
    "requireTokens": true,
    "verifyTokens": true
  },
  "convert": {
    "handler-type": "ImageMagickHandler",
    "order": 50,
    "commandPath": "/usr/local/bin/convert",
    "defaultMediaType": "image/jpeg",
    "acceptedFormats": [
      "image/jpeg",
      "image/png",
      "image/tiff",
      "image/jp2"
    ]
  },
  "ffmpeg": {
    "handler-type": "FFMpegHandler",
    "order": 60,
    "commandPath": "/usr/local/bin/ffmpeg",
    "defaultMediaType": "video/mp4",
    "acceptedFormatsMap": {
      "video/mp4": "mp4",
      "video/x-msvideo": "avi",
      "video/ogg": "ogg",
      "audio/x-wav": "wav",
      "audio/mpeg": "mp3",
      "audio/aac": "m4a",
      "image/jpeg": "image2pipe",
      "image/png": "png_image2pipe"
    }
  },
  "tesseract": {
    "handler-type": "TesseractHandler",
    "order": 70,
    "commandPath": "/usr/local/bin/tesseract"
  },
  "pdf2txt": {
    "handler-type": "Pdf2TextHandler",
    "order": 80,
    "commandPath": "/usr/local/bin/pdftotext"
  }
}

Each handler is configured with a unique key, type, and a positive integer that reflects the overall order in which it is invoked.

Handlers may be customized by creating a configuration file based on the embedded configuration shown above. The embedded configuration ought to be copied to a file and edited as needed. To use the external configuration, either create an environment variable named DERIVATIVE_HANDLER_CONFIG with the absolute path to the configuration, or supply the absolute path to the configuration on the command line as an argument to -config.
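
As a rough illustration of the configuration format, the sketch below decodes the fields that every handler entry shares (its key, handler-type, and order) and sorts the entries by order. The handlerEntry type and orderedHandlers function are hypothetical, and rejecting a non-positive order is an assumption drawn from the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// handlerEntry holds the settings common to every handler in the configuration.
type handlerEntry struct {
	Key   string
	Type  string
	Order int
}

// orderedHandlers parses a handler configuration and returns its entries
// sorted by ascending order. Handler-specific settings are ignored here.
func orderedHandlers(raw []byte) ([]handlerEntry, error) {
	var doc map[string]struct {
		Type  string `json:"handler-type"`
		Order int    `json:"order"`
	}
	if err := json.Unmarshal(raw, &doc); err != nil {
		return nil, err
	}
	entries := make([]handlerEntry, 0, len(doc))
	for key, h := range doc {
		if h.Order <= 0 {
			return nil, fmt.Errorf("handler %q: order must be a positive integer", key)
		}
		entries = append(entries, handlerEntry{Key: key, Type: h.Type, Order: h.Order})
	}
	sort.Slice(entries, func(i, j int) bool { return entries[i].Order < entries[j].Order })
	return entries, nil
}

func main() {
	raw := []byte(`{
	  "convert": {"handler-type": "ImageMagickHandler", "order": 50},
	  "jwt": {"handler-type": "JWTHandler", "order": 30, "requireTokens": true},
	  "jwt-logger": {"handler-type": "JWTLoggingHandler", "order": 10}
	}`)
	entries, err := orderedHandlers(raw)
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		fmt.Printf("%d %s (%s)\n", e.Order, e.Key, e.Type)
	}
}
```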

Handlers

Handlers are responsible for performing some action based on a received message. For example, the ImageMagickHandler produces a derivative image and PUTs it back to Drupal, while the JWTHandler verifies tokens issued by Drupal.

Handlers are invoked in a chain according to the order specified in the configuration. Order matters for two reasons: 1) to ensure secure processing, the handler that verifies JWTs ought to run before any handler that generates a derivative, and 2) state produced by one handler may be passed to the remaining handlers, so Handler B depends on Handler A if it relies on state that Handler A adds. Handlers should generally perform their actions and return a nil error, allowing the remaining handlers in the chain to execute.

If the handler chain executes without error, the message is acknowledged. If any handler returns a non-nil error, the chain terminates and the message is negatively acknowledged (nacked). The broker may attempt redelivery at some future time.
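
The chain semantics can be sketched in a few lines of Go. Message, Handler, HandlerFunc, and runChain are hypothetical stand-ins, not the repository's actual types; the sketch only shows that handlers run in order, share state, and that the first error stops the chain and leads to a nack.

```go
package main

import (
	"errors"
	"fmt"
)

// Message is a stand-in for a received broker message plus chain state.
type Message struct {
	Body  string
	State map[string]string // state passed from one handler to the next
}

// Handler performs some action on a message; a non-nil error stops the chain.
type Handler interface {
	Handle(m *Message) error
}

// HandlerFunc adapts a plain function to the Handler interface.
type HandlerFunc func(m *Message) error

func (f HandlerFunc) Handle(m *Message) error { return f(m) }

var errInvalidToken = errors.New("invalid token")

// runChain invokes handlers in order. It reports true (ack) if every handler
// returned nil, and false (nack) together with the first error otherwise.
func runChain(chain []Handler, m *Message) (bool, error) {
	for _, h := range chain {
		if err := h.Handle(m); err != nil {
			return false, err
		}
	}
	return true, nil
}

func main() {
	verify := HandlerFunc(func(m *Message) error {
		if m.Body == "" {
			return errInvalidToken
		}
		m.State["verified"] = "true"
		return nil
	})
	derive := HandlerFunc(func(m *Message) error {
		fmt.Println("generating derivative, verified =", m.State["verified"])
		return nil
	})

	m := &Message{Body: "payload", State: map[string]string{}}
	acked, err := runChain([]Handler{verify, derive}, m)
	fmt.Println("acked:", acked, "err:", err)
}
```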

Docker Image

This repository provides a minimal Docker image with the binary ./derivative-ms as its ENTRYPOINT; command line arguments are passed to docker run:

$ docker run --rm local/derivative-ms -host stomp-broker.example.org -user moo -pass moo -queue barn

Motivation

The rewrite was motivated by the unpredictable scaling and behavior of the PHP-based Islandora microservices.

Islandora microservices are serial: they process one message at a time from their respective queues until the queues are empty. Besides making a queue slow to drain, a large ingest by a content administrator could enqueue so many requests that their JWTs expire before the messages have a chance to be processed. A work-around is to create messages with JWTs that expire far into the future, but the real solution is to implement JWT renewal and scale up the microservices.

The Islandora microservices can scale in a couple of ways:

  • a single microservice could process multiple messages concurrently
  • n instances of a microservice could be provisioned, each instance processing messages serially.

The former approach requires a potentially expensive cloud instance which may not always be busy. The latter approach could be implemented on smaller compute instances, and n could be raised or lowered based on load.

Alpaca is the component in the Islandora architecture that is responsible for handling messages and dispatching them to the PHP-based microservices. It's based on Apache Karaf, and uses Camel to process messages. Scaling ought to work by creating multiple instances of the PHP microservices in Docker, and instantiating multiple instances of their respective Camel contexts in Alpaca. This works, kind of. It's clear from the ActiveMQ console that some round-robining of requests occurs, spreading the load across the PHP microservices, but it doesn't behave as expected (e.g. one of the microservices will receive the majority of the requests, and Alpaca will not immediately remove a message from the queue despite microservice instances being free and ready to work).

Since Karaf and Camel are based on old paradigms and impenetrable logic, and result in behaviors that are hard to understand, the microservices were re-written in Go, eliminating Karaf and Camel from the architecture.

Architecture

Alpaca is not used in this architecture. The microservices in this repository communicate directly with the message broker (ActiveMQ). Reliably scaling them is as easy as starting another instance of the microservice, reading from the same queue. The microservices compete for messages on the queue. If the queue is deep, scale up by increasing the number of microservices. If the queue is shallow, scale down.
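
The competing-consumer idea can be illustrated in-process, with a Go channel standing in for the ActiveMQ queue and goroutines standing in for microservice instances. This is only an analogy under those assumptions (the consume function is hypothetical, and no broker is involved): each message is received by exactly one worker, and adding workers spreads the load.

```go
package main

import (
	"fmt"
	"sync"
)

// consume starts n workers that compete for messages on a shared queue and
// returns how many messages each worker processed once the queue is drained.
func consume(queue <-chan string, n int) []int {
	counts := make([]int, n)
	var wg sync.WaitGroup
	for w := 0; w < n; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for range queue {
				counts[id]++ // each message is delivered to exactly one worker
			}
		}(w)
	}
	wg.Wait()
	return counts
}

func main() {
	queue := make(chan string, 10)
	for i := 0; i < 10; i++ {
		queue <- fmt.Sprintf("message-%d", i)
	}
	close(queue)

	// How the messages are split between workers varies from run to run.
	fmt.Println("messages per worker:", consume(queue, 3))
}
```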

The code for all the microservices exists in this repository. Each microservice is implemented as an instance of Handler. Handlers respond to messages based on their message destination (i.e. their ActiveMQ queue). So the ImageMagickHandler responds to the Houdini queue, the FFMpegHandler responds to the Homarus queue, and so forth. The Islandora mental model of "the Houdini microservice processes images" or "Homarus processes video" is maintained.

An instance of the microservice can only listen for messages on a single queue. So while the command-line binary possesses the code necessary for handling any message from any queue, a specific instance will only handle messages from a single queue. The only difference between an instance of the Houdini microservice and the Homarus microservice will be the queue that they listen to.
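
One way to picture destination-based handling is a lookup from queue name to handler type, as in the sketch below. The queueHandlers map and responds function are hypothetical. The homarus queue name comes from the usage text; the houdini queue name and the "/queue/" destination prefix are assumptions of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// queueHandlers maps a queue name to the handler type that acts on it.
// Assumption: the houdini queue follows the same naming pattern as homarus.
var queueHandlers = map[string]string{
	"islandora-connector-houdini": "ImageMagickHandler",
	"islandora-connector-homarus": "FFMpegHandler",
}

// responds reports whether handlerType should act on a message from destination.
// A leading "/queue/" prefix, if present, is stripped before the lookup.
func responds(handlerType, destination string) bool {
	name := strings.TrimPrefix(destination, "/queue/")
	want, ok := queueHandlers[name]
	return ok && want == handlerType
}

func main() {
	dest := "/queue/islandora-connector-homarus"
	for _, h := range []string{"ImageMagickHandler", "FFMpegHandler"} {
		fmt.Printf("%s handles %s: %v\n", h, dest, responds(h, dest))
	}
}
```

Under a scheme like this, a message from a queue that no handler recognizes passes through the whole chain untouched.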

Message Handling

Islandora microservices are idempotent, so at-least-once messaging semantics are adequate. If a duplicate message is received, the worst thing that happens is the generation of an identical derivative. If a message is lost or rejected, then a derivative (e.g. a thumbnail or service copy) will be missing from the object's page in Islandora.

If a Handler returns an error, then the message will be nacked. Attempts to redeliver the message will be made over the next five minutes, in case the error was transient (or fixed). However, if all redelivery attempts result in error, the message will go to the ActiveMQ dead letter queue (named ActiveMQ.DLQ), and no derivative will be generated. The message is not strictly lost, as it is in the DLQ, but this microservice prototype does not provide any means to process messages in the DLQ. Effectively the DLQ provides a mechanism for observing failures, but doesn't provide means to re-process those messages.

TODOs

There are a number of TODOs, but the prototype is mature enough for demonstration purposes.

  • Debugging output: it would be nice to put a microservice in debug mode and capture stderr. The microservice would have a micro-frontend that would allow viewing of the debug output.
  • Dead letter queue processing: Re-processing messages from the DLQ would be nice, but the best that we may be able to do is output a log message to stdout, surfacing messages to graylog, for example.
  • FITS microservice: the FITS microservice needs to be implemented.
  • JWT refresh: it would be nice to implement JWT refresh. To my knowledge, this is not supported by Drupal, so in effect a "refresh" would mean having Drupal issue a new key to the microservice, which is basically a stand-in for Basic Auth (you'd have to use Basic Auth to get a JWT, so why not just use Basic Auth when communicating with Drupal?). So at this point the best defense against expiring keys is either to scale up the microservices to ensure messages are processed within the JWT expiry window, or simply to use Basic Auth when communicating with Drupal and skip the use of keys. As far as I know, none of the claims provided in the JWT are used by the microservices.
  • It is possible for a message to not be handled by any handler. This results in the message being acked anyway. Bug?
  • Test coverage: there are no tests (eep)
  • Tesseract and pdftotext handlers are not well-exercised and may contain bugs
  • Debugging statements and files (e.g. capture of cli stderr) abound
  • Specify active handlers by key on the command line

About

Derivative microservices based in Go.
