Jonathan Meyer edited this page Sep 4, 2019 · 4 revisions

Scale captures and tracks a considerable volume of job and system metadata as a part of processing. This data ultimately resides in PostgreSQL and is updated via the Message Handler and Scheduler components. Using PostgreSQL along with PostGIS allows us to support spatial data storage and complex JSON blob querying within our relational data store.

The default Scale deployment includes a Postgres instance, run from a community Docker image, for demonstration purposes only. This instance has no storage persistence or failover support, making it unsuitable for any long-term use. For production, we recommend a managed Postgres offering such as Amazon RDS or Azure Database for PostgreSQL. Configuring a Postgres database can be an involved process, with considerable room for optimization as the size of your database grows.

This page focuses on configuring Scale to connect to various database backends. A single configuration setting, the DATABASE_URL environment variable, tells the various system components of a Scale deployment which database to use. Its value follows the syntax defined by the dj-database-url project. In short, it takes the form:

postgis://user:password@host[:port]/name
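
To make the parts of this form concrete, the following is a minimal sketch that splits a DATABASE_URL using only the Python standard library. It is not how Scale or dj-database-url parses the value. The helper name `parse_database_url`, the example host `db.example.com`, and the fallback to PostgreSQL's default port 5432 when no port is given are assumptions made for illustration.

```python
from urllib.parse import urlsplit, unquote


def parse_database_url(url):
    """Split a DATABASE_URL into its components (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.scheme != "postgis":
        raise ValueError(f"expected the postgis scheme, got {parts.scheme!r}")
    name = parts.path.lstrip("/")
    if not parts.hostname or not name:
        raise ValueError("DATABASE_URL needs both a host and a database name")
    return {
        # Credentials may be percent-encoded inside a URL, so decode them.
        "user": unquote(parts.username or ""),
        "password": unquote(parts.password or ""),
        "host": parts.hostname,
        # Assumption: fall back to the PostgreSQL default port when omitted.
        "port": parts.port or 5432,
        "name": name,
    }


# Example host is a placeholder, not a real Scale endpoint.
settings = parse_database_url("postgis://scale-user:scale-pass@db.example.com/scale")
print(settings["host"], settings["port"], settings["name"])
```

A check like this can catch a malformed value, such as a missing database name, before Scale is launched with it.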

Note: The user must have full access to a scale schema and a silo schema within the named database, where all respective tables will be created. The PostGIS extension is also required in that database.

Marathon-hosted Postgres

While Scale will deploy a local Postgres cluster automatically during launch if DATABASE_URL is unset, this should never be relied on for anything beyond demonstration purposes. The default Postgres deployment has no mounted persistent storage, so all Scale configuration and data will be lost if the container restarts.

The following sample marathon.json is a reasonable starting point for a Postgres instance with persistent storage:

{
  "id": "/scale-persistent-db",
  "instances": 1,
  "mem": 512,
  "gpus": 0,
  "cpus": 0.5,
  "disk": 0,
  "container": {
    "portMappings": [
      {
        "containerPort": 5432,
        "labels": {
          "VIP_0": "//scale-persistent-db:5432"
        },
        "protocol": "tcp"
      }
    ],
    "type": "DOCKER",
    "volumes": [
      {
        "persistent": {
          "size": 10240
        },
        "mode": "RW",
        "containerPath": "/var/lib/postgresql/data"
      }
    ],
    "docker": {
      "image": "mdillon/postgis:9.5-alpine",
      "forcePullImage": true
    }
  },
  "networks": [
    {
      "mode": "container/bridge"
    }
  ],
  "env": {
    "POSTGRES_DB": "scale",
    "POSTGRES_PASSWORD": "scale-pass",
    "POSTGRES_USER": "scale-user"
  },
  "healthChecks": [
    {
      "gracePeriodSeconds": 300,
      "intervalSeconds": 30,
      "maxConsecutiveFailures": 3,
      "portIndex": 0,
      "timeoutSeconds": 15,
      "delaySeconds": 15,
      "protocol": "MESOS_TCP",
      "ipProtocol": "IPv4"
    }
  ],
  "residency": {
    "relaunchEscalationTimeoutSeconds": 10,
    "taskLostBehavior": "WAIT_FOREVER"
  }
}

The above configuration creates a 10 GiB persistent storage volume and pins Postgres to the node that hosts it. This protects you from data loss as long as that node remains in your cluster. Setting up a truly fault-tolerant, local Postgres cluster is outside the scope of this guide.
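
Because data loss depends entirely on whether the persistent volume is present, it can be worth checking an app definition before submitting it. The sketch below is one possible check, assuming the definition has been loaded into a Python dictionary. The helper name `persistent_volume_mib` is hypothetical, and the dictionary shown is a trimmed copy of the sample above rather than a complete definition.

```python
POSTGRES_DATA_PATH = "/var/lib/postgresql/data"


def persistent_volume_mib(app, container_path=POSTGRES_DATA_PATH):
    """Return the size in MiB of the persistent volume at container_path, or None."""
    for volume in app.get("container", {}).get("volumes", []):
        if volume.get("containerPath") == container_path and "persistent" in volume:
            return volume["persistent"].get("size")
    return None


# Trimmed copy of the sample marathon.json, keeping only the fields checked here.
app = {
    "id": "/scale-persistent-db",
    "container": {
        "volumes": [
            {
                "persistent": {"size": 10240},
                "mode": "RW",
                "containerPath": "/var/lib/postgresql/data",
            }
        ]
    },
}

size = persistent_volume_mib(app)
if size is None:
    print("no persistent volume: data will be lost on restart")
else:
    # Marathon expresses persistent volume sizes in MiB.
    print(f"{size / 1024:.0f} GiB persistent volume")
```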

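To configure Scale to use the Postgres instance deployed as described above, set the DATABASE_URL environment variable so that its user, password, and database name match the POSTGRES_USER, POSTGRES_PASSWORD, and POSTGRES_DB values in the app definition.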
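
As a hedged sketch, a matching DATABASE_URL can be assembled from the env block of the sample app definition. The helper name `build_database_url` is hypothetical. The host shown is an assumption based on the DC/OS named-VIP convention (`<name>.marathon.l4lb.thisdcos.directory`) and may not match your cluster, so substitute whatever address actually resolves to the Postgres task.

```python
from urllib.parse import quote


def build_database_url(env, host, port=5432):
    """Assemble a postgis:// URL from Postgres container settings (hypothetical helper)."""
    # Percent-encode credentials so characters such as '@' or ':' stay unambiguous.
    user = quote(env["POSTGRES_USER"], safe="")
    password = quote(env["POSTGRES_PASSWORD"], safe="")
    return f"postgis://{user}:{password}@{host}:{port}/{env['POSTGRES_DB']}"


# Values copied from the env block of the sample marathon.json.
env = {
    "POSTGRES_DB": "scale",
    "POSTGRES_PASSWORD": "scale-pass",
    "POSTGRES_USER": "scale-user",
}

# Assumption: DC/OS named-VIP address for the "scale-persistent-db" VIP label.
DB_HOST = "scale-persistent-db.marathon.l4lb.thisdcos.directory"

print(build_database_url(env, DB_HOST))
```

The sample credentials are placeholders and should be replaced with your own before any real use.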
