Back end Infrastructure
The back-end infrastructure for DAILP is built from a few different layers.
- Data storage: MongoDB three-node replica set deployed on AWS EC2
- Data access: GraphQL server written in Rust deployed on AWS Lambda
- Authentication: AWS Cognito user pool, allowing sign in and authenticated actions
- Data migration: Rust code that runs in GitHub Actions on-demand
- Web hosting: AWS Amplify deployment
We use Terraform for all infrastructure provisioning and back-end deployment because it flexibly handles complex deployments across multiple environments. Infrastructure deployments run in GitHub Actions on every commit to the main and dev git branches. These deploy to the library-prod and library-dev AWS accounts, respectively.
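The branch-to-account mapping above can be written down as a small lookup. This is only an illustrative sketch: the `Environment` type and its method names are hypothetical and do not come from the repository, where the mapping presumably lives in the GitHub Actions and Terraform configuration.

```rust
/// Deployment target for a git branch, following the mapping described above.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Environment {
    Prod,
    Dev,
}

impl Environment {
    /// Returns the environment a branch deploys to, or `None` for branches
    /// that do not trigger an infrastructure deployment.
    pub fn from_branch(branch: &str) -> Option<Self> {
        match branch {
            "main" => Some(Environment::Prod),
            "dev" => Some(Environment::Dev),
            _ => None,
        }
    }

    /// Name of the AWS account that receives the deployment.
    pub fn aws_account(self) -> &'static str {
        match self {
            Environment::Prod => "library-prod",
            Environment::Dev => "library-dev",
        }
    }
}
```

Under this sketch, a commit to any other branch maps to `None`, which matches the statement that only main and dev trigger deployments.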
Builds are driven by Nix, a purely functional package manager, because it makes builds perfectly reproducible across machines with pinned package versions and immutable build outputs. Nix makes building and running the exact same program in many places exceedingly easy. The build process is defined by flake.nix at the root of the repository, and terraform/main.nix is the entry-point to our Terraform module.
We also use terranix to unify our usage of Nix with Terraform, so that contributors only have to learn the Nix expression language rather than both Nix expressions and the Terraform DSL.
Finally, the MongoDB servers we deploy on EC2 run NixOS, a Linux distribution powered by the Nix package manager, whose entire state can be represented in one config file. Any configuration change is applied as a rolling upgrade that rolls back if it fails.
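To make the idea of a rolling upgrade with rollback concrete, here is a toy model in Rust. It is a sketch under assumptions, not the project's mechanism: in practice NixOS and Terraform do this work, and the `Node` type, the `rolling_upgrade` function, and the choice to restore every node touched so far (rather than only the failing one) are all hypothetical.

```rust
/// One replica-set node and the configuration generation it is running.
/// (Hypothetical model, for illustration only.)
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Node {
    pub name: String,
    pub generation: u32,
}

/// Upgrades nodes one at a time. If `healthy` rejects a node after its
/// upgrade, every node touched so far is restored to its previous generation.
pub fn rolling_upgrade<F>(nodes: &mut [Node], target: u32, healthy: F) -> Result<(), String>
where
    F: Fn(&Node) -> bool,
{
    // Previous generations of the nodes upgraded so far, in upgrade order.
    let mut previous: Vec<u32> = Vec::new();
    for i in 0..nodes.len() {
        previous.push(nodes[i].generation);
        nodes[i].generation = target;
        if !healthy(&nodes[i]) {
            let failed = nodes[i].name.clone();
            // Roll back in reverse order, including the node that failed.
            for (j, old) in previous.iter().enumerate().rev() {
                nodes[j].generation = *old;
            }
            return Err(format!("upgrade failed on {}; rolled back", failed));
        }
    }
    Ok(())
}
```

Upgrading one node at a time means the remaining nodes keep serving while a node is being changed, which is why this pattern suits a three-node replica set.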
Our data access server running GraphQL is deployed on AWS Lambda. This means that it isn't constantly running, but only runs when there are active requests. Multiple instances may run at once to meet demand. Code changes in the types or graphql directories will trigger a rebuild of the GraphQL server and deploy it when the changes are committed to dev or main.
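The rebuild rule above amounts to asking whether a commit touched either source directory. The sketch below expresses that rule as a path filter. It is an assumption made for illustration: the function and constant names are hypothetical, and the real trigger is presumably Nix's own input tracking rather than an explicit path check.

```rust
/// Directories whose contents feed the GraphQL server build (from the text above).
const GRAPHQL_SOURCE_DIRS: [&str; 2] = ["types", "graphql"];

/// Returns true if any changed path lies inside one of the source directories.
/// (Hypothetical helper, for illustration only.)
pub fn needs_graphql_rebuild(changed_paths: &[&str]) -> bool {
    changed_paths.iter().any(|path| {
        // Compare only the first path component, so that a sibling directory
        // whose name merely starts with "graphql" does not match.
        match path.split_once('/') {
            Some((first, _rest)) => GRAPHQL_SOURCE_DIRS.contains(&first),
            None => false,
        }
    })
}
```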
The following is a list of resources that I've used while setting up the infrastructure on this project.
- The Rust Book: One of the best starting resources for learning the Rust programming language
- 1 page intro to Nix expression language
- Introduction to Nix flakes
- Deploying NixOS using Terraform
- You don't need that bastion host: Blog post describing how to secure AWS EC2 machines in a public subnet rather than routing SSH through a "bastion host"
We chose Terraform for the following reasons.
- Terraform doesn't rely on AWS CloudFormation, meaning we can use the same deployment process for service providers other than AWS.
- Wide community support and tons of modules, especially compared to the Serverless framework. Serverless is easy at first, but doesn't have that many extension points for infrastructure deployments.
- We use Nix for builds, meaning that it does change detection for us and each build is paired with a unique identifying hash. Because of this, we don't need our deployment framework to detect changes and build our code (Serverless does this). Each unique build will be at a different path, so we can just point Terraform to the newest build path. Based on just that information, Terraform redeploys exactly the resources that need to be updated.
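The last point, that each build gets a path derived from a hash of its inputs and that the deployment only needs to compare paths, can be sketched as follows. This is a simplified model under assumptions: `DefaultHasher` stands in for Nix's cryptographic hashing, the `/builds/` prefix is made up and is not the Nix store layout, and both function names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derives a build output path from the build inputs. Identical inputs give
/// an identical path; any change to an input gives a different one.
/// (Stand-in for Nix's hashing; the path format is hypothetical.)
pub fn build_path(name: &str, inputs: &[(&str, &str)]) -> String {
    let mut hasher = DefaultHasher::new();
    name.hash(&mut hasher);
    inputs.hash(&mut hasher);
    format!("/builds/{:016x}-{}", hasher.finish(), name)
}

/// Redeploy only when the newest build path differs from the deployed one.
pub fn needs_redeploy(deployed: Option<&str>, newest: &str) -> bool {
    deployed != Some(newest)
}
```

Because the path already encodes whether anything changed, the deployment tool needs no change detection of its own: an unchanged path means there is nothing to do.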
Using Nix and NixOS secures many guarantees for a project.
- We can rely on reproducible deployments across machines
- Deploying to a new environment called staging (in addition to dev and prod) would take little more manual effort than making a new Git branch.
- New contributors only have to install Nix on their system to start developing, and then Nix will handle getting the right Rust version (and associated tooling) and making a development shell to work in. This eliminates any confusion about needing certain dependency versions or programs running differently on macOS vs Linux, etc.
- We don't need something like Docker because NixOS already ensures exact package versions and consistently identical deployments. If we need containers, NixOS provides native containers that are managed alongside top-level system configuration.
We chose MongoDB for data storage for the following reasons.
- The schema-less model allows us to enforce validation and strict structure in back-end code, making data definition and migration more flexible.
- There is a lot of tooling for MongoDB since it is widely used.
- The document-oriented model facilitates building and querying nested structures of any depth. This maps neatly onto functional data structures compared to SQL tables.
- Data is stored in BSON (binary JSON), a format that already has plenty of tools for manipulating it.
- Mongo databases can be split into many nodes which replicate and/or shard data, balancing throughput across nodes.
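The first point, enforcing structure in back-end code instead of in the database, can be illustrated by decoding a loosely typed document into a strict Rust type. This is a minimal sketch using only the standard library: the `Value` enum is a stand-in for what a database driver returns, and the `Word` type with its `source`, `segments`, and `commentary` fields is hypothetical. A real implementation would more likely rely on a driver and a serialization library.

```rust
use std::collections::HashMap;

/// A loosely typed value, standing in for what a schema-less store returns.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Text(String),
    Number(i64),
    List(Vec<Value>),
    /// Documents can nest to any depth.
    Doc(HashMap<String, Value>),
}

/// The strict shape the back-end code works with (hypothetical fields).
#[derive(Debug, PartialEq)]
pub struct Word {
    pub source: String,
    pub segments: Vec<String>,
    /// Imagined as added later; older documents lack it and decode to `None`.
    pub commentary: Option<String>,
}

/// Validates a raw document and converts it into a `Word`.
pub fn decode_word(doc: &HashMap<String, Value>) -> Result<Word, String> {
    let source = match doc.get("source") {
        Some(Value::Text(s)) if !s.is_empty() => s.clone(),
        _ => return Err("`source` must be a non-empty string".to_string()),
    };
    let segments = match doc.get("segments") {
        Some(Value::List(items)) => items
            .iter()
            .map(|item| match item {
                Value::Text(s) => Ok(s.clone()),
                other => Err(format!("segment is not a string: {:?}", other)),
            })
            .collect::<Result<Vec<_>, _>>()?,
        _ => return Err("`segments` must be a list".to_string()),
    };
    // A missing field is fine; a present field of the wrong type is not.
    let commentary = match doc.get("commentary") {
        None => None,
        Some(Value::Text(s)) => Some(s.clone()),
        Some(_) => return Err("`commentary` must be a string".to_string()),
    };
    Ok(Word { source, segments, commentary })
}
```

In this sketch, adding the optional `commentary` field required no change to stored documents, which is the kind of migration flexibility the list describes, while malformed data is still rejected at the boundary.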
The following are my notes on several alternative data storage solutions that we have considered but not used thus far.
- Online Linguistic Database
  - DAILP started out with the OLD, but we outgrew its limitations.
  - Oriented toward fieldwork, storing forms as an unstructured SQL table, leaving out tools necessary for dealing with structured historical texts and associated multimedia.
  - We don't get much out of Dative, a web GUI for managing OLD data, because it doesn't facilitate community collaboration.
  - We do take inspiration from some of the internal structure of forms in OLD, since they are well thought out by a field linguist, but shift the focus to dealing with texts and connecting lexical sources together.
  - Relies on a SQL database, carrying any benefits or caveats that may entail.
- ArangoDB: I have considered switching to ArangoDB because
  - It is multi-model, natively supporting both nested document structures, as MongoDB does, and graph relations, which are unique to this type of database. This could make modeling community contributions, collaboration, and word relationships easier than with MongoDB or PostgreSQL.
  - ArangoDB does have a smaller user base, but it is very straightforward to learn once you understand MongoDB and are comfortable with JSON.
  - It uses its own query language, AQL, which is pretty similar to SQL but uses JSON, branching constructs familiar from JavaScript, and native graph traversal queries. This is more powerful than MongoDB's queries and aggregations, but may have a slightly steeper learning curve.
  - Schemas are supported, using standard JSON schema.
  - Compared to MongoDB, it has a more solid open source commitment.
  - Tooling: Rust ODM for ArangoDB, NixOS service example
- PostgreSQL: The most popular open-source database, built on SQL.
  - It is relational, which also means that it favors a flat structure with many tables that refer to each other. Nested structures, especially of arbitrary depth, are more difficult to model in this system.
  - JSON fields are supported, but they are second-class and require special handling to query and update.
  - Schema management is required, and table fields have stricter requirements than in MongoDB. Since we're routing all DB management through Rust, we can rely on that code to enforce type safety and consistency, especially since that's where data migrations will run from anyway.
  - NoSQL makes minor type changes like field additions and removals fairly trivial by obviating the need for an explicit migration, whereas PostgreSQL would need an explicit migration for every data model change.
  - Tooling: ORM for Rust
- BaseX: XML database engine and associated tooling.
  - This choice would only make sense if we were using XML across the stack, because it otherwise doesn't fit well into other DB models.
  - One queries a BaseX instance using XQuery, which is great if you're already using XQuery for manipulating XML documents.
  - Validates everything with XML schemas, which again is fantastic if that's already part of your stack.

Since we were starting a new project that will need to deal with various kinds of user data, it made more sense to go with a more established database platform that has better library support from modern programmatic tooling, like Rust, GraphQL, and JavaScript.