Data Workspace - a PostgreSQL-based open source data analysis platform
This is the entry-point repository for Data Workspace, a PostgreSQL-based open source data analysis platform with features for users with a range of technical skills. It contains a brief catalogue of all Data Workspace repositories (below), the source for the Data Workspace developer documentation, and the Terraform code to deploy Data Workspace into AWS.
Tip
Looking for the Data Workspace Django application? It's now in the data-workspace-frontend repo.
The components of Data Workspace are stored across several Git repositories.
-
data-workspace (this repository)
Contains the Terraform code to deploy Data Workspace in AWS, and the public-facing developer documentation for Data Workspace. See Contents of this repository for details of what goes where.
-
data-workspace-frontend
Contains the core Django application that defines most of the user-facing components of Data Workspace. Also contains "the proxy" that sits in front of the Django application, integrates with SSO, and routes requests, for example to tools.
Also contains the Dockerfiles for other components. However, it's planned to move these out to separate repositories.
-
Contains the definitions of the on-demand tools that users can launch in Data Workspace.
-
Contains the definitions of MLflow, an MLOps tool.
-
Contains the definitions of Superset, a dashboarding tool.
-
Contains the definitions of GitLab, which stores code and runs CI pipelines.
-
Contains the definitions of ArangoDB, a graph database.
Some of the components of Data Workspace are lower level and less specific to Data Workspace. They can, at least theoretically, be reused outside of Data Workspace.
-
Used to synchronise permissions between the data-workspace-frontend metadata database and users in the main PostgreSQL database.
-
Used in on-demand tools to sync users' files with S3.
-
Used in tools to filter and rewrite DNS requests.
-
Used in Theia to give reasonably straightforward access to a PostgreSQL database.
-
mirror-git-to-s3
git-lfs-http-mirror
Used to mirror Git repositories that use Large File Storage (LFS) to S3, and then to access them from inside tools.
-
Used to deploy Data Workspace from Jenkins.
-
quicksight-bulk-update-datasets
A CLI script to make bulk updates to Amazon QuickSight datasets.
These components are usually used to ingest data into the PostgreSQL database that's the core of Data Workspace.
-
pg-bulk-ingest
pg-force-execute
Used to ingest large amounts of data into the PostgreSQL database.
-
Used in several ways to convert from iterables of bytes to a file-like object for memory-efficient data ingestion, for example when parsing CSVs.
-
Used to extract data from archives in a format that requires running an external program.
-
Used to extract data from Open Document Spreadsheet (ODS) files in a memory-efficient and disk-efficient way.
-
Used to extract data from ZIP files in a memory-efficient and disk-efficient way.
-
Used to ingest data from Companies House.
-
Used to generate large and complex SQLite files that are then ingested into the Data Workspace PostgreSQL database.
-
Used to power a simple API that accepts incoming data files in any format and drops them in S3, from where they are subsequently ingested into Data Workspace.
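The conversion from an iterable of bytes to a file-like object mentioned in the list above can be illustrated with a minimal sketch. It uses only the Python standard library and is not the API of any of the listed components; the names `IterableReader` and `csv_rows` are hypothetical and exist only for this example.

```python
import csv
import io
from typing import Iterable, Iterator, List


class IterableReader(io.RawIOBase):
    """Hypothetical helper: exposes an iterable of bytes chunks as a raw stream."""

    def __init__(self, chunks: Iterable[bytes]) -> None:
        self._chunks: Iterator[bytes] = iter(chunks)
        self._leftover = b""

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        # Pull chunks until there is data to hand out or the iterable ends
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # zero bytes read signals end of file
        n = min(len(buffer), len(self._leftover))
        buffer[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


def csv_rows(chunks: Iterable[bytes]) -> Iterator[List[str]]:
    """Parse CSV rows lazily, holding only a small buffer in memory."""
    raw = IterableReader(chunks)
    text = io.TextIOWrapper(io.BufferedReader(raw), encoding="utf-8", newline="")
    yield from csv.reader(text)


if __name__ == "__main__":
    # Chunk boundaries need not line up with row or field boundaries
    for row in csv_rows([b"id,na", b"me\n1,al", b"ice\n"]):
        print(row)
```

Because rows are yielded as chunks arrive, the whole file never has to be held in memory or written to disk, which is the property the ingestion components above rely on.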
These components are used when publishing data from Data Workspace.
-
Makes data available to the public.
-
Creates ZIP files in a memory-efficient and disk-efficient way.
-
Creates Open Document Spreadsheet (ODS) files in a memory-efficient and disk-efficient way.
-
Part of the system that makes data available to other internal applications.
-
The GitHub Actions workflows for this repository.
-
deploy-docs-to-github-pages.yml
On any change to the main branch (such as a merged PR), it builds the developer documentation in docs/, pushes it to GitHub Pages, and surfaces it at https://data-workspace.docs.trade.gov.uk/
-
On any PR against the main branch, or change of the main branch, it runs linting checks against the Terraform code to make sure it is consistently formatted.
-
-
A list of file patterns that are not committed to this repository by default during local development. For example, it contains the patterns that match temporary files created by Terraform when run locally, and the built documentation when building the documentation locally.
-
The source of the Data Workspace developer documentation. The documentation is built using the Node.js-based Eleventy static site generator and the X-GOVUK govuk-eleventy-plugin in order to use the GOV.UK Design System.
The built documentation is hosted on GitHub Pages.
-
package-lock.json
package.json
eleventy.config.js
Supporting files for building the Data Workspace developer documentation. The package.json file has the list of direct dependencies, package-lock.json has specific versions of all the direct and transitive Node.js dependencies, and eleventy.config.js contains the configuration.
-
The Terraform source to build the infrastructure of Data Workspace in Amazon Web Services (AWS).
-
The source of the file you're currently reading.
-
The list of code owners that can approve pull requests in this repository.