Data contracts


Overview

This project documents requirements and reference material for implementing data contracts from the perspective of data engineering on a modern data stack (MDS).

Data contracts are essential to decouple data producers from data consumers, while having both parties take responsibility for their respective parts.

Even though the members of the GitHub organization may be employed by some companies, they speak on their own behalf and do not represent those companies.

Specifications

Definitions

Definition by Atlan

A data contract outlines how data can be exchanged between two parties. It defines the structure, format, and rules of exchange in a distributed data architecture. These formal agreements make sure that there are no uncertainties or undocumented assumptions about data.

Definition by Charles Verleyen

Data contracts: API-based agreements

Without high-quality data, every analytics initiative will be underwhelming at best and actively damaging the business at worst. Data contracts are API-based agreements between producers and consumers designed to solve exactly that problem. Data contracts are not a new concept. They are simply new implementations of a very old idea: that producers and consumers should work together to generate high-quality, semantically valid data from the ground up.

Definition by David Jayatillake

In short, a Data Contract is an enforceable agreement on structure and format between the producer and consumer of data. You could even define it in a simpler way: a Data Contract is a guarantee on structure and format by a producer of data.
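To make this definition concrete, below is a minimal, hypothetical sketch in Python (not taken from any of the referenced projects) of a contract as an enforceable agreement on structure and format: the producer declares the expected fields and types, and a record that violates them is rejected before it ever reaches consumers.

```python
# Hypothetical contract for an "orders" dataset: field name -> expected type.
ORDERS_CONTRACT: dict[str, type] = {
    "order_id": str,
    "customer_id": str,
    "amount": float,
    "currency": str,
}

def contract_violations(record: dict, contract: dict[str, type]) -> list[str]:
    """Return the list of structural violations of the contract for one record."""
    violations = []
    for field, expected_type in contract.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            violations.append(
                f"{field}: expected {expected_type.__name__}, got {type(record[field]).__name__}"
            )
    return violations

# Producer-side enforcement: a record which breaks the agreement is never published.
record = {"order_id": "o-123", "customer_id": "c-42", "amount": 19.90, "currency": "EUR"}
violations = contract_violations(record, ORDERS_CONTRACT)
if violations:
    raise ValueError(f"Data contract violation(s): {violations}")
# publish(record)  # hypothetical downstream delivery, only reached when the contract holds
```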

Definition by Andrew James

Almost all data platforms start with a change data capture (CDC) service to extract data from an organisation's transactional databases, the source of truth for their most valuable data. That data is then transformed, joined, and aggregated to drive analysis, modelling, and other downstream services.

However, this data has not been designed for these use cases - it has been designed to satisfy the needs and requirements of the source applications and their day-to-day use. It can then take significant effort to transform, deduce, and derive the data in order to make it useful for downstream use cases.

Furthermore, breaking changes and data migrations will be a regular part of the application's evolution, and will be made without knowledge of how the data is used downstream, leading to breaking changes affecting key reports and data-driven products.

For downstream users to confidently build on this valuable data, they need to know the data they are using is accurate, complete, and future proof. This is the data contract.

References

Use cases

Web sites, blogs

Data contracts for the warehouse on Substack

Data products, Chad Sanderson on Substack

Data contracts demystified, by Atlan

  • Title: Data Contracts: The Key to Scaling Distributed Data Architecture and Reducing Data Chaos
  • Date: April 2023
  • Link to the web site: https://atlan.com/data-contracts/

Collection of articles

Astrafy end-to-end implementation of data contracts

Awesome data contracts

Books and articles

Data Contracts: the Mesh Glue

Data contracts for non-tech readers

Driving Data Quality with Data Contracts

Tables as Interfaces

DBT Model Contracts: Importance and Pitfalls

DBT implementing data contracts

Excerpts

dbt Core v1.5 is slated for release at the end of April, and it will include three new constructs:

  • Access: Choose which models ought to be “private” (implementation details, handling complexity within one team or domain) and “public” (an intentional interface, shared with other teams). Other groups and projects can only ref a model — that is, take a critical dependency on it — in accordance with its access.
  • Contracts: Define the structure of a model explicitly. If your model’s SQL doesn’t match the specified column names and data types, it will fail to build. Breaking changes (removing, renaming, retyping a column) will be caught during CI. On data platforms that support build-time constraints, ensure that columns are not null or pass custom checks while a model is being built, in addition to more flexible testing after.
  • Versions: A single model can have multiple versioned definitions, with the same name for downstream reference. When a mature model with an enforced contract and public access needs to undergo a breaking change, rather than breaking downstream queriers immediately, facilitate their migration by bumping the version and communicating a deprecation window.

In the future, individual teams will own their own data. Data engineering will own “core tables” or “conformed dimensions” that will be used by other teams. Ecommerce will own models related to site visits and conversion rate. Ops will own data related to fulfillment. Etc. Each of these teams will reference the public interfaces exposed by other teams as a part of their work, and periodically release upgrades as versions are incremented on upstream dependencies. Teams will review PRs for their own models, and so have more context for what “good” looks like. Monitoring and alerting will happen in alignment with teams and codebases, so there will be real accountability to delivering a high quality, high reliability data product. Teams will manage their own warehouse spend and optimize accordingly. And teams will be able to publish their own metrics all the way into their analytics tool of choice.
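As a rough illustration of what the contracts and versions constructs quoted above enforce, the following Python sketch (an illustration of the idea, not dbt's implementation or API) compares the columns actually produced by a model build against the declared contract and fails on removed, renamed, or retyped columns, the kind of breaking change that would otherwise only surface in downstream queries.

```python
# Declared contract for a model (hypothetical column names and types).
contracted_columns = {"order_id": "varchar", "amount": "numeric", "ordered_at": "timestamp"}

# Columns as actually produced by the model's SQL in this build (hypothetical).
built_columns = {"order_id": "varchar", "amount": "varchar", "ordered_at": "timestamp"}

breaking_changes = []
for column, declared_type in contracted_columns.items():
    if column not in built_columns:
        breaking_changes.append(f"removed or renamed column: {column}")
    elif built_columns[column] != declared_type:
        breaking_changes.append(
            f"retyped column {column}: contract declares {declared_type}, "
            f"build produced {built_columns[column]}"
        )

if breaking_changes:
    # In CI, this fails the change; an intentional breaking change would instead
    # bump the model version and announce a deprecation window to consumers.
    raise SystemExit("Contract check failed:\n" + "\n".join(breaking_changes))
```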

PayPal open sources its data contract templates

Data contracts, the missing foundation

An engineering guide to data creation and data quality, a data contract perspective

Data contracts for the warehouse

Data contracts wrapped 2022

Data contracts in practice

Anatomy of a Data Product

An Engineer's guide to Data Contracts

The production-grade Data Pipeline

Yet another post on Data Contracts

Fine, let us talk about data contracts

Data contracts - From zero to hero

Contracts have consequences

Data Person: Attorney At Law

The rise of data contracts

Interfaces and breaking stuff

Implementing Data Contracts: 7 Key Learnings

Shifting left on governance: DataHub and schema annotations

Data contracts at GoCardless, 6 months on

Improving data quality with data contracts

Tools and frameworks

Schemata

OpenDataMesh

PayPal data contract templates

PolyExpose: a simplistic Polyglot data tool

  • Homepage: https://github.com/velascoluis/polyexpose
  • Prototype: a simplistic Python package implementing the following concepts:
    • The ultimate goal is to ensure reusability. The Data Mesh answer is to introduce the concept of polyglot data, an abstraction to clearly differentiate between the data semantics and the data consumption format/syntax (see the sketch after this list).
    • This is a very elegant approach, with a clear separation of responsibilities between the semantics and the underlying technology, but as Data Mesh does not prescribe any kind of technical architecture, it can sometimes be challenging to visualize or implement.
    • The idea of this repository is to present a potential technology architecture that implements this pattern using as many open-source components as possible.
  • Main contributors: Luis Velasco (Luis Velasco on LinkedIn, Luis Velasco on Medium, Luis Velasco on GitHub)
  • See also Data Contracts: the Mesh Glue (in this page)
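As announced in the list above, here is a small, hypothetical Python sketch of the polyglot data idea: the semantics of the dataset are defined once, while the consumption format/syntax is an interchangeable serialization concern. The names are illustrative and not taken from the polyexpose package.

```python
import csv
import io
import json

# Semantic layer: one canonical definition of the dataset and its records.
CUSTOMER_FIELDS = ["customer_id", "country", "lifetime_value"]
rows = [
    {"customer_id": "c-1", "country": "CA", "lifetime_value": 1200.0},
    {"customer_id": "c-2", "country": "FR", "lifetime_value": 310.5},
]

def expose(records: list[dict], fmt: str) -> str:
    """Expose the same semantic dataset in whichever consumption format a consumer asks for."""
    if fmt == "json":
        return json.dumps(records, indent=2)
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=CUSTOMER_FIELDS)
        writer.writeheader()
        writer.writerows(records)
        return buffer.getvalue()
    raise ValueError(f"unsupported consumption format: {fmt}")

print(expose(rows, "json"))
print(expose(rows, "csv"))
```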

Kolle

Zero/low-code automation of business model representation. Kolle is for working on data models, data contracts, data quality, data profiling, and data lineage instead of technical tooling or platforms.

Smithy

Avro / Schema Registry

Support by cloud vendors

Protocol buffers (Protobuf)

Buz

  • Buz homepage
  • GitHub - Buz
  • Overview: Buz is a system for collecting events from various sources, validating data quality, and delivering them to where they need to be.

Benthos

  • Benthos homepage
  • GitHub - Benthos
  • Overview: Benthos is a high performance and resilient stream processor, able to connect various sources and sinks in a range of brokering patterns and perform hydration, enrichments, transformations and filters on payloads.

Memphis

  • Memphis homepage
  • GitHub - Memphis
  • Overview: A simple, robust, and durable cloud-native message broker wrapped with an entire ecosystem that enables cost-effective, fast, and reliable development of modern queue-based use cases. Memphis enables the building of modern queue-based applications that require large volumes of streamed and enriched data, modern protocols, zero ops, rapid development, extreme cost reduction, and a significantly lower amount of dev time for data-oriented developers and data engineers.

API specifications

Schema.org

Vendor solutions

DBT

AWS

Google

Collibra

AWS

DataMesh Manager

Exploration / Proof-of-Concept (PoC)
