
Data contracts


Overview

This project documents requirements and reference material for implementing data contracts from a data engineering perspective, on a modern data stack (MDS).

Data contracts are essential to decouple data producers from data consumers, while having both parties take responsibility for their respective parts.

Even though members of the GitHub organization may be employed by various companies, they speak on their own behalf and do not represent those companies.

Specifications

Definitions

Definition by Andrew Jones

A data contract is an agreed interface between the generators of data and its consumers. It sets the expectations around that data, defines how it should be governed, and facilitates the explicit generation of quality data that meets the business requirements.

Definition by Atlan

A data contract outlines how data is exchanged between two parties. It defines the structure, format, and rules of exchange in a distributed data architecture. These formal agreements ensure that there are no uncertainties or undocumented assumptions about the data.

Definition by Charles Verleyen

Data contracts: API-based agreements

Without high-quality data, every analytics initiative will be underwhelming at best and actively damaging the business at worst. Data contracts are API-based agreements between producers and consumers designed to solve exactly that problem. Data contracts are not a new concept. They are simply new implementations of a very old idea — that producers and consumers should work together to generate high-quality, semantically valid data from the ground up.

Definition by Jean-Georges Perrin

A data contract acts as an agreement between multiple parties; specifically, a data producer and its consumer(s). A data contract:

  • Creates a link between data producers and data consumers.
  • Creates a link between a logical representation of the data and its physical implementation.
  • Describes “meta meta” data: rules, quality, and behavior (yes, there are two metas in this sentence).
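To make those three links concrete, here is a minimal, illustrative Python sketch of what such a contract could capture. All names and fields below are hypothetical, not taken from any particular standard:

```python
from dataclasses import dataclass, field


@dataclass
class ColumnSpec:
    """Logical description of one column, plus its quality rules."""
    name: str
    data_type: str                 # logical type, e.g. "string" or "int64"
    nullable: bool = True
    checks: list = field(default_factory=list)   # e.g. ["unique"]


@dataclass
class DataContract:
    """Minimal contract: logical schema, physical binding, and behavior."""
    name: str
    owner: str               # the producing team accountable for the data
    columns: list            # the logical representation (ColumnSpec items)
    physical_location: str   # the physical implementation, e.g. a table URI
    freshness_sla: str       # a behavioral rule ("meta meta" data)


# A hypothetical contract for a customer table.
customers_contract = DataContract(
    name="customers",
    owner="crm-team",
    columns=[
        ColumnSpec("customer_id", "int64", nullable=False, checks=["unique"]),
        ColumnSpec("email", "string", nullable=False),
    ],
    physical_location="warehouse.crm.customers",
    freshness_sla="updated daily by 06:00 UTC",
)
```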

Definition by David Jayatillake

In short, a Data Contract is an enforceable agreement on structure and format between the producer and consumer of data. You could even define it in a simpler way: a Data Contract is a guarantee on structure and format by a producer of data.
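As a hedged sketch of what "enforceable" can mean in practice, the following reuses the hypothetical `DataContract` structure from the previous sketch: a producer-side check that refuses to publish records violating the declared structure.

```python
def validate_record(record: dict, contract: DataContract) -> list:
    """Return the list of contract violations for a single record."""
    violations = []
    declared = {col.name for col in contract.columns}
    # Structure: every column must be declared, and none may be missing.
    for key in record:
        if key not in declared:
            violations.append(f"undeclared column: {key}")
    for col in contract.columns:
        if col.name not in record:
            violations.append(f"missing column: {col.name}")
        elif record[col.name] is None and not col.nullable:
            violations.append(f"null in non-nullable column: {col.name}")
    return violations


# A producer would run this before publishing; any violation blocks the write.
issues = validate_record({"customer_id": None, "signup": "2024-01-01"},
                         customers_contract)
print(issues)   # undeclared "signup", missing "email", null "customer_id"
```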

References

Use cases

Web sites, blogs

Data contracts for the warehouse on Substack

Data products, Chad Sanderson on Substack

Data contracts demystified, by Atlan

  • Title: Data Contracts: The Key to Scaling Distributed Data Architecture and Reducing Data Chaos
  • Date: April 2023
  • Link to the web site: https://atlan.com/data-contracts/

Bitol organization

Collection of articles

Astrafy end-to-end implementation of data contracts

Awesome data contracts

Books and articles

Illustrated Guide to Data Products in Action

Data as a Product and Data Contract

Implementing Data Mesh

Data Contract 101

Data Contracts: the Mesh Glue

Data contracts for non-tech readers

Driving Data Quality with Data Contracts

Tables as Interfaces

DBT Model Contracts: Importance and Pitfalls

DBT implementing data contracts

Excerpts

dbt Core v1.5 is slated for release at the end of April, and it will include three new constructs:

  • Access: Choose which models ought to be “private” (implementation details, handling complexity within one team or domain) and “public” (an intentional interface, shared with other teams). Other groups and projects can only ref a model — that is, take a critical dependency on it — in accordance with its access.
  • Contracts: Define the structure of a model explicitly. If your model’s SQL doesn’t match the specified column names and data types, it will fail to build. Breaking changes (removing, renaming, retyping a column) will be caught during CI. On data platforms that support build-time constraints, ensure that columns are not null or pass custom checks while a model is being built, in addition to more flexible testing after.
  • Versions: A single model can have multiple versioned definitions, with the same name for downstream reference. When a mature model with an enforced contract and public access needs to undergo a breaking change, rather than breaking downstream queriers immediately, facilitate their migration by bumping the version and communicating a deprecation window.

In the future, individual teams will own their own data. Data engineering will own “core tables” or “conformed dimensions” that will be used by other teams. Ecommerce will own models related to site visits and conversion rate. Ops will own data related to fulfillment. Etc. Each of these teams will reference the public interfaces exposed by other teams as a part of their work, and periodically release upgrades as versions are incremented on upstream dependencies. Teams will review PRs for their own models, and so have more context for what “good” looks like. Monitoring and alerting will happen in alignment with teams and codebases, so there will be real accountability to delivering a high quality, high reliability data product. Teams will manage their own warehouse spend and optimize accordingly. And teams will be able to publish their own metrics all the way into their analytics tool of choice.
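To make the contracts construct above concrete, here is a hedged Python sketch of the kind of breaking-change check that runs during CI: it compares a model's contracted columns against the columns its new SQL actually produces. The function and schemas are illustrative, not dbt's internal API:

```python
def breaking_changes(contracted: dict, built: dict) -> list:
    """Diff a contracted schema against the columns a model build produced.

    Both arguments map column name -> data type. Removing, renaming, or
    retyping a contracted column is breaking; adding a new column is not.
    """
    changes = []
    for name, dtype in contracted.items():
        if name not in built:
            changes.append(f"removed or renamed column: {name}")
        elif built[name] != dtype:
            changes.append(f"retyped column: {name} ({dtype} -> {built[name]})")
    return changes


# Contracted schema vs. what the model's new SQL actually yields.
contract = {"customer_id": "int", "email": "text"}
new_build = {"customer_id": "bigint", "email_address": "text"}
for change in breaking_changes(contract, new_build):
    print(change)   # both lines are breaking changes, so CI would fail
```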

PayPal open sources its data contract templates

Data contracts, the missing foundation

An engineering guide to data creation and data quality, a data contract perspective

Data Contracts Using Schema Registry

Data contracts for the warehouse

Need for an Open Standard for the Semantic Layer

Data contracts wrapped 2022

Data contracts in practice

Anatomy of a Data Product

An Engineer's guide to Data Contracts

The production-grade Data Pipeline

Yet another post on Data Contracts

Fine, let us talk about data contracts

Data contracts - From zero to hero

Contracts have consequences

Data Person: Attorney At Law

The rise of data contracts

Interfaces and breaking stuff

Implementing Data Contracts: 7 Key Learnings

Shifting left on governance: DataHub and schema annotations

Data contracts at GoCardless, 6 months on

Improving data quality with data contracts

Tools and frameworks

Schemata

OpenDataMesh

Datacontract.com specification and CLI

Bitol - Open Data Contract Standard (ODCS)

PayPal data contract templates

PolyExpose: a simplistic Polyglot data tool

  • Homepage: https://github.com/velascoluis/polyexpose
  • A prototype, simplistic Python package implementing the following concepts:
    • To ensure reusability, the Data Mesh answer is to introduce the concept of polyglot data, an abstraction to clearly differentiate between the data semantics and the data consumption format/syntax.
    • This is a very elegant approach, with a very clear separation of responsibilities between the semantics and the underlying technology; but as Data Mesh does not prescribe any kind of technical architecture, it can sometimes be challenging to visualize or implement.
    • The idea of this repository is to present a potential technology architecture implementing this pattern, using as many open source components as possible.
  • Main contributors: Luis Velasco (Luis Velasco on LinkedIn, Luis Velasco on Medium, Luis Velasco on GitHub)
  • See also Data Contracts: the Mesh Glue (in this page)
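A hedged sketch of the polyglot-data idea itself: one logical dataset, multiple consumption syntaxes. Everything below is illustrative and is not polyexpose's actual API:

```python
import csv
import io
import json


class PolyglotDataset:
    """One logical dataset, exposed through several consumption syntaxes."""

    def __init__(self, rows: list):
        self.rows = rows              # the single source of data semantics

    def as_json(self) -> str:
        """Consumption format for API-style consumers."""
        return json.dumps(self.rows)

    def as_csv(self) -> str:
        """Consumption format for spreadsheet-style consumers."""
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=self.rows[0].keys())
        writer.writeheader()
        writer.writerows(self.rows)
        return buffer.getvalue()


# The semantics stay constant; only the consumption syntax changes.
dataset = PolyglotDataset([{"customer_id": 1, "email": "a@example.com"}])
print(dataset.as_json())
print(dataset.as_csv())
```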

SQLMesh

Nessie

Kolle

Zero/low-code business model representation automation. Kolle is for working on data models, data contracts, data quality, data profiling, and data lineage, instead of technical tooling or platforms.

Smithy

Avro / Schema Registry

Support by cloud vendors

Protocol buffers (Protobuf)

Buz

  • Buz homepage
  • GitHub - Buz
  • Overview: Buz is a system for collecting events from various sources, validating data quality, and delivering them to where they need to be.

Benthos

  • Benthos homepage
  • GitHub - Benthos
  • Overview: Benthos is a high-performance and resilient stream processor, able to connect various sources and sinks in a range of brokering patterns and to perform hydration, enrichment, transformation, and filtering on payloads.
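As a purely conceptual illustration (Benthos itself is configured declaratively, not in Python), the source-to-sink brokering pattern described above can be sketched as:

```python
def source():
    """Stand-in for a stream input: yields raw payloads from upstream."""
    yield {"user_id": 1, "event": "click"}
    yield {"user_id": None, "event": "view"}


def enrich(payload: dict) -> dict:
    """Stand-in for a hydration/enrichment processor."""
    return {**payload, "processed": True}


def sink(payload: dict) -> None:
    """Stand-in for a stream output: deliver the payload downstream."""
    print("delivered:", payload)


# Stream each payload through a filter and a transformation to the sink.
for message in source():
    if message["user_id"] is None:    # the filter step drops bad payloads
        continue
    sink(enrich(message))
```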

Memphis

  • Memphis homepage
  • GitHub - Memphis
  • Overview: A simple, robust, and durable cloud-native message broker wrapped with an entire ecosystem that enables cost-effective, fast, and reliable development of modern queue-based use cases. Memphis enables the building of modern queue-based applications that require large volumes of streamed and enriched data, modern protocols, zero ops, rapid development, extreme cost reduction, and a significantly lower amount of dev time for data-oriented developers and data engineers.

API specifications

Schema.org

Vendor solutions

DBT

AWS

Google

Collibra

DataMesh Manager

Exploration / Proof-of-Concept (PoC)
