Data Systems Design and Implementation

(also available in Papers folder)

Can describe the following terms (a minimal two-phase commit sketch follows the list):
- state machine replication within a shard
- two-phase locking for serializable transaction isolation
- two-phase commit for cross-shard atomicity
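
A minimal sketch of the last of these, two-phase commit: a coordinator asks every shard to prepare and commits only if all of them vote yes. The names (Participant, two_phase_commit) are illustrative, not from any particular system.

```python
# Toy two-phase commit: prepare everywhere, then commit or abort everywhere.
class Participant:
    def __init__(self, name):
        self.name = name
        self.staged = None

    def prepare(self, txn):
        # Phase 1: validate and durably stage the writes, then vote yes/no.
        self.staged = txn
        return True

    def commit(self):
        # Phase 2a: make the staged writes visible.
        print(f"{self.name}: committed {self.staged}")

    def abort(self):
        # Phase 2b: discard the staged writes.
        self.staged = None
        print(f"{self.name}: aborted")


def two_phase_commit(participants, txn):
    votes = [p.prepare(txn) for p in participants]  # phase 1: collect votes
    if all(votes):
        for p in participants:                      # phase 2: commit everywhere
            p.commit()
        return True
    for p in participants:                          # any "no" vote aborts everywhere
        p.abort()
    return False


two_phase_commit([Participant("shard-a"), Participant("shard-b")],
                 {"debit": "acct-1", "credit": "acct-2", "amount": 10})
```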

Terms

  • Table Formats (Iceberg, Hudi, Delta, OneTable, etc.).

  • Features (ACID transaction, Schema Evolution, Upsert, Time Travel, etc.).

  • Tech (Copy on Write vs. Merge on Read, Compaction, Vacuum, OCC (Optimistic Concurrency Control), MVCC (Multiversion Concurrency Control)). An OCC commit sketch follows this list.
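
A rough sketch of OCC in the spirit of how table formats commit: a writer works against a snapshot and its commit succeeds only if the table version has not advanced in the meantime. All names here are illustrative, not taken from any specific format.

```python
# Optimistic concurrency control: read a snapshot, do the work, then
# commit with an atomic check that the version is unchanged; retry on conflict.
import threading

class Table:
    def __init__(self):
        self.version = 0
        self.snapshots = {0: []}       # version -> list of data files
        self._lock = threading.Lock()  # protects only the commit step

    def latest(self):
        return self.version, list(self.snapshots[self.version])

    def try_commit(self, read_version, new_files):
        with self._lock:
            if self.version != read_version:
                return False           # another writer committed first -> conflict
            self.version += 1
            self.snapshots[self.version] = self.snapshots[read_version] + new_files
            return True


def write_with_occ(table, new_files, retries=3):
    for _ in range(retries):
        read_version, _snapshot = table.latest()       # 1. read the current snapshot
        # 2. ... write data files based on the snapshot (not shown) ...
        if table.try_commit(read_version, new_files):  # 3. atomic check-and-commit
            return True
    return False


t = Table()
print(write_with_occ(t, ["part-0001.parquet"]))  # True
```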

Blogs and Videos

RFCs

Blogs

Papers

  • Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine (apache/datafusion#6782). Written in Rust; uses Apache Arrow as its in-memory format.
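
A hedged usage sketch with the DataFusion Python bindings (the datafusion package); the file name and query are made up, and the exact API calls are an assumption about the bindings rather than something stated in the paper.

```python
# Querying a CSV file through DataFusion's Python bindings (assumed API).
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_csv("events", "events.csv")  # "events.csv" is a hypothetical file
df = ctx.sql("SELECT event_type, count(*) AS n FROM events GROUP BY event_type")
df.show()  # execution runs in the Rust, Arrow-based engine under the hood
```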

Projects

Metadata management, data quality, data veracity, data security, data lineage, etc.

Blogs

Papers

  • [VLDB, Amazon - Automating Large-Scale Data Quality Verification](https://www.vldb.org/pvldb/vol11/p1781-schelter.pdf). It presents the design choices and architecture of a production-grade system for checking data quality at scale and shows evaluation results on several datasets.

Best Practices

  • Too few data quality alerts let important issues go unresolved.

  • Too many alerts overwhelm people and can make the most important ones go unnoticed.

  • Statistical modeling techniques (PCA, etc.) can be used to reduce the compute needed for quality checks.

  • Separate anomaly detection from anomaly scoring and the alerting strategy (a sketch of this split follows this list).
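
A toy illustration of the last point, keeping detection, scoring, and alerting as separate steps so each can be tuned independently; the functions and thresholds are hypothetical, not from the paper.

```python
# Detection, scoring, and alerting kept as three independent steps.
from statistics import mean, stdev

def detect(history, today):
    """Detection: turn a raw metric into a deviation (z-score)."""
    mu, sigma = mean(history), stdev(history)
    return 0.0 if sigma == 0 else (today - mu) / sigma

def score(z):
    """Scoring: map a deviation to a severity, independent of how it was detected."""
    z = abs(z)
    return "high" if z > 6 else "medium" if z > 3 else "low"

def alert(metric_name, severity):
    """Alerting strategy: decide who (if anyone) gets notified."""
    if severity == "high":
        print(f"PAGE on-call: {metric_name}")
    elif severity == "medium":
        print(f"open ticket: {metric_name}")
    # "low" -> no alert, to avoid drowning out the important ones

row_counts = [10_000, 10_250, 9_900, 10_100, 10_050]
alert("orders.row_count", score(detect(row_counts, today=4_000)))
```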

Common Issues

  • Issues in the metadata category (data availability, data freshness, schema changes, data completeness) → can be detected without reading the dataset content.

  • Issues in the semantic category (dataset content: column value nullability, duplication, distribution, exceptional values, etc.) → need data profiling; a minimal profiling sketch follows this list.
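
A minimal PySpark sketch of the semantic checks that require profiling dataset content; the table and column names (orders, order_id, amount) are hypothetical.

```python
# Simple content-level profiling: nullability, duplication, distribution.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("orders")  # hypothetical table

# Null fraction per column (nullability check).
null_fractions = df.select(
    [F.avg(F.col(c).isNull().cast("double")).alias(c) for c in df.columns]
)
null_fractions.show()

# Duplication check on a supposed primary key.
dupes = df.groupBy("order_id").count().filter(F.col("count") > 1).count()
print(f"duplicate order_id values: {dupes}")

# Distribution / exceptional-value check on a numeric column.
df.select(F.min("amount"), F.max("amount"),
          F.expr("percentile_approx(amount, 0.99)")).show()
```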

Blogs

GDPR, CCPA, PII Protection, etc.

Observability, Monitoring

Incidents

Architecture

Data Pipelines

Spark and Databricks Compute

Delta Lake

  • [Managing Recalls with Barcode Traceability on the Delta Lake](https://www.databricks.com/blog/managing-recalls-barcode-traceability-delta-lake)

  • [Creating a Spark Streaming ETL pipeline with Delta Lake at Gousto](https://medium.com/gousto-engineering-techbrunch/creating-a-spark-streaming-etl-pipeline-with-delta-lake-at-gousto-6fcbce36eba6)

    • issues and solutions

      • The Spark operation MSCK REPAIR TABLE is costly because it has to scan the table's whole sub-tree in the S3 bucket → use ALTER TABLE ADD PARTITION instead (a sketch of this and the fixes below appears at the end of this Delta Lake section).

      • Not caching DataFrames that are used multiple times → cache them.

      • Rewriting the whole destination table, including old partitions, whenever a new partition arrives → append only the new partition to the destination.

      • The architecture (waiting for CI, Airflow triggering, EMR spinning up, the job run, digging through the AWS console for logs) slowed development down, with a minimum feedback loop of 20 minutes → move away from EMR to a platform that gives complete control over clusters and supports prototyping.

    • Databricks Pros

      • Reduced ETL latency from 2 hours to 15 seconds by using a streaming job and a delta architecture.

      • Spark Structured Streaming Auto Loader helps manage infrastructure (setting up bucket notifications, SNS, and SQS in the background).

      • Notebooks help prototype on and explore production data, and debug interactively with tracebacks and logs; CI/CD then deploys once the code is ready. This reduces the dev cycle from 20 minutes to seconds.

      • Costs stayed the same as before Databricks (smaller instances on the streaming cluster compensated for Databricks' higher cost vs. EMR).

      • Reduced complexity in the codebase and deployment (no Airflow).

      • Better ops: performance dashboards, Spark UI, reports.

    • Other topics: DBT for data modeling, Redshift, SSOT.

  • [Data Modeling Best Practices & Implementation on a Modern Lakehouse](https://www.databricks.com/blog/data-modeling-best-practices-implementation-modern-lakehouse)
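
A hedged PySpark sketch of the fixes mentioned above (explicit partition registration instead of MSCK REPAIR TABLE, caching a reused DataFrame, appending only the new partition, and Databricks Auto Loader). All paths, table names, and columns are made up.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# 1. Register a single new partition instead of scanning the whole S3 sub-tree.
spark.sql("""
    ALTER TABLE sales ADD IF NOT EXISTS PARTITION (dt = '2024-01-01')
    LOCATION 's3://my-bucket/sales/dt=2024-01-01'
""")

# 2. Cache a DataFrame that several downstream steps reuse.
orders = spark.table("orders").filter(F.col("dt") == "2024-01-01").cache()
daily_totals = orders.groupBy("store").agg(F.sum("amount").alias("total"))
daily_counts = orders.groupBy("store").count()

# 3. Append only the new partition instead of rewriting the destination table.
(daily_totals.withColumn("dt", F.lit("2024-01-01"))
    .write.format("delta").mode("append").partitionBy("dt")
    .save("s3://my-bucket/gold/daily_totals"))

# 4. Databricks-only Auto Loader: cloudFiles sets up notifications/SQS behind
#    the scenes (start a writeStream to actually run it; not shown).
stream = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "json")
          .load("s3://my-bucket/raw/orders/"))
```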

Governance

Backfilling

Papers

  • NanoLog: A Nanosecond Scale Logging System. github repo

    • A fast, low-latency, high-throughput C++ logging system.

      • 10-100x faster than existing systems (Log4j2, spdlog)

      • maintains printf-like semantics

      • 80M messages/sec at a median latency of 8 ns in microbenchmarks, 18 ns in applications

    • How?

      • shifting work out of the runtime hot path and into the compilation and post-execution phases of the application.

      • deferring formatting to an offline process (a toy sketch of this idea follows this entry).

    • Benefit: allows detailed logging in low-latency systems.

    • Costs: 512 KB of RAM per thread, one core, and disk bandwidth.
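
A toy Python sketch of the core idea: the hot path records only the format string and raw arguments, and formatting happens later, offline. NanoLog itself does this in C++ with compile-time preprocessing; nothing here is its actual API.

```python
# Deferred-formatting logger: cheap append on the hot path, formatting later.
import sys
import time

class DeferredLogger:
    def __init__(self):
        self.events = []  # in-memory staging buffer

    def log(self, fmt, *args):
        # Hot path: no string formatting, just append a small tuple.
        self.events.append((time.monotonic_ns(), fmt, args))

    def dump(self, out):
        # Offline path: formatting happens here, away from latency-critical code.
        for ts, fmt, args in self.events:
            out.write(f"{ts} " + fmt.format(*args) + "\n")


log = DeferredLogger()
for i in range(3):
    log.log("processed request {} in {} us", i, 42)
log.dump(sys.stdout)
```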

  • How RocksDB Works

  • SREcon22 - Dissecting the Humble LSM Tree and SSTable. A YouTube talk that starts from the ground up, describing what the LSM data structure is and implementing it in code, then covers its operational characteristics and how LSM trees and SSTables are used in practice by high-performance applications that need to store data. A minimal sketch follows this list.

  • Redesigning OLTP for a New Order of Magnitude. An InfoQ talk in which Joran Greef discusses TigerBeetle, a new database, and why OLTP has a growing impedance mismatch: the OLTP workload is becoming more contentious, and row locks, storage faults, write stalls, and non-determinism are now problems.
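
To go with the LSM tree and SSTable talk above, a minimal illustrative sketch: writes land in an in-memory memtable that is flushed to immutable sorted runs (SSTables), and reads check the memtable first, then the runs from newest to oldest. This is a toy, not how any real engine is implemented.

```python
# Toy LSM tree: mutable memtable + immutable sorted runs.
class LSMTree:
    def __init__(self, memtable_limit=4):
        self.memtable = {}          # mutable, absorbs all writes
        self.sstables = []          # immutable sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Freeze the memtable into a sorted, immutable run.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):  # newest run wins
            for k, v in run:                 # real SSTables use binary search / indexes
                if k == key:
                    return v
        return None


db = LSMTree()
for i in range(10):
    db.put(f"k{i}", i)
print(db.get("k3"), db.get("k9"))
```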
