-
Notifications
You must be signed in to change notification settings - Fork 45
/
technical-knowledgebase.qmd
93 lines (76 loc) · 6.65 KB
/
technical-knowledgebase.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
---
title: "Technical Knowledge Base"
description: |
"This section provides a comprehensive overview of various technologies and tools in the field of data engineering."
---
## Relational Databases
- **[PostgreSQL](https://www.postgresql.org)**: An open-source object-relational database system known for its reliability and robust features.
- **[MySQL](https://www.mysql.com)**: The most popular open-source SQL database management system, developed by Oracle Corporation.
- **[Amazon Relational Database System (RDS)](https://aws.amazon.com/rds/)**: A managed service for setting up and scaling databases in the cloud, supporting multiple database engines.
## Columnar Databases
- **[Amazon Redshift](https://aws.amazon.com/redshift/)**: A cloud data warehouse known for its fast performance, available in both provisioned and serverless configurations.
- **[Google BigQuery](https://cloud.google.com/bigquery)**: A serverless enterprise data warehouse that offers high scalability and cost-effectiveness.
## Key-value Stores
- **[Redis](https://redis.io)**: An open-source, in-memory key-value store known for its versatility and performance.
- **[Amazon DynamoDB](https://aws.amazon.com/dynamodb/)**: A fully managed, serverless NoSQL database designed for modern applications.
## Object Storage
- **[Amazon S3](https://aws.amazon.com/s3/)**: Offers scalable and secure object storage services.
- **[Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs)**: Ideal for storing large-scale cloud-native workloads, archives, and data lakes.
- **[Google Cloud Storage](https://cloud.google.com/storage)**: A flexible service for storing and retrieving any amount of data.
## Data Ingestion
- **[Apache Kafka](https://kafka.apache.org/)**: A distributed event streaming platform with various implementations:
- [Confluent's Apache Kafka](https://www.confluent.io): A fully managed service with expert support.
- [Amazon MSK](https://aws.amazon.com/msk/): Amazon's fully managed Kafka service.
- **[AWS SDK for pandas (AWS Wrangler)](https://github.com/aws/aws-sdk-pandas)**: Extends pandas library to AWS, allowing seamless integration with AWS data services.
- **[AWS Kinesis](https://aws.amazon.com/kinesis/)**: A cloud-based service for real-time data processing.
- **[Airbyte](https://github.com/airbytehq/airbyte)**: An open-source data integration platform for ELT pipelines.
- **[Pentaho Data Integration (Kettle)](https://www.hitachivantara.com/en-us/products/pentaho-platform/data-integration-analytics.html)**: Includes both a core data integration engine and a graphical user interface for defining jobs and transformations.
## Data Formats
- **[Apache Avro](https://avro.apache.org)**: A serialization format ideal for streaming data pipelines.
- **[Apache Parquet](https://parquet.apache.org)**: An open-source columnar data file format.
- **[Apache ORC](https://orc.apache.org)**: Optimized for Hadoop workloads.
## Data Storage Framework
- **[Delta Lake](https://delta.io)**: An open-source storage framework by Databricks for building Lakehouse architecture.
- **[Apache Iceberg](https://iceberg.apache.org)**: An open table format for huge analytical datasets, developed by Netflix.
- **[Apache Hudi](https://hudi.apache.org)**: A framework for managing large analytical datasets, created by Uber.
## Batch Processing
### Frameworks and Libraries
- **[Apache Spark](https://spark.apache.org)**: A versatile engine for big data processing, available in various languages like Python (PySpark), Scala, Java, and R.
### SQL Engines
- **[Presto](http://prestodb.io)**: A distributed SQL query engine for big data.
- **[Apache Hive](https://hive.apache.org)**: Built on Hadoop, it facilitates reading, writing, and managing large datasets.
- **[Apache Drill](https://drill.apache.org)**: An open-source SQL query engine.
- **[Trino](https://trino.io)**: A SQL query engine designed for large data sets.
### Managed Services (Cloud)
- **[AWS Elastic MapReduce (EMR)](https://aws.amazon.com/emr/)**: A cloud big data platform for processing large datasets.
- **[AWS Glue](https://aws.amazon.com/glue/)**: A serverless data integration service.
## Stream Processing
- **[Spark Streaming](https://spark.apache.org/docs/latest/streaming-programming-guide.html)**: A part of Apache Spark for processing live data streams.
- **[Spark Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)**: A stream processing engine built on Spark SQL.
- **[Apache Flink](https://flink.apache.org)**: A framework for stateful computations over data streams.
- **[Apache Storm](https://storm.apache.org/about/integrates.html)**: A system for processing streaming data in real-time.
### Data Stores
- **[Apache Druid](https://druid.apache.org)**: A high-performance real-time analytics database.
- **[Apache Pinot](https://pinot.apache.org)**: A real-time distributed OLAP datastore.
## Workflow Orchestration
- **[Apache Airflow](https://airflow.apache.org)**: An open-source platform for managing complex computational workflows and data processing pipelines.
- **[Mage](https://www.mage.ai)**: A modern replacement for Airflow for transforming and integrating data.
- **[Dagster](https://dagster.io)**: An orchestration platform for data assets.
- **[Prefect](https://www.prefect.io)**: A workflow orchestration tool for data pipelines.
- **[Kestra](https://kestra.io)**: An orchestrator for both scheduled and event-driven workflows.
- **[AWS Step Functions](https://aws.amazon.com/step-functions/)**: Coordinates components of distributed applications.
## Data Transformation
### Frameworks
- **[Data Build Tool (dbt)](https://www.getdbt.com)**: A transformation workflow that follows software engineering best practices.
- **[SQLMesh](https://sqlmesh.readthedocs.io/en/stable/)**: An open-source data transformation framework for SQL and Python.
## Data Governance
### Enterprise Data Catalog
- **[DataHub Project](https://datahubproject.io)**: An extensible metadata platform by LinkedIn.
- **[OpenMetadata](https://open-metadata.org)**: A platform for discovering, collaborating, and managing data.
- **[Apache Atlas](https://atlas.apache.org/#/)**: An open-source metadata and governance framework.
- **[Amundsen](https://www.amundsen.io)**: A data discovery and metadata engine by Lyft.
### Data Quality/Observability
- **[Great Expectations](https://greatexpectations.io)**: A platform for data quality and observability.
## Data Platforms
- **[Databricks](https://www.databricks.com)**: Offers a unified Data Lakehouse platform.
- **[Snowflake](https://www.snowflake.com/en/)**: A cloud-native data warehouse platform.