
Commit

updates
Signed-off-by: (Bit-Mage) <[email protected]>
(Bit-Mage) committed Oct 31, 2024
1 parent 4c9b14c commit e675386
Showing 20 changed files with 226 additions and 16 deletions.
48 changes: 36 additions & 12 deletions Content/20230717135201-data_engineering.org
Original file line number Diff line number Diff line change
@@ -5,6 +5,8 @@
#+filetags: :data:

* Stream
** 0x22F4
- see [[id:869abfbd-031b-40a0-9c4b-69c3e7d820ab][real-time data streaming]] and [[id:f4135d2f-3390-4d76-b05a-222f910c10d4][batch computing]]
** 0x22F2
- see [[id:1656ed9e-9ed0-4ddb-9953-98189f6bb42e][Extract, Transform, Load]]
- see [[id:710e11f8-780a-4aa5-84fc-c0ab9bb848c0][Big Data]]
@@ -20,30 +22,52 @@
** Data Engineering Lifecycle
#+begin_src plantuml :file ./images/data-eng-lifecycle.png :exports both
@startuml
[Ingestion] <-> [Storage]
[Storage] <-> [Transformation]
[Storage] <-> [Serving]

package Processing {
[Ingestion] --> [Transformation]
[Transformation] --> [Serving]
frame {
[Storage]
}

[Generation] --> [Ingestion]
frame Processing {
[Ingestion] -right-> [Transformation]
[Transformation] -right-> [Serving]

package Applications {
[Storage] -up-> [Ingestion]
[Storage] -up-> [Transformation]
[Storage] -up-> [Serving]
}

[Generation] -down-> [Processing]

frame Applications {
[Analytics]
[Machine Learning]
[Reverse ETL]
}

Serving ==> [Analytics]
Serving ==> [Machine Learning]
Serving ==> [Reverse ETL]

Serving =right=> Applications
@enduml
#+end_src
#+RESULTS:
[[file:./images/data-eng-lifecycle.png]]

** Undercurrents
*** [[id:6e9b50dc-c5c0-454d-ad99-e6b6968b221a][Security]]
*** Data Management
*** DataOps
*** Data Architecture
*** [[id:f822f8f6-89eb-4aa8-ac8f-fdcff3f06fb9][Orchestration]]
*** [[id:5c2039f5-0c44-4926-b2d7-a8bf471923ac][Software Engineering]]
** [[id:710e11f8-780a-4aa5-84fc-c0ab9bb848c0][Big Data]]
* Tooling
** [[id:7aa94354-25d9-441b-993f-31ccc970edd3][Hadoop]]
** [[id:1978cfeb-5ff8-49d1-a1e1-7306151f9850][Spark]]
** [[id:ebd4a55a-6d00-4c3f-9a8a-f806a3e5c2bc][Pig]]
** [[id:62ba92d7-598d-4cc9-b2bd-8bc7bcab7123][Hive]]
** [[id:bf454d38-3ffb-4ef7-9c3b-5e20b8a5b279][Dremel]]
** [[id:99aafe54-241d-4683-ae2d-4152bb9801fc][HBase]]
** [[id:11df321c-ace6-45f2-a080-bdfc2431ae3a][Storm]]
** [[id:20240519T221905.005300][Cassandra]]
** [[id:a34cc866-ec4b-44f5-972f-1c12782f649d][Presto]]
* Resources
** Books
- Fundamentals of Data Engineering
2 changes: 1 addition & 1 deletion Content/20240519221905-cassandra.org
@@ -1,7 +1,7 @@
:PROPERTIES:
:ID: 20240519T221905.005300
:END:
#+title: Cassandra
#+title: Apache Cassandra
#+filetags: :cs:data:

- [[id:a3d0278d-d7b7-47d8-956d-838b79396da7][distributed]] [[id:2f67eca9-5076-4895-828f-de3655444ee2][database]] (Key Value Store)
2 changes: 1 addition & 1 deletion Content/20240704145848-kafka.org
@@ -1,5 +1,5 @@
:PROPERTIES:
:ID: fa58feb4-25a2-40f1-8533-cafcb0d3886b
:END:
#+title: Kafka
#+title: Apache Kafka
#+filetags: :tool:programming:data:
32 changes: 31 additions & 1 deletion Content/20240805111148-hadoop.org
@@ -1,5 +1,35 @@
:PROPERTIES:
:ID: 7aa94354-25d9-441b-993f-31ccc970edd3
:END:
#+title: Hadoop
#+title: Apache Hadoop
#+filetags: :tool:data:

* Distributed File System

Hadoop is a widely used framework for storing and processing large data sets across clusters of computers. A concise breakdown:

- *Hadoop [[id:a3d0278d-d7b7-47d8-956d-838b79396da7][Distributed]] File System (HDFS):*
  - *Purpose:* Stores very large files across multiple machines in a cluster.
  - *Architecture:* Master-slave, consisting of a NameNode (master) and multiple DataNodes (slaves).
  - *Characteristics:*
    - *Fault Tolerance:* Automatically replicates each block across multiple nodes (three copies by default).
    - *High Throughput:* Follows a write-once, read-many access model that favors streaming throughput over low-latency access.
    - *Scalability:* Scales horizontally by adding more nodes to the cluster.
  - *Components:*
    - *NameNode:* Manages metadata and the file system namespace; it was a single point of failure before Hadoop 2.x added NameNode high availability.
    - *DataNode:* Stores the actual data blocks and performs operations as instructed by the NameNode.
  - *Data Model:* Stores data in large blocks (128 MB by default, configurable), optimizing for large-file storage.
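
The block and replication model above can be made concrete with some arithmetic. This is a minimal sketch, not HDFS code: the function name =hdfs_footprint= is hypothetical, and the 128 MB block size and replication factor of 3 are assumed as typical defaults that a real cluster may override.

```python
import math

BLOCK_SIZE_MB = 128      # assumed default block size; configurable per cluster
REPLICATION_FACTOR = 3   # assumed default replication factor; configurable per file


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Return (block count, total block replicas, raw storage in MB) for one file."""
    # A file is split into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times on different DataNodes.
    replicas = blocks * replication
    # A partial last block only uses the space it needs, so raw storage
    # is simply the file size multiplied by the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb


blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)  # 8 24 3000
```

A 1000 MB file therefore occupies 8 blocks, which the NameNode tracks as 24 block replicas spread over the DataNodes.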

** Connections and Context

- *Hadoop Ecosystem:*
  - *[[id:2cc32697-c4ce-41b8-987a-2a44a09f78c3][MapReduce]]:* Programming model for large-scale data processing in parallel.
  - *YARN (Yet Another Resource Negotiator):* Manages resources and job scheduling for Hadoop clusters.
  - *Other Components:* Often includes elements like [[id:62ba92d7-598d-4cc9-b2bd-8bc7bcab7123][Hive]], [[id:ebd4a55a-6d00-4c3f-9a8a-f806a3e5c2bc][Pig]], and [[id:99aafe54-241d-4683-ae2d-4152bb9801fc][HBase]] for data querying and management.

- *Comparison with Other Systems:*
  - *Vs. Traditional Relational Databases:* More suited to unstructured data and large-scale batch operations.
  - *Vs. Other File Systems:* Designed for large sequential reads by parallel batch jobs on commodity hardware, rather than for general-purpose, low-latency file access.
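
The MapReduce programming model mentioned above can be sketched in plain Python. This is an illustration of the map, shuffle, and reduce phases running in a single process, not the Hadoop API; the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a cluster, each map call could run on a different node,
    # ideally one that already holds the input block locally.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big clusters", "data pipelines"])
print(counts)
```

Because map calls are independent of each other and reduce calls only depend on one key's values, both phases can be parallelized across machines.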

* Evolution
- See [[id:1978cfeb-5ff8-49d1-a1e1-7306151f9850][Apache Spark]]
3 changes: 2 additions & 1 deletion Content/20241023131919-aritificial_intelligence.org
@@ -5,7 +5,8 @@
#+filetags: :ai:

* Stream
** 0x22F1
** 0x22F4
- the [[id:3504d497-477f-467c-8d6b-d8096c7528c1][Data Science Hierarchy of Needs]]
** 0x22EC
- going for a pragmatic rediscovery of aspects of the field that can be applied to other domains
- have an academic foundation that could be leveraged into directing miscellaneous research streams
9 changes: 9 additions & 0 deletions Content/20241031095006-mark_twain.org
@@ -0,0 +1,9 @@
:PROPERTIES:
:ID: d5f7bb46-c839-4e75-99aa-c10eb87f9b5b
:END:
#+title: Mark Twain
#+filetags: :author:

* Quotes
** History
History doesn't repeat itself, but it rhymes.
5 changes: 5 additions & 0 deletions Content/20241031095928-apache_hive.org
@@ -0,0 +1,5 @@
:PROPERTIES:
:ID: 62ba92d7-598d-4cc9-b2bd-8bc7bcab7123
:END:
#+title: Apache Hive
#+filetags: :data:tool:
5 changes: 5 additions & 0 deletions Content/20241031095959-apache_pig.org
@@ -0,0 +1,5 @@
:PROPERTIES:
:ID: ebd4a55a-6d00-4c3f-9a8a-f806a3e5c2bc
:END:
#+title: Apache Pig
#+filetags: :tool:data:
5 changes: 5 additions & 0 deletions Content/20241031100013-apache_hbase.org
@@ -0,0 +1,5 @@
:PROPERTIES:
:ID: 99aafe54-241d-4683-ae2d-4152bb9801fc
:END:
#+title: Apache HBase
#+filetags: :data:tool:
46 changes: 46 additions & 0 deletions Content/20241031100518-apache_spark.org
@@ -0,0 +1,46 @@
:PROPERTIES:
:ID: 1978cfeb-5ff8-49d1-a1e1-7306151f9850
:END:
#+title: Apache Spark
#+filetags: :data:tool:

* Overview
** *Definition*
Apache Spark is an open-source unified analytics engine for large-scale data processing, known for its speed, ease of use, and sophisticated analytics.
** *Core Features*
*** Speed
Spark keeps data in memory where possible, reducing disk I/O and significantly speeding up processing, especially for iterative and interactive workloads.
*** Ease of Use
Spark provides simple and expressive APIs in Python, Java, Scala, and R, which makes it accessible to a wide range of developers and data scientists.
*** Advanced Analytics
Supports [[id:8bba90f5-5880-4c5d-b969-3ae17b53dc35][SQL]] queries, streaming data, [[id:20230713T110006.406161][machine learning]], and [[id:1d703f5b-8b5e-4c82-9393-a2c88294c959][graph]] processing.

** *Components*
*** Spark Core
The underlying engine that handles memory management and task scheduling. It also provides basic functionality such as task dispatching and input/output operations.
*** Spark SQL
Enables querying data via SQL as well as working with DataFrames and Datasets, which are distributed collections of data organized into named columns.
*** Spark Streaming
Allows for near-real-time processing of data streams, which it handles as a sequence of small batches (micro-batches).
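
Spark Streaming's classic API treats a stream as a series of small batches. The idea can be sketched in plain Python as follows. This is not the Spark API: the function names are hypothetical, and batches are cut by element count here for simplicity, whereas Spark cuts them by time interval.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Yield successive lists of at most batch_size events from an iterable."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_totals(events, batch_size):
    """Process a stream batch by batch, carrying state between batches."""
    total = 0
    totals = []
    for batch in micro_batches(events, batch_size):
        # Each batch is processed like a small batch job.
        total += sum(batch)
        totals.append(total)
    return totals


print(running_totals([1, 2, 3, 4, 5], batch_size=2))  # [3, 10, 15]
```

Smaller batches lower latency but add per-batch overhead, which is the main trade-off of the micro-batch approach.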
*** MLlib ([[id:20230713T110006.406161][Machine Learning]] Library)
A scalable machine-learning library that leverages Spark’s parallel processing capabilities.
*** GraphX
For graph processing and graph-parallel computation.

** *Deployment Modes*:
- Standalone: Uses Spark's own built-in cluster manager, so no external resource manager is required.
- YARN: Deploys within a Hadoop cluster using YARN (Yet Another Resource Negotiator).
- [[id:27a4d68c-adef-42aa-a4b4-b44b3f10395d][Mesos]]: Runs on Apache Mesos, a cluster manager that can also manage other distributed frameworks.
- [[id:c2072565-787a-4cea-9894-60fad254f61d][Kubernetes]]: Deployment on a Kubernetes-managed cluster.

** *Use Cases*:
- Real-time data analysis
- Batch processing
- Machine learning model training and evaluation
- Interactive data exploration

** Connections:
- Spark is often integrated with Hadoop’s HDFS for storage, utilizing Hadoop clusters to scale out data processing.
- It competes with tools like [[id:7aa94354-25d9-441b-993f-31ccc970edd3][Apache Hadoop]] [[id:2cc32697-c4ce-41b8-987a-2a44a09f78c3][MapReduce]] and is often significantly faster because it can keep intermediate results in memory instead of writing them to disk between stages.
- [[id:fa58feb4-25a2-40f1-8533-cafcb0d3886b][Apache Kafka]] is frequently used alongside Spark Streaming for real-time data ingestion.

5 changes: 5 additions & 0 deletions Content/20241031100932-sql.org
@@ -0,0 +1,5 @@
:PROPERTIES:
:ID: 8bba90f5-5880-4c5d-b969-3ae17b53dc35
:END:
#+title: SQL
#+filetags: :tool:data:
5 changes: 5 additions & 0 deletions Content/20241031102025-real_time_streaming.org
@@ -0,0 +1,5 @@
:PROPERTIES:
:ID: 869abfbd-031b-40a0-9c4b-69c3e7d820ab
:END:
#+title: Real Time Processing
#+filetags: :cs:data:
5 changes: 5 additions & 0 deletions Content/20241031102104-batch_computing.org
@@ -0,0 +1,5 @@
:PROPERTIES:
:ID: f4135d2f-3390-4d76-b05a-222f910c10d4
:END:
#+title: Batch Computing
#+filetags: :cs:
5 changes: 5 additions & 0 deletions Content/20241031102327-dremel.org
@@ -0,0 +1,5 @@
:PROPERTIES:
:ID: bf454d38-3ffb-4ef7-9c3b-5e20b8a5b279
:END:
#+title: Dremel
#+filetags: :data:
8 changes: 8 additions & 0 deletions Content/20241031102358-apache_storm.org
@@ -0,0 +1,8 @@
:PROPERTIES:
:ID: 11df321c-ace6-45f2-a080-bdfc2431ae3a
:END:
#+title: Apache Storm
#+filetags: :data:

* Relevant Nodes
** [[id:869abfbd-031b-40a0-9c4b-69c3e7d820ab][Real Time Processing]]
8 changes: 8 additions & 0 deletions Content/20241031102528-presto.org
@@ -0,0 +1,8 @@
:PROPERTIES:
:ID: a34cc866-ec4b-44f5-972f-1c12782f649d
:END:
#+title: Presto
#+filetags: :data:

* Relevant Nodes
** [[id:8bba90f5-5880-4c5d-b969-3ae17b53dc35][SQL]]
9 changes: 9 additions & 0 deletions Content/20241031103227-dan_ariely.org
@@ -0,0 +1,9 @@
:PROPERTIES:
:ID: 00b01069-1086-4597-aaba-5239efd0db23
:END:
#+title: Dan Ariely
#+filetags: :author:

* Quotes
** Big Data
Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.
9 changes: 9 additions & 0 deletions Content/20241031103656-matt_turck.org
@@ -0,0 +1,9 @@
:PROPERTIES:
:ID: 8de13d87-a38a-4868-9ae4-d59b8bf386ba
:END:
#+title: Matt Turck
#+filetags: :author:

* Works
** MAD 2024
- https://mattturck.com/mad2024/
31 changes: 31 additions & 0 deletions Content/20241031150229-data_science_hierarchy_of_needs.org
@@ -0,0 +1,31 @@
:PROPERTIES:
:ID: 3504d497-477f-467c-8d6b-d8096c7528c1
:END:
#+title: Data Science Hierarchy of Needs

From Upstream (root initiatives) to Downstream (consequent initiatives)

* collect
** instrumentation
** logging
** sensors
** external data
** user generated content
* move/store
** reliable data flow
** infrastructure
** pipelines
** ETL
** structured data storage
** unstructured data storage
* explore/transform
** cleaning
** anomaly detection
** preprocessing/preparation
* aggregate/label
** A/B testing
** Experimentation
** simpler ML algorithms
* learn/optimize
** AI
** Deep Learning
Binary file modified Content/images/data-eng-lifecycle.png