updates

Signed-off-by: (Bit-Mage) <[email protected]>
rajp152k · Oct 31, 2024 · e675386
1 parent 4c9b14c
commit e675386
Showing 20 changed files with 226 additions and 16 deletions.
diff --git a/Content/20230717135201-data_engineering.org b/Content/20230717135201-data_engineering.org
@@ -5,6 +5,8 @@
 #+filetags: :data:
 
 * Stream
+** 0x22F4
+ - see [[id:869abfbd-031b-40a0-9c4b-69c3e7d820ab][real-time data streaming]] and [[id:f4135d2f-3390-4d76-b05a-222f910c10d4][batch computing]]
 ** 0x22F2
  - see [[id:1656ed9e-9ed0-4ddb-9953-98189f6bb42e][Extract, Transform, Load]]
  - see [[id:710e11f8-780a-4aa5-84fc-c0ab9bb848c0][Big Data]]
@@ -20,30 +22,52 @@
 ** Data Engineering Lifecycle
 #+begin_src plantuml :file ./images/data-eng-lifecycle.png :exports both
 @startuml
-[Ingestion] <-> [Storage]
-[Storage] <-> [Transformation]
-[Storage] <-> [Serving]
 
-package Processing {
-        [Ingestion] --> [Transformation]
-        [Transformation]  --> [Serving]
+frame {
+        [Storage]
 }
 
-[Generation] --> [Ingestion]
+frame Processing {
+        [Ingestion] -right-> [Transformation]
+        [Transformation]  -right-> [Serving]
 
-package Applications {
+        [Storage] -up-> [Ingestion]
+        [Storage] -up-> [Transformation]
+        [Storage] -up-> [Serving]
+}
+
+[Generation] -down-> [Processing]
+
+frame Applications {
 [Analytics]
 [Machine Learning]
 [Reverse ETL]
 }
 
-Serving ==> [Analytics]
-Serving ==> [Machine Learning]
-Serving ==> [Reverse ETL]
-
+Serving =right=> Applications
 @enduml
 #+end_src
+#+RESULTS:
+[[file:./images/data-eng-lifecycle.png]]
 
+** Undercurrents
+*** [[id:6e9b50dc-c5c0-454d-ad99-e6b6968b221a][Security]]
+*** Data Management
+*** DataOps
+*** Data Architecture
+*** [[id:f822f8f6-89eb-4aa8-ac8f-fdcff3f06fb9][Orchestration]]
+*** [[id:5c2039f5-0c44-4926-b2d7-a8bf471923ac][Software Engineering]]
+** [[id:710e11f8-780a-4aa5-84fc-c0ab9bb848c0][Big Data]]
+* Tooling
+** [[id:7aa94354-25d9-441b-993f-31ccc970edd3][Hadoop]]
+** [[id:1978cfeb-5ff8-49d1-a1e1-7306151f9850][Spark]]
+** [[id:ebd4a55a-6d00-4c3f-9a8a-f806a3e5c2bc][Pig]]
+** [[id:62ba92d7-598d-4cc9-b2bd-8bc7bcab7123][Hive]]
+** [[id:bf454d38-3ffb-4ef7-9c3b-5e20b8a5b279][Dremel]]
+** [[id:99aafe54-241d-4683-ae2d-4152bb9801fc][HBase]]
+** [[id:11df321c-ace6-45f2-a080-bdfc2431ae3a][Storm]]
+** [[id:20240519T221905.005300][Cassandra]]
+** [[id:a34cc866-ec4b-44f5-972f-1c12782f649d][Presto]]
 * Resources
 ** Books
  - Fundamentals of Data Engineering
diff --git a/Content/20240519221905-cassandra.org b/Content/20240519221905-cassandra.org
@@ -1,7 +1,7 @@
 :PROPERTIES:
 :ID:       20240519T221905.005300
 :END:
-#+title: Cassandra
+#+title: Apache Cassandra
 #+filetags: :cs:data:
 
  - [[id:a3d0278d-d7b7-47d8-956d-838b79396da7][distributed]] [[id:2f67eca9-5076-4895-828f-de3655444ee2][database]] (Key Value Store)

diff --git a/Content/20240704145848-kafka.org b/Content/20240704145848-kafka.org
@@ -1,5 +1,5 @@
 :PROPERTIES:
 :ID:       fa58feb4-25a2-40f1-8533-cafcb0d3886b
 :END:
-#+title: Kafka
+#+title: Apache Kafka
 #+filetags: :tool:programming:data:
diff --git a/Content/20240805111148-hadoop.org b/Content/20240805111148-hadoop.org
@@ -1,5 +1,35 @@
 :PROPERTIES:
 :ID:       7aa94354-25d9-441b-993f-31ccc970edd3
 :END:
-#+title: Hadoop
+#+title: Apache Hadoop
 #+filetags: :tool:data:
+
+* Distributed File System
+
+Hadoop is a widely used framework for storing and processing large data sets distributed across clusters of computers. Here's a concise breakdown:
+
+- *Hadoop [[id:a3d0278d-d7b7-47d8-956d-838b79396da7][Distributed]] File System (HDFS):*
+  - *Purpose:* Designed to store very large files across multiple machines within a cluster.
+  - *Architecture:* Master-slave, consisting of a NameNode (master) and multiple DataNodes (slaves).
+  - *Characteristics:*
+    - *Fault Tolerance:* Automatically replicates data across multiple nodes.
+    - *High Throughput:* Built around a write-once, read-many access pattern, favoring high sequential read throughput over low-latency access.
+    - *Scalability:* Easily scalable by adding more nodes to the cluster.
+  - *Components:*
+    - *NameNode:* Manages metadata and the namespace; it was a single point of failure in early versions, before Hadoop 2 introduced NameNode high availability.
+    - *DataNode:* Stores actual data and performs operations as instructed by NameNode.
+  - *Data Model:* Stores data in large blocks (128 MB by default, often configured larger), which suits large files read sequentially.
+
+*** Connections and Context:
+
+- *Hadoop Ecosystem:*
+  - *[[id:2cc32697-c4ce-41b8-987a-2a44a09f78c3][MapReduce]]:* Programming model for large-scale data processing in parallel.
+  - *YARN (Yet Another Resource Negotiator):* Manages resources and job scheduling for Hadoop clusters.
+  - *Other Components:* Often includes elements like [[id:62ba92d7-598d-4cc9-b2bd-8bc7bcab7123][Hive]], [[id:ebd4a55a-6d00-4c3f-9a8a-f806a3e5c2bc][Pig]], and [[id:99aafe54-241d-4683-ae2d-4152bb9801fc][HBase]] for data querying and management.
+
+- *Comparison with Other Systems:*
+  - *Vs. Traditional Relational Databases:* More suited to unstructured data and large-scale batch operations.
+  - *Vs. Other File Systems:* Designed so that computation can run on the nodes that hold the data (data locality), whereas general-purpose distributed file systems mainly provide shared storage.
+
+* Evolution
+ - See [[id:1978cfeb-5ff8-49d1-a1e1-7306151f9850][Apache Spark]]
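
The HDFS data model noted in this file (large blocks, replicated across DataNodes) can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the commit: the function name `hdfs_footprint` is hypothetical, and the replication factor of 3 is an assumption, since the note does not state one.

```python
import math

BLOCK_SIZE_MB = 128  # block size stated in the note
REPLICATION = 3      # assumption: the note does not give a replication factor


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (block count, block replicas, raw storage in MB) for one file."""
    if file_size_mb <= 0:
        return 0, 0, 0
    # A file is split into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times on different DataNodes.
    replicas = blocks * replication
    # A partial last block only occupies the bytes it actually holds.
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb


if __name__ == "__main__":
    print(hdfs_footprint(1000))
```

Under these assumptions a 1000 MB file occupies 8 blocks, 24 block replicas, and 3000 MB of raw cluster storage, which is why many small files are costly for the NameNode: each one adds block metadata regardless of its size.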
diff --git a/Content/20241023131919-aritificial_intelligence.org b/Content/20241023131919-aritificial_intelligence.org
@@ -5,7 +5,8 @@
 #+filetags: :ai:
 
 * Stream
-** 0x22F1
+** 0x22F4
+ - the [[id:3504d497-477f-467c-8d6b-d8096c7528c1][Data Science Hierarchy of Needs]]
 ** 0x22EC
  - going for a pragmatic rediscovery of aspects of the field that can be applied to other domains
  - have an academic foundation that could be leveraged into directing miscellaneous research streams

diff --git a/Content/20241031095006-mark_twain.org b/Content/20241031095006-mark_twain.org
@@ -0,0 +1,9 @@
+:PROPERTIES:
+:ID:       d5f7bb46-c839-4e75-99aa-c10eb87f9b5b
+:END:
+#+title: Mark Twain
+#+filetags: :author:
+
+* Quotes
+** History
+History doesn't repeat itself, but it rhymes.
diff --git a/Content/20241031095928-apache_hive.org b/Content/20241031095928-apache_hive.org
@@ -0,0 +1,5 @@
+:PROPERTIES:
+:ID:       62ba92d7-598d-4cc9-b2bd-8bc7bcab7123
+:END:
+#+title: Apache Hive
+#+filetags: :data:tool:
diff --git a/Content/20241031095959-apache_pig.org b/Content/20241031095959-apache_pig.org
@@ -0,0 +1,5 @@
+:PROPERTIES:
+:ID:       ebd4a55a-6d00-4c3f-9a8a-f806a3e5c2bc
+:END:
+#+title: Apache Pig
+#+filetags: :tool:data:
diff --git a/Content/20241031100013-apache_hbase.org b/Content/20241031100013-apache_hbase.org
@@ -0,0 +1,5 @@
+:PROPERTIES:
+:ID:       99aafe54-241d-4683-ae2d-4152bb9801fc
+:END:
+#+title: Apache HBase
+#+filetags: :data:tool:
diff --git a/Content/20241031100518-apache_spark.org b/Content/20241031100518-apache_spark.org
@@ -0,0 +1,46 @@
+:PROPERTIES:
+:ID:       1978cfeb-5ff8-49d1-a1e1-7306151f9850
+:END:
+#+title: Apache Spark
+#+filetags: :data:tool:
+
+* Overview
+** *Definition*
+Apache Spark is an open-source unified analytics engine for large-scale data processing, known for its speed, ease of use, and sophisticated analytics.
+** *Core Features*
+*** Speed
+ Spark keeps intermediate data in memory where possible, which cuts disk I/O and significantly speeds up processing.
+*** Ease of Use
+ Spark provides simple and expressive APIs in Python, Java, Scala, and R, which makes it accessible to a wide range of developers and data scientists.
+*** Advanced Analytics
+ Supports [[id:8bba90f5-5880-4c5d-b969-3ae17b53dc35][SQL]] queries, streaming data, [[id:20230713T110006.406161][machine learning]], and [[id:1d703f5b-8b5e-4c82-9393-a2c88294c959][graph]] processing.
+
+** *Components*
+*** Spark Core
+ The engine that handles memory management and scheduling. It also provides basic functionality such as task dispatching and input/output operations.
+*** Spark SQL
+ Enables querying data via SQL as well as working with DataFrames and Datasets, which are distributed collections of data organized into named columns.
+*** Spark Streaming
+ Allows for real-time data stream processing.
+*** MLlib ([[id:20230713T110006.406161][Machine Learning]] Library)
+ A scalable machine-learning library that leverages Spark’s parallel processing capabilities.
+*** GraphX
+ For graph processing and graph-parallel computation.
+
+** *Deployment Modes*:
+  - Standalone: Runs on Spark's own built-in cluster manager, without an external resource manager.
+  - YARN: Deploys within a Hadoop cluster using YARN (Yet Another Resource Negotiator).
+  - [[id:27a4d68c-adef-42aa-a4b4-b44b3f10395d][Mesos]]: Runs on Apache Mesos, a cluster manager that can also manage other distributed frameworks.
+  - [[id:c2072565-787a-4cea-9894-60fad254f61d][Kubernetes]]: Deployment on a Kubernetes-managed cluster.
+
+** *Use Cases*:
+  - Real-time data analysis
+  - Batch processing
+  - Machine learning model training and evaluation
+  - Interactive data exploration
+
+** Connections:
+- Spark is often integrated with Hadoop’s HDFS for storage, utilizing Hadoop clusters to scale out data processing.
+- It competes with tools like [[id:7aa94354-25d9-441b-993f-31ccc970edd3][Apache Hadoop]] [[id:2cc32697-c4ce-41b8-987a-2a44a09f78c3][MapReduce]] but is often significantly faster, particularly for iterative workloads, because it keeps intermediate results in memory.
+- [[id:fa58feb4-25a2-40f1-8533-cafcb0d3886b][Apache Kafka]] is frequently used alongside Spark Streaming for real-time data ingestion.
+
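
The MapReduce model that Spark is compared against in this file can be sketched in plain Python. This is only an illustration of the map, shuffle, and reduce phases on a single machine, assuming a word-count task; it uses neither Hadoop nor Spark APIs, and every function name here is hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["a b a", "b c"]))
```

In a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data across the network. Hadoop MapReduce writes intermediate results to disk between phases, whereas Spark can keep them in memory, which is the speed difference the note refers to.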
diff --git a/Content/20241031100932-sql.org b/Content/20241031100932-sql.org
@@ -0,0 +1,5 @@
+:PROPERTIES:
+:ID:       8bba90f5-5880-4c5d-b969-3ae17b53dc35
+:END:
+#+title: SQL
+#+filetags: :tool:data:
diff --git a/Content/20241031102025-real_time_streaming.org b/Content/20241031102025-real_time_streaming.org
@@ -0,0 +1,5 @@
+:PROPERTIES:
+:ID:       869abfbd-031b-40a0-9c4b-69c3e7d820ab
+:END:
+#+title: Real Time Processing
+#+filetags: :cs:data:
diff --git a/Content/20241031102104-batch_computing.org b/Content/20241031102104-batch_computing.org
@@ -0,0 +1,5 @@
+:PROPERTIES:
+:ID:       f4135d2f-3390-4d76-b05a-222f910c10d4
+:END:
+#+title: Batch Computing
+#+filetags: :cs:
diff --git a/Content/20241031102327-dremel.org b/Content/20241031102327-dremel.org
@@ -0,0 +1,5 @@
+:PROPERTIES:
+:ID:       bf454d38-3ffb-4ef7-9c3b-5e20b8a5b279
+:END:
+#+title: Dremel
+#+filetags: :data:
diff --git a/Content/20241031102358-apache_storm.org b/Content/20241031102358-apache_storm.org
@@ -0,0 +1,8 @@
+:PROPERTIES:
+:ID:       11df321c-ace6-45f2-a080-bdfc2431ae3a
+:END:
+#+title: Apache Storm
+#+filetags: :data:
+
+* Relevant Nodes
+** [[id:869abfbd-031b-40a0-9c4b-69c3e7d820ab][Real Time Processing]]
diff --git a/Content/20241031102528-presto.org b/Content/20241031102528-presto.org
@@ -0,0 +1,8 @@
+:PROPERTIES:
+:ID:       a34cc866-ec4b-44f5-972f-1c12782f649d
+:END:
+#+title: Presto
+#+filetags: :data:
+
+* Relevant Nodes
+** [[id:8bba90f5-5880-4c5d-b969-3ae17b53dc35][SQL]]
diff --git a/Content/20241031103227-dan_ariely.org b/Content/20241031103227-dan_ariely.org
@@ -0,0 +1,9 @@
+:PROPERTIES:
+:ID:       00b01069-1086-4597-aaba-5239efd0db23
+:END:
+#+title: Dan Ariely
+#+filetags: :author:
+
+* Quotes
+** Big Data
+Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.
diff --git a/Content/20241031103656-matt_turck.org b/Content/20241031103656-matt_turck.org
@@ -0,0 +1,9 @@
+:PROPERTIES:
+:ID:       8de13d87-a38a-4868-9ae4-d59b8bf386ba
+:END:
+#+title: Matt Turck
+#+filetags: :author:
+
+* Works
+** MAD 2024
+ - https://mattturck.com/mad2024/
diff --git a/Content/20241031150229-data_science_hierarchy_of_needs.org b/Content/20241031150229-data_science_hierarchy_of_needs.org
@@ -0,0 +1,31 @@
+:PROPERTIES:
+:ID:       3504d497-477f-467c-8d6b-d8096c7528c1
+:END:
+#+title: Data Science Hierarchy of Needs
+
+From Upstream (root initiatives) to Downstream (consequent initiatives)
+
+* collect
+** instrumentation
+** logging
+** sensors
+** external data
+** user generated content
+* move/store
+** reliable data flow
+** infrastructure
+** pipelines
+** ETL
+** structured data storage
+** unstructured data storage
+* explore/transform
+** cleaning
+** anomaly detection
+** preprocessing/preparation
+* aggregate/label
+** A/B testing
+** Experimentation
+** simpler ML algorithms
+* learn/optimize
+** AI
+** Deep Learning
diff --git a/Content/images/data-eng-lifecycle.png b/Content/images/data-eng-lifecycle.png