Signed-off-by: (Bit-Mage) <[email protected]>
(Bit-Mage) committed Oct 31, 2024 (1 parent 4c9b14c, commit e675386)
Showing 20 changed files with 226 additions and 16 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,5 @@ | ||
:PROPERTIES: | ||
:ID: fa58feb4-25a2-40f1-8533-cafcb0d3886b | ||
:END: | ||
#+title: Kafka | ||
#+title: Apache Kafka | ||
#+filetags: :tool:programming:data: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,35 @@ | ||
:PROPERTIES: | ||
:ID: 7aa94354-25d9-441b-993f-31ccc970edd3 | ||
:END: | ||
#+title: Hadoop | ||
#+title: Apache Hadoop | ||
#+filetags: :tool:data: | ||
|
||
* Distributed File System | ||
|
||
Hadoop is a widely used framework for storing and processing large data sets distributed across clusters of computers. Here's a concise breakdown: | ||
|
||
- *Hadoop [[id:a3d0278d-d7b7-47d8-956d-838b79396da7][Distributed]] File System (HDFS):* | ||
- *Purpose:* Designed to store very large files across multiple machines within a cluster. | ||
- *Architecture:* Master-slave, consisting of a NameNode (master) and multiple DataNodes (slaves). | ||
- *Characteristics:* | ||
- *Fault Tolerance:* Automatically replicates data across multiple nodes. | ||
- *High Throughput:* Follows a write-once, read-many access model that favors high-throughput streaming reads over low-latency random access. | ||
- *Scalability:* Easily scalable by adding more nodes to the cluster. | ||
- *Components:* | ||
- *NameNode:* Manages metadata and the file system namespace; it was a single point of failure until Hadoop 2 introduced NameNode high availability. | ||
- *DataNode:* Stores actual data and performs operations as instructed by NameNode. | ||
- *Data Model:* Stores data in large, configurable blocks (128 MB by default in Hadoop 2 and later), which suits large files and sequential access. | ||
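
The block and replication model above can be made concrete with a little arithmetic. This is a minimal sketch, not HDFS code: the function name =hdfs_footprint= is hypothetical, and the defaults assume a 128 MB block size and a replication factor of 3 (the usual HDFS default), both of which are configurable per cluster.

```python
MB = 1024 * 1024

def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster).

    Hypothetical helper for illustration; the defaults are assumptions.
    """
    # A file is split into fixed-size blocks; the last block may be partial.
    num_blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    # A partial block only uses the space it needs, so the raw footprint is
    # the file size multiplied by the number of replicas.
    raw_bytes = file_size_bytes * replication
    return num_blocks, raw_bytes

blocks, raw = hdfs_footprint(1000 * MB)
print(blocks, raw // MB)  # 8 blocks, 3000 MB of raw storage
```

A 1000 MB file therefore occupies 8 blocks, and the NameNode tracks 8 block entries (each with 3 replica locations) rather than one large object.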
|
||
** Connections and Context: | ||
|
||
- *Hadoop Ecosystem:* | ||
- *[[id:2cc32697-c4ce-41b8-987a-2a44a09f78c3][MapReduce]]:* Programming model for large-scale data processing in parallel. | ||
- *YARN (Yet Another Resource Negotiator):* Manages resources and job scheduling for Hadoop clusters. | ||
- *Other Components:* Often includes elements like [[id:62ba92d7-598d-4cc9-b2bd-8bc7bcab7123][Hive]], [[id:ebd4a55a-6d00-4c3f-9a8a-f806a3e5c2bc][Pig]], and [[id:99aafe54-241d-4683-ae2d-4152bb9801fc][HBase]] for data querying and management. | ||
|
||
- *Comparison with Other Systems:* | ||
- *Vs. Traditional Relational Databases:* More suited to unstructured data and large-scale batch operations. | ||
- *Vs. Other File Systems:* Designed for large files and batch-oriented, parallel processing on commodity hardware, unlike general-purpose file systems that are tuned for low-latency access to many small files. | ||
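
The MapReduce programming model listed above can be sketched in plain Python. This is an illustration of the map, shuffle, and reduce phases on a single machine, not Hadoop's API; all function names here are hypothetical.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big clusters", "data nodes"])
print(counts)  # {'big': 2, 'data': 2, 'clusters': 1, 'nodes': 1}
```

In a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data over the network; the sketch only shows how the three phases fit together.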
|
||
* Evolution | ||
- See [[id:1978cfeb-5ff8-49d1-a1e1-7306151f9850][Apache Spark]] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
:PROPERTIES: | ||
:ID: d5f7bb46-c839-4e75-99aa-c10eb87f9b5b | ||
:END: | ||
#+title: Mark Twain | ||
#+filetags: :author: | ||
|
||
* Quotes | ||
** History | ||
History doesn't repeat itself, but it rhymes. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
:PROPERTIES: | ||
:ID: 62ba92d7-598d-4cc9-b2bd-8bc7bcab7123 | ||
:END: | ||
#+title: Apache Hive | ||
#+filetags: :data:tool: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
:PROPERTIES: | ||
:ID: ebd4a55a-6d00-4c3f-9a8a-f806a3e5c2bc | ||
:END: | ||
#+title: Apache Pig | ||
#+filetags: :tool:data: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
:PROPERTIES: | ||
:ID: 99aafe54-241d-4683-ae2d-4152bb9801fc | ||
:END: | ||
#+title: Apache HBase | ||
#+filetags: :data:tool: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
:PROPERTIES: | ||
:ID: 1978cfeb-5ff8-49d1-a1e1-7306151f9850 | ||
:END: | ||
#+title: Apache Spark | ||
#+filetags: :data:tool: | ||
|
||
* Overview | ||
** *Definition* | ||
Apache Spark is an open-source unified analytics engine for large-scale data processing, known for its speed, ease of use, and sophisticated analytics. | ||
** *Core Features* | ||
*** Speed | ||
Spark keeps intermediate data in memory where possible, which avoids repeated disk I/O between processing steps and significantly speeds up iterative and interactive workloads. | ||
*** Ease of Use | ||
Spark provides simple and expressive APIs in Python, Java, Scala, and R, which makes it accessible to a wide range of developers and data scientists. | ||
*** Advanced Analytics | ||
Supports [[id:8bba90f5-5880-4c5d-b969-3ae17b53dc35][SQL]] queries, streaming data, [[id:20230713T110006.406161][machine learning]], and [[id:1d703f5b-8b5e-4c82-9393-a2c88294c959][graph]] processing. | ||
|
||
** *Components* | ||
*** Spark Core | ||
The underlying engine, responsible for memory management and task scheduling. It also provides basic functionality such as task dispatching and input/output operations. | ||
*** Spark SQL | ||
Enables querying data via SQL as well as working with DataFrames and Datasets, which are distributed collections of data organized into named columns. | ||
*** Spark Streaming | ||
Processes live data streams in near real time by dividing them into small batches (micro-batches). | ||
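
Spark Streaming's classic DStream API treats a stream as a sequence of small batches, each covering a fixed time interval. The idea can be sketched in plain Python; this is not Spark's API, and the function name =micro_batches= and the sample events are hypothetical.

```python
from collections import defaultdict

def micro_batches(events, batch_interval):
    """Group (timestamp_seconds, value) events into fixed-interval batches.

    Returns the non-empty batches in time order. Illustrative only.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        # Events whose timestamps fall in the same interval share a batch.
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(value)
    return [batches[i] for i in sorted(batches)]

events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d")]
for batch in micro_batches(events, batch_interval=1.0):
    print(len(batch), batch)
```

Each batch is then processed with the same operations used for batch jobs, which is why latency is bounded below by the batch interval.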
*** MLlib ([[id:20230713T110006.406161][Machine Learning]] Library) | ||
A scalable machine-learning library that leverages Spark’s parallel processing capabilities. | ||
*** GraphX | ||
For graph processing and graph-parallel computation. | ||
|
||
** *Deployment Modes*: | ||
- Standalone: Uses Spark's own built-in cluster manager, so no external resource manager is required. | ||
- YARN: Deploys within a Hadoop cluster using YARN (Yet Another Resource Negotiator). | ||
- [[id:27a4d68c-adef-42aa-a4b4-b44b3f10395d][Mesos]]: Runs on Apache Mesos, a cluster manager that can also manage other distributed frameworks. | ||
- [[id:c2072565-787a-4cea-9894-60fad254f61d][Kubernetes]]: Deployment on a Kubernetes-managed cluster. | ||
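
Each deployment mode above corresponds to a different master URL passed to =spark-submit --master=. The helper below is a hypothetical sketch that only builds those URL strings; the host names and ports in the usage lines are placeholders, not values from any real cluster.

```python
def master_url(mode, host=None, port=None):
    """Build a Spark master URL for the given deployment mode.

    Hypothetical helper for illustration; it does not contact a cluster.
    """
    if mode == "yarn":
        # The cluster location is read from the Hadoop configuration.
        return "yarn"
    formats = {
        "standalone": "spark://{host}:{port}",
        "mesos": "mesos://{host}:{port}",
        "kubernetes": "k8s://https://{host}:{port}",
    }
    if mode not in formats:
        raise ValueError(f"unknown deployment mode: {mode}")
    if host is None or port is None:
        raise ValueError(f"mode {mode!r} needs a host and a port")
    return formats[mode].format(host=host, port=port)

# Placeholder hosts and ports, for illustration only.
print(master_url("standalone", "master.example", 7077))
print(master_url("yarn"))
print(master_url("kubernetes", "k8s.example", 6443))
```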
|
||
** *Use Cases*: | ||
- Real-time data analysis | ||
- Batch processing | ||
- Machine learning model training and evaluation | ||
- Interactive data exploration | ||
|
||
** Connections: | ||
- Spark is often integrated with Hadoop’s HDFS for storage, utilizing Hadoop clusters to scale out data processing. | ||
- It competes with [[id:7aa94354-25d9-441b-993f-31ccc970edd3][Apache Hadoop]] [[id:2cc32697-c4ce-41b8-987a-2a44a09f78c3][MapReduce]] and is often significantly faster, especially for iterative and interactive workloads, because it keeps intermediate results in memory instead of writing them to disk between stages. | ||
- [[id:fa58feb4-25a2-40f1-8533-cafcb0d3886b][Apache Kafka]] is frequently used alongside Spark Streaming for real-time data ingestion. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
:PROPERTIES: | ||
:ID: 8bba90f5-5880-4c5d-b969-3ae17b53dc35 | ||
:END: | ||
#+title: SQL | ||
#+filetags: :tool:data: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
:PROPERTIES: | ||
:ID: 869abfbd-031b-40a0-9c4b-69c3e7d820ab | ||
:END: | ||
#+title: Real Time Processing | ||
#+filetags: :cs:data: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
:PROPERTIES: | ||
:ID: f4135d2f-3390-4d76-b05a-222f910c10d4 | ||
:END: | ||
#+title: Batch Computing | ||
#+filetags: :cs: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
:PROPERTIES: | ||
:ID: bf454d38-3ffb-4ef7-9c3b-5e20b8a5b279 | ||
:END: | ||
#+title: Dremel | ||
#+filetags: :data: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
:PROPERTIES: | ||
:ID: 11df321c-ace6-45f2-a080-bdfc2431ae3a | ||
:END: | ||
#+title: Apache Storm | ||
#+filetags: :data: | ||
|
||
* Relevant Nodes | ||
** [[id:869abfbd-031b-40a0-9c4b-69c3e7d820ab][Real Time Processing]] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
:PROPERTIES: | ||
:ID: a34cc866-ec4b-44f5-972f-1c12782f649d | ||
:END: | ||
#+title: Presto | ||
#+filetags: :data: | ||
|
||
* Relevant Nodes | ||
** [[id:8bba90f5-5880-4c5d-b969-3ae17b53dc35][SQL]] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
:PROPERTIES: | ||
:ID: 00b01069-1086-4597-aaba-5239efd0db23 | ||
:END: | ||
#+title: Dan Ariely | ||
#+filetags: :author: | ||
|
||
* Quotes | ||
** Big Data | ||
Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
:PROPERTIES: | ||
:ID: 8de13d87-a38a-4868-9ae4-d59b8bf386ba | ||
:END: | ||
#+title: Matt Turck | ||
#+filetags: :author: | ||
|
||
* Works | ||
** MAD 2024 | ||
- https://mattturck.com/mad2024/ |
Content/20241031150229-data_science_hierarchy_of_needs.org (31 changes: 31 additions & 0 deletions)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
:PROPERTIES: | ||
:ID: 3504d497-477f-467c-8d6b-d8096c7528c1 | ||
:END: | ||
#+title: Data Science Hierarchy of Needs | ||
|
||
From Upstream (root initiatives) to Downstream (consequent initiatives) | ||
|
||
* collect | ||
** instrumentation | ||
** logging | ||
** sensors | ||
** external data | ||
** user generated content | ||
* move/store | ||
** reliable data flow | ||
** infrastructure | ||
** pipelines | ||
** ETL | ||
** structured data storage | ||
** unstructured data storage | ||
* explore/transform | ||
** cleaning | ||
** anomaly detection | ||
** preprocessing/preparation | ||
* aggregate/label | ||
** A/B testing | ||
** experimentation | ||
** simpler ML algorithms | ||
* learn/optimize | ||
** AI | ||
** Deep Learning |