updates
Signed-off-by: (Bit-Mage) <[email protected]>
(Bit-Mage) committed Nov 2, 2024
1 parent 308386a commit 7cdbe46
Showing 25 changed files with 418 additions and 32 deletions.
84 changes: 67 additions & 17 deletions Content/20230717135201-data_engineering.org
@@ -4,23 +4,10 @@
#+title: Data Engineering
#+filetags: :data:

* Stream
** 0x22F4
- see [[id:869abfbd-031b-40a0-9c4b-69c3e7d820ab][real-time data streaming]] and [[id:f4135d2f-3390-4d76-b05a-222f910c10d4][batch computing]]
** 0x22F2
- see [[id:1656ed9e-9ed0-4ddb-9953-98189f6bb42e][Extract, Transform, Load]]
- see [[id:710e11f8-780a-4aa5-84fc-c0ab9bb848c0][Big Data]]
** 0x22F2
- starting out with setting up a data lake
- reading the book: Fundamentals of Data Engineering
- will be populating a lot of relevant nodes that demand further exploration
- all tagged as =:data:=
- do have foundational data science experience; intrigued to explore how it scales out operationally.
- will not be able to do an end-to-end overview of the whole field.
- will be indexing into relevant nodes from the stream sub nodes instead.
* Core Nodes
** Data Engineering Lifecycle
*** Overview
**** Overall Flow
#+begin_src plantuml :file ./images/data-eng-lifecycle.png :exports both
@startuml

@@ -67,17 +54,77 @@ Serving =right=> Applications
- they also need to have side-car processing capabilities to serve complex queries
- storage is omnipresent across the cycle from ingestion to serving results and the transformations sandwiched within
- streaming frameworks like [[id:fa58feb4-25a2-40f1-8533-cafcb0d3886b][apache kafka]] and [[id:5e438030-0096-4b97-8931-f99eb7b738c5][pulsar]] can simultaneously function as ingestion, storage and query systems for messages
**** Ingestion
**** [[id:5cc98814-915c-4e20-a8e5-82ddd6783466][Ingestion]]
**** Transformation

Transformation is the stage of the data engineering lifecycle that converts raw data into a form suitable for analysis and downstream use. Its key aspects are:

- *Extraction*:
- Raw data is sourced from multiple origins, including databases, external data feeds, sensors, and more.

- *Data Cleaning*:
- Removing duplicates, correcting errors, and filling in missing values to ensure data quality.
- Standardizing data formats and naming conventions for consistency.

- *Data Integration*:
- Combining data from different sources to provide a unified view.
- Resolving heterogeneities and conflicts in data schemas.

- *Data Transformation*:
- Changing data from its original form into a format that is analyzable. This includes:
- *Normalization/Denormalization*: Adjusting the data structure for better access or storage.
- *Aggregation*: Summarizing data to provide insights at a higher level.
- *Enrichment*: Adding new data fields derived from existing data to enhance context.

- *Filtering*:
- Removing unnecessary or irrelevant data to focus on what's important.

- *Feature Engineering*:
- Creating new variables or modifying existing ones to improve the performance of models.

- *Validation*:
- Ensuring that transformed data meets quality and integrity standards.
- Conducting checks against business rules and expectations.
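
A minimal sketch of the cleaning, enrichment, aggregation and validation steps listed above, using only the Python standard library. The record fields (=id=, =region=, =amount=), the =is_large= flag and its threshold are hypothetical and chosen only for illustration; a real pipeline would more likely use a dataframe or SQL engine.

#+begin_src python
from collections import defaultdict

# Hypothetical raw records; the field names are assumptions, not from the notes.
RAW = [
    {"id": 1, "region": "EU ", "amount": "10.5"},
    {"id": 1, "region": "EU ", "amount": "10.5"},  # duplicate row
    {"id": 2, "region": "us", "amount": None},     # missing value
    {"id": 3, "region": "US", "amount": "4.5"},
]


def clean(records):
    """Drop duplicate ids, standardize region codes, fill missing amounts with 0.0."""
    seen, out = set(), []
    for r in records:
        if r["id"] in seen:
            continue
        seen.add(r["id"])
        out.append({
            "id": r["id"],
            "region": r["region"].strip().upper(),
            "amount": float(r["amount"]) if r["amount"] is not None else 0.0,
        })
    return out


def enrich(records, threshold=5.0):
    """Add a field derived from existing data (threshold is an assumed value)."""
    return [{**r, "is_large": r["amount"] >= threshold} for r in records]


def validate(records):
    """Check simple business rules and raise on the first violation."""
    for r in records:
        if r["amount"] < 0:
            raise ValueError(f"negative amount for id {r['id']}")
        if not r["region"]:
            raise ValueError(f"empty region for id {r['id']}")
    return records


def aggregate(records):
    """Summarize amounts per region."""
    totals = defaultdict(float)
    for r in records:
        totals[r["region"]] += r["amount"]
    return dict(totals)


rows = validate(enrich(clean(RAW)))
totals = aggregate(rows)
print(totals)
#+end_src

Each step is a plain function over a list of dicts, so the steps can be reordered or tested in isolation. Filling missing amounts with 0.0 is just one possible policy; dropping or flagging such rows is equally valid.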

***** *Connections and Importance*:
- The transformation process is intrinsically connected to subsequent stages of data analytics and machine learning, as the quality and structure of transformed data directly impact the performance of analytics models.
- It ensures that data is suitable for storage in a data warehouse or data lake, where further data exploration can occur.
- By transforming data appropriately, businesses can derive actionable insights that drive strategic decisions.

**** Serving
**** Applications
- [[id:552f0396-488d-43d8-8b44-f68dff74fa5e][Analytics]]
- [[id:49b0dd1e-ca9e-46fa-a0b9-db0ec330833d][MultiTenancy]]
- [[id:20230713T110006.406161][Machine Learning]]
- Reverse [[id:1656ed9e-9ed0-4ddb-9953-98189f6bb42e][ETL]]
*** Undercurrents
**** [[id:6e9b50dc-c5c0-454d-ad99-e6b6968b221a][Security]]
- Access Control for:
- Data
- Systems
- [[id:d4f81cb7-e01b-4115-b8a1-9a303a82699d][The Principle of Least Privilege]]
**** Data Management
- Data Governance
- Discoverability
- Definitions
- Accountability
- Data Modeling
- Data Integrity
**** DataOps
- Data Governance
- Observability and Monitoring
- Incident Reporting
**** Data Architecture
- Analyse tradeoffs
- Design for agility
- Add value to the business
**** [[id:f822f8f6-89eb-4aa8-ac8f-fdcff3f06fb9][Orchestration]]
- Coordinate workflows
- Schedule jobs
- Manage tasks
**** [[id:5c2039f5-0c44-4926-b2d7-a8bf471923ac][Software Engineering]]
- Programming and coding skills
- Software Design Patterns
- Testing and Debugging
*** [[id:9204583f-13ab-4039-9bfc-453700f8b0d1][The Data Life Cycle]]
- The data engineering lifecycle is a subset of the data life cycle (explored separately)
** [[id:710e11f8-780a-4aa5-84fc-c0ab9bb848c0][Big Data]]
@@ -93,4 +140,7 @@ Serving =right=> Applications
** [[id:a34cc866-ec4b-44f5-972f-1c12782f649d][Presto]]
* Resources
** Books
- Fundamentals of Data Engineering
*** Fundamentals of Data Engineering
** Articles
*** Data Observability Driven Development
- https://www.kensu.io/blog/a-guide-to-understanding-data-observability-driven-development
3 changes: 0 additions & 3 deletions Content/20230720113957-graphs.org
@@ -11,9 +11,6 @@ A mathematical protocol used to represent suitable abstractions using nodes and
- Edges : a connection with optional properties between entities
* Prominent Variants
** Directed Acyclic Graph
:PROPERTIES:
:ID: d07976cd-5194-484e-82ab-8c55e064eeb1
:END:
- directed edges, no cycles
- many applications
- check out [[id:78d16b5e-1893-4057-bc22-b2c9a3ca7ed6][Topological Sort]] for a practical application
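
As a small illustration of the "directed edges, no cycles" property, the sketch below orders a dependency graph with =graphlib.TopologicalSorter= from the Python standard library (3.9+). The task names are hypothetical; the =has_cycle= helper is an illustrative wrapper, not part of =graphlib=.

#+begin_src python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline tasks: each key depends on the tasks in its set.
deps = {
    "ingest": set(),
    "transform": {"ingest"},
    "validate": {"transform"},
    "serve": {"validate"},
}

# static_order() yields nodes so that every dependency precedes its dependents.
order = list(TopologicalSorter(deps).static_order())
print(order)


def has_cycle(graph):
    """Return True if the graph is not a DAG (illustrative helper)."""
    try:
        # static_order() is lazy, so it must be consumed for the check to run.
        list(TopologicalSorter(graph).static_order())
        return False
    except CycleError:
        return True
#+end_src

A topological order exists only when the graph has no cycles, which is why the same call doubles as a DAG check.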
5 changes: 2 additions & 3 deletions Content/20230720114059-data.org
@@ -10,6 +10,5 @@
- I'm not sure what particular mechanism/algorithm might be an accurate representation of how one performs search over one's [[id:401e1c2b-fc54-4bee-9a38-d084b8904693][Memory (Mind).]]
- I personally feel like I'm unconsciously employing a hierarchical index (a [[id:1d703f5b-8b5e-4c82-9393-a2c88294c959][Graph]]) that conveniently leads me from a root event to some specifics, but I am yet to fully test this hypothesis and will refrain from academic exploration, proceeding instead with good old Greek thought experiments.
** [[id:20230713T110006.406161][Machine Learning]]
** [[id:2f67eca9-5076-4895-828f-de3655444ee2][DataBase]]
** [[id:2cc32697-c4ce-41b8-987a-2a44a09f78c3][MapReduce]]
** [[id:665e997a-5628-4481-902c-47af4ba30336][Logs]]
** [[id:e9d75f9d-f8bf-4125-beb0-8ca34166ce9e][Data Engineering]]
** [[id:552f0396-488d-43d8-8b44-f68dff74fa5e][Analytics]]
2 changes: 2 additions & 0 deletions Content/20240114203847-agent.org
@@ -9,3 +9,5 @@ An [[id:20240114T203601.390070][Entity]] with the ability (at least) to perform
Complex versions thereof might add the ability to think, comprehend, formulate, and make decisions, beyond the basic requirement of being capable of acting.

An Agent that only thinks and doesn't act (or isn't capable of consequences) isn't an entity of significant consequence.

See [[id:a819cd68-91f9-4d67-b40f-fc37324f708b][Agentic AI]]
60 changes: 57 additions & 3 deletions Content/20240220114146-electronic_storage.org
@@ -1,7 +1,61 @@
:PROPERTIES:
:ID: 18491388-2dcc-488f-8f33-00582cf0f77e
:END:
#+title: Electronic Storage
#+filetags: :electronics:cs:
#+title: Storage
#+filetags: :data:cs:

* Sentinels
* Overview
** *Types of Storage*:
- Primary Storage: Main memory that the CPU addresses directly, typically volatile RAM; it temporarily holds data that is actively being used or processed.
- Secondary Storage: Refers to non-volatile storage like hard drives (HDDs), solid-state drives (SSDs), and optical discs where data is stored for long-term retention.
- Tertiary Storage: Involves storage systems used for archiving and backup such as tape drives or cloud-based cold storage solutions.
- Quaternary Storage: Rarely used term, sometimes refers to off-site storage systems or lesser-used forms like microforms.

** *Storage Technologies*:
- Magnetic Storage: Utilizes magnetic media to store data (e.g., HDDs, magnetic tapes).
- Optical Storage: Uses lasers to read/write data (e.g., CDs, DVDs, Blu-rays).
- Flash Storage: A non-volatile technology derived from EEPROM, used in SSDs and USB flash drives.
- Cloud Storage: Allows data to be stored and accessed over the internet, offered by providers like AWS, Google Cloud, Azure.

** *Key Concepts*:
- Volatility: Determines whether storage retains data when power is lost.
- Capacity: Amount of data a storage medium can hold.
- Speed: Access time and data transfer rates of a storage medium.
- Durability: Resistance to physical wear and data deterioration over time.
* Misc
** Understanding Data Access Frequency
*** Temperatures
**** Hot Data
- accessed many times per day
- could be several times per second
**** Cold Data
- seldom queried
- often retained for compliance purposes
- backups in cases of catastrophic failures
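
A rough sketch of labelling datasets by access frequency. The notes give no numeric cut-off, so the =hot_threshold= of 10 accesses per day is an assumption, as are the dataset names and the shape of the access log.

#+begin_src python
from collections import Counter


def classify(access_log, days, hot_threshold=10.0):
    """Label each dataset 'hot' or 'cold' by its average accesses per day.

    access_log: iterable of dataset names, one entry per access.
    days: length of the observation window in days.
    hot_threshold: assumed cut-off, not taken from the notes.
    """
    if days <= 0:
        raise ValueError("days must be positive")
    counts = Counter(access_log)
    return {
        name: ("hot" if n / days >= hot_threshold else "cold")
        for name, n in counts.items()
    }


# Hypothetical week of accesses: one busy table, one rarely read archive.
log = ["orders"] * 500 + ["audit_2019"] * 2
labels = classify(log, days=7)
print(labels)
#+end_src

In practice the label would feed a tiering decision, for example keeping hot data on fast storage and moving cold data to cheaper archival storage.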
** Handy Questions to evaluate Storage systems

These questions help evaluate storage system choices when architecting a data solution such as a:
- [[id:cfa5fba0-eb2d-4e71-b17a-c646149ab27e][data warehouse]]
- [[id:796b4db7-42dc-4783-bb05-b15524ddf117][data lakehouse]]
- [[id:2f67eca9-5076-4895-828f-de3655444ee2][database]]
- [[id:add20973-54a9-4d96-a938-b27ccbf9c1e6][object storage]]

** Questions
- Is this storage solution compatible with the architecture's required read and write speeds?
- Will storage create a bottleneck for downstream processes?
- Do you understand how this storage technology works?
- are you using the storage system optimally or committing unnatural acts?
- for instance: are you applying a high rate of random access updates in an object storage (an antipattern)?
- Will this storage system handle anticipated future scale?
- you should consider all capacity limits on the storage system: total available storage, read operation rate, write volume, etc
- Will downstream users and processes be able to retrieve data within the required [[id:079db37b-925c-478a-836f-7f6ce8027108][service level agreement]]?
- Are you capturing [[id:5c5245d1-4919-4e13-9232-410f324c0288][metadata]] about the schema evolution, data flows, data lineage and so forth?
- Metadata has a significant impact on the utility of data
- Metadata represents an investment in the future, dramatically enhancing discoverability and institutional knowledge to streamline future projects and architecture changes.
- Is this a pure storage solution (object storage), or does it support complex query patterns (i.e. a cloud data warehouse)?
- Is the storage system schema-agnostic (object storage)? Flexible schema (Cassandra)? Enforced Schema (a cloud data warehouse)?
- How are you tracking master data, golden records data quality, and data lineage for data governance?
- How are you handling regulatory compliance and data sovereignty? For example, can you store your data in certain geographical locations but not others?
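
To make the object storage antipattern above concrete, here is a toy in-memory model, assuming (as is typical of object stores) that an object can only be replaced whole and never patched in place. =ToyObjectStore= and =update_byte= are hypothetical names, not any real client API.

#+begin_src python
class ToyObjectStore:
    """In-memory stand-in for an object store: objects are replaced whole."""

    def __init__(self):
        self._objects = {}
        self.bytes_written = 0  # total bytes sent to the store

    def put(self, key, data):
        self._objects[key] = bytes(data)
        self.bytes_written += len(data)

    def get(self, key):
        return self._objects[key]


def update_byte(store, key, offset, value):
    """A 'random access update': read the whole object, change one byte, rewrite it all."""
    buf = bytearray(store.get(key))
    buf[offset] = value
    store.put(key, bytes(buf))


store = ToyObjectStore()
store.put("events.bin", bytes(1_000_000))  # one 1 MB object of zero bytes
baseline = store.bytes_written

updates = 100
for offset in range(updates):
    update_byte(store, "events.bin", offset, 1)

# Bytes physically written per logical byte changed.
amplification = (store.bytes_written - baseline) / updates
print(amplification)
#+end_src

Every one-byte change rewrites the full object, so write volume grows with object size rather than with the size of the change. Workloads with frequent small updates usually fit a database or a log-structured design better.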
* Relevant Nodes
- [[id:e9d75f9d-f8bf-4125-beb0-8ca34166ce9e][Data Engineering]]
- [[id:1073cfed-a09d-48b6-bd52-ba09708699bf][Message Brokers]]
2 changes: 1 addition & 1 deletion Content/20240807085439-compliance.org
@@ -1,5 +1,5 @@
:PROPERTIES:
:ID: 06cb8fe6-cf1e-4c0c-afdc-f16ab38414ef
:END:
#+title: compliance
#+title: Compliance
#+filetags: :bs:
34 changes: 32 additions & 2 deletions Content/20241029124743-etl.org
@@ -24,5 +24,35 @@
- Helps organizations in making informed business decisions through data insights.
- Enables better data integration across disparate data sources.

* Relevant Nodes
** [[id:015cb100-bd71-4e98-ae7f-03d547b048e5][ELT]] (Extract, Load, Transform)
* Reverse ETL
Reverse ETL is a data integration pattern within data management and analytics.

** *Definition*:
- Reverse ETL refers to the process of moving data from a centralized data warehouse or data lake back to operational systems (like CRM, marketing tools, or sales platforms) to make it actionable for various business operations.

** *Components Involved*:
- *ETL Process*: Extract, Transform, Load (ETL) traditionally involves moving data from operational systems into a [[id:cfa5fba0-eb2d-4e71-b17a-c646149ab27e][data warehouse]] for analysis.
- *Reverse Process*: Reverse ETL involves taking insights or aggregated data from the data warehouse and pushing it back into operational tools for real-time business use.

** *Purpose*:
- Operationalize data insights, allowing teams to act based on centralized data analysis directly within their tools.
- Enhance decision-making with enriched data that is more comprehensively processed within data warehouses.

** *Technologies & Tools*:
- Tools like Census, Hightouch, and Grouparoo specifically cater to reverse ETL functions, enabling data movement back into operational systems.
- These tools integrate with data warehouses such as Snowflake, BigQuery, or Redshift.

** *Use Cases*:
- Enabling marketing automation systems with enhanced customer insights.
- Sending consolidated sales information to CRM systems for better customer interaction.
- Powering real-time reporting and alerting by feeding analyzed data back into business operations.

** *Challenges*:
- Data Consistency: Ensuring that data quality and structure remain consistent across transfers.
- Data Latency: Balancing between real-time data needs and the feasibility of processing and transferring large sets of data swiftly.
- Complexity and Maintenance: Managing the transformations and keeping the system up-to-date with data warehouse changes.

** *Connections*:
- *Data Warehousing*: Fundamental to Reverse ETL, as it acts as the central repository from which data is drawn.
- *ETL and [[id:015cb100-bd71-4e98-ae7f-03d547b048e5][ELT]] Processes*: Provides a framework necessary for data preparation before reverse ETL can occur.
- *Data Governance*: Crucial in maintaining the integrity and security of the data as it moves to and from different platforms.
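
A minimal sketch of the reverse ETL flow described above. An in-memory SQLite database stands in for the warehouse and a plain dict stands in for a CRM; the table, the field names and the =sync_lifetime_value= function are all hypothetical. Real tools such as those named above add scheduling, diffing and API rate limiting, none of which is modelled here.

#+begin_src python
import sqlite3

# Stand-in warehouse: an in-memory SQLite database with an assumed orders table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (customer_id TEXT, amount REAL)")
warehouse.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("c1", 20.0), ("c1", 30.0), ("c2", 5.0)],
)

# Stand-in operational system: hypothetical CRM records keyed by customer id.
crm = {"c1": {"name": "Ada"}, "c2": {"name": "Lin"}}


def sync_lifetime_value(conn, crm_records):
    """Push an aggregate computed in the warehouse onto matching CRM records."""
    rows = conn.execute(
        "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
    )
    updated = 0
    for customer_id, total in rows:
        if customer_id in crm_records:  # skip customers the CRM does not know
            crm_records[customer_id]["lifetime_value"] = total
            updated += 1
    return updated


n = sync_lifetime_value(warehouse, crm)
print(n, crm)
#+end_src

The aggregation happens in the warehouse and only the result is written back, which is the reverse of the usual ETL direction.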
3 changes: 0 additions & 3 deletions Content/20241101175831-cache.org
@@ -4,6 +4,3 @@
#+title: Cache
#+filetags: :data:

- high-speed memory that exploits the principle of temporal locality of reference -> recently accessed data is likely to be accessed again.

- caches are a good first step towards improving a [[id:2f67eca9-5076-4895-828f-de3655444ee2][DataBase's]] performance under multiple accesses.
5 changes: 5 additions & 0 deletions Content/20241101181744-data_lakehouse.org
@@ -0,0 +1,5 @@
:PROPERTIES:
:ID: 796b4db7-42dc-4783-bb05-b15524ddf117
:END:
#+title: Data Lakehouse
#+filetags: :data:
5 changes: 5 additions & 0 deletions Content/20241101181803-object_storage.org
@@ -0,0 +1,5 @@
:PROPERTIES:
:ID: add20973-54a9-4d96-a938-b27ccbf9c1e6
:END:
#+title: Object Storage
#+filetags: :data:
6 changes: 6 additions & 0 deletions Content/20241101182156-service_level_agreement.org
@@ -0,0 +1,6 @@
:PROPERTIES:
:ID: 079db37b-925c-478a-836f-7f6ce8027108
:ROAM_ALIASES: SLA
:END:
#+title: Service Level Agreement
#+filetags: :bs:
5 changes: 5 additions & 0 deletions Content/20241101182746-metadata.org
@@ -0,0 +1,5 @@
:PROPERTIES:
:ID: 5c5245d1-4919-4e13-9232-410f324c0288
:END:
#+title: MetaData
#+filetags: :data:meta: