:PROPERTIES:
:ID: 235113d9-983a-4782-a4e8-d027ba52d82b
:END:
#+title: Hashing
#+filetags: :cs:programming:data:

- Hashmap: a [[id:20230715T173535.681936][Data Structure]] that stores values indexed by hashed keys.
- Hash Function: the mapping mechanism that turns a key into its hashed value.
* Consistent Hashing
:PROPERTIES:
:ID: 20240519T215504.815957
:END:

- A technique used in [[id:a3d0278d-d7b7-47d8-956d-838b79396da7][distributed systems]] to distribute data across multiple nodes (servers) efficiently.
- Unlike traditional hashing, where a change in the number of nodes can disrupt the entire data distribution, consistent hashing minimizes the impact of adding or removing nodes.

- Resources:
  - https://www.toptal.com/big-data/consistent-hashing
  - https://en.wikipedia.org/wiki/Consistent_hashing
* Working Mechanism

1. The Hash Ring: Imagine a circular ring (the "hash ring"). Both data items and nodes are assigned positions on this ring using a hash function.

2. Data Placement: Each data item is mapped to the first node encountered clockwise on the ring from its hash position. This node is responsible for storing and serving that data item.

3. Node Addition: When a new node is added, its position on the ring is determined by hashing its identifier. Only the data items that previously mapped to the next clockwise node are remapped to the new node.

4. Node Removal: When a node is removed, the data items it was responsible for are remapped to the next available node in the clockwise direction.
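The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation; the names (=ConsistentHashRing=, =get_node=) are invented for the example, MD5 stands in for any uniformly distributing hash, and each node is placed at several "virtual" positions as discussed under Key Considerations below:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Map a string to a point on the ring (MD5 used only for its uniformity).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes=(), vnodes=3):
        # Each physical node occupies `vnodes` positions ("virtual nodes").
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node: str):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get_node(self, key: str) -> str:
        # First node clockwise from the key's position, wrapping at the end.
        idx = bisect.bisect(self._ring, (_hash(key), ""))
        if idx == len(self._ring):
            idx = 0
        return self._ring[idx][1]
```

Adding a node only captures the keys that now fall just before its positions on the ring; every other key keeps its old owner, which is the "minimal remapping" property.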
* Benefits

- Minimizes Remapping: When nodes are added or removed, only a small fraction of the data needs to be remapped, reducing disruption to the system.

- Balanced Load Distribution: Data is distributed evenly across nodes, preventing any single node from becoming a bottleneck.

- Scalability: Easily scales horizontally by adding more nodes to the ring.

* Use Cases

- [[id:a3d0278d-d7b7-47d8-956d-838b79396da7][Distributed]] [[id:c8a3e246-0f29-4909-ab48-0d34802451d5][Caching]]: Used in systems like Memcached and Redis to distribute cache data across multiple servers.

- [[id:0d7c2dea-a250-4380-b826-ad4d2547d8d6][Load Balancing]]: Used to distribute requests across multiple web servers or application servers.

- Distributed Hash Tables (DHTs): Used in [[id:20240519T201442.376294][peer-to-peer]] systems to locate data stored across a network of nodes.

* Key Considerations

- Hash Function Choice: Choosing a good hash function is crucial for ensuring even data distribution.

- Virtual Nodes: To further improve load balancing, each physical node can be represented by multiple "virtual nodes" on the hash ring.

- Data Replication: Consistent hashing can be combined with replication strategies to ensure data availability even if nodes fail.
:PROPERTIES:
:ID: 20240519T213730.807988
:END:
#+title: SS Table
#+filetags: :data:cs:

- SSTable, short for Sorted String Table, is a simple yet powerful data structure used in many storage systems, including NoSQL databases like Cassandra and LevelDB. It is a fundamental building block for efficient data storage and retrieval.

- A versatile and efficient [[id:20230715T173535.681936][data structure]] that plays a crucial role in many storage systems, particularly those that require high performance and scalability.
* Key Characteristics

- Sorted: Data within an SSTable is sorted by key, making it easy to find specific records using binary search.

- Immutable: Once created, an SSTable cannot be modified. This immutability simplifies concurrency control and allows for efficient merging of multiple SSTables.

- Persistent: SSTables are typically stored on disk for long-term persistence.

* Structure

An SSTable consists of several components:

- Data Block: Contains a sequence of key-value pairs sorted by key.

- Index Block: Stores the offset of the first key in each data block, enabling quick lookup of the data block containing a specific key.

- [[id:20240519T214118.461723][Bloom Filter]]: A probabilistic data structure that helps determine whether a key might exist in the SSTable, avoiding unnecessary disk reads.
* Working Mechanism

1. Data Ingestion: Incoming key-value pairs are written to a memory-based data structure (e.g., the Memtable in Cassandra).

2. Flushing to Disk: When the Memtable reaches a certain size, it is flushed to disk as an SSTable.

3. Compaction: Multiple SSTables are merged periodically to remove duplicates and deleted data, improving read performance.

4. Read Operations: When a key is requested, the index block is searched to find the relevant data block, which is then read from disk. The Bloom filter is consulted first to quickly determine whether the key might be present before reading the data block.
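A toy version of this write/read path: the memtable is a plain dict, each flush produces an immutable sorted run looked up by binary search, and reads check the memtable first and then the newest runs. The names (=SSTable=, =TinyStore=) are invented for the sketch; real systems add index blocks, Bloom filters, and compaction:

```python
import bisect

class SSTable:
    """Immutable sorted run of key-value pairs with binary-search lookups."""
    def __init__(self, items: dict):
        self.keys = sorted(items)
        self.values = [items[k] for k in self.keys]

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

class TinyStore:
    """Memtable plus a list of SSTables; the newest entry for a key wins."""
    def __init__(self, flush_at=4):
        self.memtable, self.sstables, self.flush_at = {}, [], flush_at

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_at:
            # Flush the memtable to an immutable sorted run.
            self.sstables.append(SSTable(self.memtable))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest run first
            v = table.get(key)
            if v is not None:
                return v
        return None
```

Because runs are immutable, updating a key simply writes a newer version; compaction (not shown) would later merge the runs and discard the shadowed values.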
* Advantages

- Efficient Reads: The sorted structure enables fast lookups using binary search.

- Efficient Writes: Writes are sequential and batched, minimizing disk seeks.

- Space Efficiency: Compaction eliminates duplicate and deleted data.

- Scalability: Can be easily distributed across multiple nodes for large datasets.

* Use Cases

- NoSQL Databases: Cassandra, LevelDB, and RocksDB rely heavily on SSTables for data storage.

- Key-Value Stores: Many key-value stores use SSTables as their underlying storage format.

- Search Engines: SSTables can be used to store inverted indexes for fast search operations.
:PROPERTIES:
:ID: 20240519T214118.461723
:END:
#+title: Bloom Filter
#+filetags: :cs:data:

- A space-efficient [[id:91b6fb5d-6447-43fe-8412-2054bb79979a][probabilistic]] [[id:20230715T173535.681936][data structure]] designed to quickly determine if an element is a member of a [[id:c1a12380-9aad-4969-8b6a-cfceebfa984f][set]].
- A handy tool when you need to check whether something exists without searching the entire set of elements.
- Valuable for applications that need fast membership tests and can tolerate a small chance of false positives.
- Its space efficiency makes it particularly useful for large datasets.
* Working Mechanism

1. Bit Array: A Bloom filter starts with an empty bit array (a sequence of 0s) of a fixed size.

2. Hash Functions: It uses multiple hash functions to map elements to positions in the bit array.

3. Adding Elements:
   - When you add an element, it is hashed with each of the hash functions.
   - The bits at the resulting positions in the array are set to 1.

4. Checking for Membership:
   - To check if an element is in the set, hash it with each of the hash functions.
   - If all the resulting positions in the array contain 1s, the element is probably in the set.
   - If any of the positions contains a 0, the element is definitely not in the set.
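The mechanism above fits in a short Python sketch. The class name and the use of salted SHA-256 digests as the "multiple hash functions" are choices made for this example, not a canonical implementation:

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)  # one byte per bit, for simplicity

    def _positions(self, item: str):
        # Derive `hashes` positions by salting one hash with an index.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item: str) -> bool:
        # All bits set => probably present; any bit clear => definitely absent.
        return all(self.bits[p] for p in self._positions(item))
```

Sizing the bit array and the number of hash functions against the expected element count is what controls the false-positive rate.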
* Key Properties

- Probabilistic: Bloom filters can have false positives (incorrectly saying an element is present) but never false negatives (incorrectly saying an element is absent).

- Space-Efficient: They use much less space than storing the entire set of elements, especially for large sets.

- Fast: Lookups are extremely fast, as they only involve computing hash functions and checking a few bits.

* Use Cases

- [[id:c8a3e246-0f29-4909-ab48-0d34802451d5][Cache]] Filtering: Check if a requested item is in the cache before fetching it from slower storage.

- Duplicate Detection: Quickly identify duplicate items in a stream of data.

- Spell Checkers: Check if a word is misspelled by comparing it to a dictionary represented as a Bloom filter.

- [[id:a4e712e1-a233-4173-91fa-4e145bd68769][Network]] Routers: Filter out known malicious IP addresses to protect against attacks.

* Limitations

- False Positives: Bloom filters have a small probability of false positives, which can be controlled by adjusting their size and the number of hash functions.

- Cannot Delete: Removing elements is not directly supported because a single bit might correspond to multiple elements.
:PROPERTIES:
:ID: 20240519T221608.054348
:END:
#+title: Eventual Consistency
#+filetags: :cs:

* Gossip Protocols
:PROPERTIES:
:ID: 20240519T222301.158791
:END:

- Also known as epidemic protocols.
- Communication mechanisms used in distributed systems where nodes share information with each other in a decentralized manner.
- They mimic the way gossip spreads in social networks, where individuals share news with their friends, who then share it with their friends, and so on.
** Working Mechanism

1. Random Peer Selection: Each node periodically selects a random subset of its peers (other nodes it is connected to) and initiates communication.

2. Information Exchange: The nodes exchange information about their state, including data, updates, or events they have observed.

3. Propagation: The received information is then shared with other randomly selected peers, gradually disseminating throughout the network.
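This propagation can be simulated in a few lines. The sketch below models push gossip only, with an arbitrary fanout of one random peer per informed node per round (all parameter choices are for illustration):

```python
import random

def push_gossip_rounds(n_nodes=50, seed=0):
    """Count rounds until a rumor started at node 0 reaches all nodes,
    with every informed node pushing to one random peer each round."""
    rng = random.Random(seed)
    informed = {0}              # node 0 starts with the rumor
    rounds = 0
    while len(informed) < n_nodes:
        for node in list(informed):
            informed.add(rng.randrange(n_nodes))  # push to a random peer
        rounds += 1
    return rounds
```

Because the informed set can at most double each round, full dissemination takes at least log2(n) rounds, and in expectation only O(log n), which is why gossip scales to large clusters.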
** Key Features

- [[id:3c0c2077-b24a-4f6b-b93f-f06c08f7b3e9][Decentralized]]: No central coordinator or leader controls the communication.

- Scalable: Works well in large-scale systems with thousands of nodes.

- Robust: Tolerant to node failures and network partitions.

- Eventual Consistency: Information eventually reaches all nodes, but there is no guarantee on how long it will take.

** Types of Gossip Protocols

- Push Gossip: A node actively pushes its information to randomly selected peers.

- Pull Gossip: A node requests information from randomly selected peers.

- Push-Pull Gossip: A combination of both, where nodes both push and pull information.

** Use Cases

- [[id:20240519T222806.511836][Failure Detection]]: Nodes can gossip about their health status, allowing the system to detect failures quickly.

- Data Dissemination: Used to spread data updates or events across the network.

- Peer Sampling: Nodes can discover other nodes in the network by gossiping about their neighbors.

- Aggregate Computation: Nodes can compute aggregates (e.g., average, sum) by gossiping partial results.

** Advantages

- Scalability: Handles large networks with thousands of nodes efficiently.

- [[id:20240519T162542.805560][Fault Tolerance]]: Can withstand node failures and network partitions.

- Simplicity: Relatively simple to implement and understand.

- Low Overhead: Does not require a central coordinator, reducing coordination overhead.

** Disadvantages

- Eventual Consistency: Not suitable for applications requiring strong consistency.

- Latency: Can take some time for information to propagate to all nodes.

- Redundant Messages: Can result in redundant messages being sent due to the random nature of peer selection.

* Hinted Handoff
:PROPERTIES:
:ID: 20240519T221942.851343
:END:
- Helps ensure that data updates eventually reach all replicas, even when some nodes are temporarily unavailable.

- A key mechanism in [[id:20240519T221905.005300][Cassandra]] that helps bridge the gap between [[id:20240519T152842.050227][availability and eventual consistency]].

- By temporarily storing data updates for unavailable replicas, it ensures that writes are not lost and that all replicas eventually converge to the same state.

- This makes Cassandra a robust and reliable choice for applications that prioritize availability and can tolerate eventual consistency.
** How Hinted Handoff Works

1. Write Request: When a write request is sent to a Cassandra node (the coordinator), it forwards the request to the replicas responsible for storing that data.

2. Unavailable Replica: If one or more replicas are unavailable (e.g., due to network issues or maintenance), the coordinator cannot immediately write the data to them.

3. Hint Creation: Instead of failing the write, the coordinator stores a "hint" locally. This hint contains the data that needs to be written and the address of the unavailable replica.

4. Handoff: When the unavailable replica comes back online, it contacts the coordinator and requests any hints that were stored for it.

5. Hint Replay: The coordinator sends the stored hints to the replica, which then applies the missed writes, eventually catching up with the rest of the cluster.
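The flow above can be condensed into a toy model. The class names (=Replica=, =Coordinator=) are invented for the sketch; real Cassandra persists hints on disk with a configurable lifetime rather than keeping them in memory:

```python
class Replica:
    def __init__(self, name):
        self.name, self.up, self.data = name, True, {}

class Coordinator:
    """Write path with hinted handoff: a write destined for a downed
    replica is kept as a local hint and replayed when it recovers."""
    def __init__(self, replicas):
        self.replicas = {r.name: r for r in replicas}
        self.hints = {r.name: [] for r in replicas}  # replica -> missed writes

    def write(self, key, value):
        for r in self.replicas.values():
            if r.up:
                r.data[key] = value
            else:
                # Store a hint instead of failing the write.
                self.hints[r.name].append((key, value))

    def recover(self, name):
        r = self.replicas[name]
        r.up = True
        for key, value in self.hints[name]:  # replay the missed writes
            r.data[key] = value
        self.hints[name] = []
```

After =recover=, the previously downed replica has converged to the same state as its peers, which is the eventual-consistency guarantee hinted handoff provides.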
** Benefits

- Increased Write Availability: Even if some replicas are down, writes can still succeed as long as a quorum of replicas is available.

- Eventual Consistency: Hinted handoff ensures that all replicas eventually receive the updates, maintaining data consistency over time.

- Reduced Client Retries: Clients do not need to constantly retry failed writes, since the hints will be replayed automatically.

** Key Considerations

- Hint Lifetime: Hints are not stored indefinitely. They have a configurable lifetime, after which they are discarded if the replica remains unavailable.

- Hint Storage: Hints are typically stored on disk, which can impact disk usage if a node is down for an extended period.

- Handoff Overhead: Replaying hints adds some overhead to the system, but this is usually a minor cost compared to the benefits of improved availability and consistency.
:PROPERTIES:
:ID: 20240519T221905.005300
:END:
#+title: Cassandra
#+filetags: :cs:data:

- A [[id:a3d0278d-d7b7-47d8-956d-838b79396da7][distributed]] [[id:2f67eca9-5076-4895-828f-de3655444ee2][database]] (a key-value / wide-column store).
- Also see [[id:20240519T221942.851343][Hinted Handoff]].