Commit
Showing 3 changed files with 14 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
<!doctype html><html lang=en><head><meta charset=utf-8><meta name=viewport content="width=device-width,initial-scale=1"><meta name=description content="A repository for my thoughts"><link rel=apple-touch-icon sizes=180x180 href=/apple-touch-icon.png><link rel=icon type=image/png sizes=32x32 href=/favicon-32x32.png><link rel=icon type=image/png sizes=16x16 href=/favicon-16x16.png><link rel=icon type=image/x-icon href=/favicon.ico><link rel=stylesheet href=/css/style.min.css><meta name=title property="og:title" content="through the looking glass of block layers | Haile (ሐይሌ)"><meta name=twitter:card content="summary"><meta name=twitter:title content="through the looking glass of block layers | Haile (ሐይሌ)"><meta name=description content><meta property="og:description" content><meta name=twitter:description content><meta name=twitter:creator content="@hailelagi"><title>through the looking glass of block layers</title></head><body><header id=banner><h2><a href=https://www.hailelagi.com/>Haile (ሐይሌ)</a></h2><nav><ul><li><a href=/bookshelf title=bookshelf>bookshelf</a></li><li><a href=https://www.github.com/hailelagi title=github>github</a></li><li><a href=/notes title=writing>writing</a></li></ul></nav></header><main id=content><article><header id=post-header><h1>through the looking glass of block layers</h1><time>December 6, 2024</time><meta name=twitter:card content="summary"> | ||
<meta property="og:url" content="https://www.hailelagi.com/"><meta property="og:image" content="/favicon-32x32.png"><meta itemprop=image content="/favicon-32x32.png"><meta name=twitter:image content="/favicon-32x32.png"><meta name=twitter:image:src content="/favicon-32x32.png"></header><p>The modern computing/data infrastructure is <a href=https://landscape.cncf.io/>vast and interesting</a>. | ||
Let’s explore a tiny slice of it: what happens when you read or write some data <strong>persistently</strong> on a modern cloud provider? | ||
Let’s conceptually trace our way up the block layers and see where data goes by writing a filesystem.</p><div class=callout-warning>💡 | ||
All problems in comp sci. can be solved by another level of indirection.</div><p>Why a filesystem? It’s <strong>a key abstraction</strong> we’ll use to go spelunking into the lifecycle of a block destined for persistence. Along the way we’ll explore ideas from more sophisticated filesystems like xfs, zfs and ext4, discuss key tradeoffs, and at the end look at some practical implications on Kubernetes! Like all abstractions, we begin not by looking at the implementation but at the <em>interfaces</em>.</p><h2 id=physical-layer>Physical Layer</h2><p>At the bottom, there must exist some <em>physical media</em> which will hold these bits and bytes we conveniently call a block. It could be an HDD, SSD, <a href=https://aws.amazon.com/storagegateway/vtl/>tape</a> or something else. <a href=https://pages.cs.wisc.edu/~remzi/OSTEP/file-devices.pdf>What interface does this physical media present?</a> It’s exposed over many <em>protocols</em>.</p><svg viewBox="0 0 350 430" style="font-family:Arial,sans-serif"><line x1="125" y1="30" x2="125" y2="400" stroke="gray" stroke-width="2" marker-end="url(#arrowhead)"/><rect x="25" y="20" width="200" height="50" fill="#e6f2ff" stroke="#000" rx="5"/><text x="125" y="40" text-anchor="middle" font-size="10" font-weight="bold">Application Process</text><text x="125" y="52" text-anchor="middle" font-size="8">(read/write)</text><rect x="25" y="90" width="200" height="50" fill="#cce5ff" stroke="#000" rx="5"/><text x="125" y="110" text-anchor="middle" font-size="10" font-weight="bold">POSIX</text><text x="125" y="122" text-anchor="middle" font-size="8">(open, read, write, close)</text><rect x="25" y="160" width="200" height="50" fill="#b3d9ff" stroke="#000" rx="5"/><text x="125" y="180" text-anchor="middle" font-size="10" font-weight="bold">Filesystem</text><text x="125" y="192" text-anchor="middle" font-size="8">(files and directories) &lt;- we're here!!</text><rect x="25" y="230" width="200" height="50" fill="#9cf" stroke="#000" 
rx="5"/><text x="125" y="250" text-anchor="middle" font-size="10" font-weight="bold">Block Interface</text><text x="125" y="262" text-anchor="middle" font-size="8">(read/write)</text><rect x="25" y="300" width="200" height="50" fill="#80bfff" stroke="#000" rx="5"/><text x="125" y="320" text-anchor="middle" font-size="10" font-weight="bold">Device Drivers</text><text x="125" y="332" text-anchor="middle" font-size="8">(specific read/write)</text><rect x="25" y="370" width="200" height="50" fill="#66b3ff" stroke="#000" rx="5"/><text x="125" y="390" text-anchor="middle" font-size="10" font-weight="bold">Physical Media</text><text x="125" y="402" text-anchor="middle" font-size="8">(HDD/SSD - sector/page r/w)</text><defs><marker id="arrowhead" markerWidth="10" markerHeight="7" refX="0" refY="3.5" orient="auto"><polygon points="0 0, 10 3.5, 0 7" fill="gray"/></marker></defs></svg> | ||
This is a rough sketch made for simplicity.<p>An HDD exposes a “flat” address space to read or write; the smallest atomic unit is a sector (traditionally a 512-byte block). Flash-based | ||
SSDs expose a unit called a “page”, which we can read or write higher-level “chunks” of. [†1] What does it look like to create a <em>file system abstraction</em> over this <strong>block interface</strong>?</p><p>There are quite a few flavors; a few highlights for Linux:</p><ol><li><a href=https://linux-kernel-labs.github.io/refs/heads/master/labs/block_device_drivers.html#overview>The internal Kernel Block Device Layer</a></li><li><a href=https://spdk.io/doc/ublk.html>ublk</a></li><li><a href=https://www.kernel.org/doc/html/next/filesystems/fuse.html>FUSE</a></li><li><a href=https://libvirt.org/storage.html>libvirt</a></li></ol><p>As it turns out, a filesystem is historically a sub-component of the operating system! However, there are all these interesting <em>use cases</em> for writing all sorts of different <em>kinds of filesystems</em> which make different <em>design decisions</em> at different layers. Wouldn’t it be nice to not brick yourself mounting some random filesystem I made? How about an <em>EC2 instance</em>, or a Docker container? Now that <em>virtualisation</em> technology is ubiquitous, how does that change the interface? Anyway, I’m picking FUSE (filesystem in userspace). Back up to filesystems!</p><h3 id=a-file-system>A File system</h3><p>An interface/sub-system that manages blocks and block devices on disk via abstractions, and provides files and directories. | ||
One layout could be:</p><pre tabindex=0><code>++++++++++++++++++++++++++++++++++++++++++ | ||
+ superblock + inode-table + user data! + | ||
++++++++++++++++++++++++++++++++++++++++++ | ||
</code></pre><p>Data structures:</p><ol><li>The file (index node, or inode)</li><li>The directory (self <code>.</code>, parent <code>..</code>, etc.)</li><li>Access methods: open(), read(), write(), fstat(), etc.</li><li>Superblock: metadata about other metadata (inode count, fs version, etc.)</li></ol><h2 id=design-choicestradeoffs>Design choices/tradeoffs</h2><ul><li>Tree vs array</li><li>Bitmap index vs free list vs B-tree</li><li>Indexing non-contiguous layout (pointers vs extents)</li><li>Static vs dynamic partitioning</li><li>Block size</li></ul><h3 id=problems>Problems</h3><ul><li>Latent sector errors</li><li>Misdirected IO</li><li>Disk corruption (physical media: heat, etc.)</li><li>Fragmentation</li></ul><h3 id=disk-io-schedulingschedulers>Disk IO scheduling/schedulers</h3><ul><li>SSTF (shortest seek time first)</li><li>NBF (nearest block first)</li><li>SCAN vs C-SCAN (elevator algorithm)</li><li>SPTF (shortest positioning time first)</li></ul><p>Linux: <a href=https://wiki.ubuntu.com/Kernel/Reference/IOSchedulers>https://wiki.ubuntu.com/Kernel/Reference/IOSchedulers</a></p><h3 id=raid>RAID</h3><p>Transparently maps logical IO to physical IO for fault tolerance (under a fail-stop model) and performance.</p><ul><li>striping</li><li>mirroring</li><li>parity</li></ul><h2 id=references--notes>References &amp; Notes</h2><p>[†1]: Although the smallest unit of flash is actually a cell, and a write or erase may touch the whole block, for simplicity and rough equivalence sectors and pages are equated here.</p></article></main><footer id=footer><a href=https://github.com/hailelagi/blog>Copyright © 2024 Haile Lagi</a><div><span>private inquiries: hailelagi[at]gmail.com</span></div><div><span>or informally (twitter/x): https://x.com/haile_lagi</span></div></footer></body></html> |