
fallocate: How to quickly create / convert unwritten extents into written extents? #136

pPanda-beta opened this issue Mar 18, 2023 · 3 comments


@pPanda-beta

Currently, creating a file with fallocate -l <size> <file_path> allocates it with unwritten extents. This is better than "holes", where no physical blocks are allocated at all. But it is still problematic in the sense that when the file is read, the physical sectors are never actually read.

Currently I'm doing some benchmarks on a 302 TiB disk with ~276 KiB files. I need to fill the disk quickly without writing any data into those micro files. The underlying disk plays a big role here: at benchmark time, block I/O reads must actually happen. Unfortunately, the ext4 driver in the Linux kernel will return zeros instead of reading the physical sectors.

I found EXT2_FALLOCATE_FORCE_INIT in libext2fs, but the fallocate(2) system call doesn't support that flag.

Also, would it be fine if I traverse all inodes, then their extents, and remove the flag from extent.e_pblk?

P.S. In my use case security is not a concern; assume the user-space program is running with sudo. In a regular environment this can be a little worrying, since previously written data (unlinked but not shredded) can be read using this trick.
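
For reference, EXT2_FALLOCATE_FORCE_INIT can be exercised offline through libext2fs against an unmounted filesystem. A minimal sketch, assuming the target inode already exists; the device path, inode number, and block count are placeholders, and EXT2_FALLOCATE_INIT_BEYOND_EOF may also be required since the blocks are allocated past the current i_size:

```c
/* Minimal sketch: allocate already-initialized extents for an existing inode
 * on an unmounted filesystem.  Build with: cc demo.c -lext2fs -lcom_err */
#include <ext2fs/ext2fs.h>
#include <stdio.h>

int main(void)
{
	ext2_filsys fs;
	struct ext2_inode inode;
	ext2_ino_t ino = 12;           /* hypothetical target inode number */
	blk64_t nblocks = 69;          /* ~276 KiB with 4 KiB blocks */

	if (ext2fs_open("/dev/sdc", EXT2_FLAG_RW, 0, 0, unix_io_manager, &fs)) {
		fprintf(stderr, "cannot open filesystem\n");
		return 1;
	}
	if (ext2fs_read_inode(fs, ino, &inode))
		return 1;

	/* FORCE_INIT marks the new extents as written without zeroing them;
	 * INIT_BEYOND_EOF is likely needed too because the allocation lies
	 * beyond the file's current size. */
	if (ext2fs_fallocate(fs, EXT2_FALLOCATE_FORCE_INIT |
				 EXT2_FALLOCATE_INIT_BEYOND_EOF,
			     ino, &inode, 0 /* goal */,
			     0 /* first logical block */, nblocks))
		return 1;

	ext2fs_inode_size_set(fs, &inode, nblocks * fs->blocksize);
	ext2fs_write_inode(fs, ino, &inode);
	return ext2fs_close(fs) ? 1 : 0;   /* flushes dirty metadata */
}
```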

@tytso
Owner

tytso commented Mar 19, 2023

There are some out-of-tree patches floating around that add support for FALLOC_FL_NO_HIDE_STALE, which allows you to fallocate extents with the initialized flag set, but without clearing the blocks first. It is very handy if you are doing benchmarks, or when you implicitly trust the userspace application. For example, if you have an object storage and retrieval system which is used as part of a cluster file system, and you know that data is always stored encrypted at rest with a per-user key, then stale data might not be an issue, and so being able to fallocate space that is marked as initialized might be really handy.

Unfortunately, the problem is that having such a scheme can be an "attractive nuisance", and the file system engineers who worked at large distributions were very much against such a patch being accepted upstream, because essentially they didn't trust their customers not to abuse such a feature, since it can make such a huge difference from a performance perspective. But then, if there were a massive security exposure, it would be a support nightmare for their help desk, and it might damage their (and Linux's) reputation. So in the face of very strong opposition, this patch never went upstream. However, some hyperscale computing companies have essentially re-invented the patch. For example, here is one from TaoBao (the Chinese equivalent of eBay) [1].

[1] https://lore.kernel.org/all/[email protected]/

The patch that we had been using internally at Google had a mount option which allowed us to specify a group-id which a process had to be a member of in order to use the FALLOC_FL_NO_HIDE flag. So this was more secure than the TaoBao patch, but admittedly, specifying a group id on the mount option was a bit of a hack. This also pre-dated userid and groupid namespaces, so this scheme isn't compatible with modern container systems like Docker. It worked just fine for our internal Borg (an earlier version of what evolved to become Kubernetes) system, though.
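
For context, with one of those out-of-tree kernels the flag is just an extra fallocate(2) mode bit. FALLOC_FL_NO_HIDE_STALE exists in mainline headers only as a reserved codepoint, so on an unpatched kernel a call like the following fails with EOPNOTSUPP (path and size are placeholders):

```c
/* Sketch of how the out-of-tree flag would be used on a patched kernel. */
#define _GNU_SOURCE
#include <fcntl.h>      /* open(), fallocate() */
#include <stdio.h>
#include <unistd.h>

#ifndef FALLOC_FL_NO_HIDE_STALE
#define FALLOC_FL_NO_HIDE_STALE 0x04   /* reserved codepoint in mainline */
#endif

int main(void)
{
	int fd = open("/mnt/bench/file-00001", O_CREAT | O_WRONLY, 0644);
	if (fd < 0) { perror("open"); return 1; }

	/* Allocate 276 KiB of initialized-but-not-zeroed extents. */
	if (fallocate(fd, FALLOC_FL_NO_HIDE_STALE, 0, 276 * 1024) < 0)
		perror("fallocate");   /* EOPNOTSUPP on an unpatched kernel */

	close(fd);
	return 0;
}
```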

But, yes, if you are willing to unmount the file system, it wouldn't be hard to write a program using libext2fs to iterate through all of the extents of all inodes and clear the flag.
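
A minimal sketch of such a program, assuming the filesystem is unmounted and the files of interest are extent-mapped regular files (note that an initialized extent can cover at most 32768 blocks, so very long unwritten extents would first need to be split; that is not a concern for ~276 KiB files):

```c
/* Walk every in-use inode on an unmounted filesystem and clear the
 * "unwritten" flag on each leaf extent.
 * Build with: cc clear_uninit.c -o clear_uninit -lext2fs -lcom_err */
#include <ext2fs/ext2fs.h>
#include <stdio.h>

static void clear_uninit(ext2_filsys fs, ext2_ino_t ino)
{
	ext2_extent_handle_t handle;
	struct ext2fs_extent extent;
	errcode_t err;

	if (ext2fs_extent_open(fs, ino, &handle))
		return;
	err = ext2fs_extent_get(handle, EXT2_EXTENT_ROOT, &extent);
	while (!err) {
		if ((extent.e_flags & EXT2_EXTENT_FLAGS_LEAF) &&
		    (extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT)) {
			extent.e_flags &= ~EXT2_EXTENT_FLAGS_UNINIT;
			/* Rewrite this extent entry in place. */
			if (ext2fs_extent_replace(handle, 0, &extent))
				break;
		}
		err = ext2fs_extent_get(handle, EXT2_EXTENT_NEXT, &extent);
	}
	ext2fs_extent_free(handle);
}

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/sdc";
	ext2_filsys fs;
	ext2_inode_scan scan;
	struct ext2_inode inode;
	ext2_ino_t ino;

	if (ext2fs_open(dev, EXT2_FLAG_RW, 0, 0, unix_io_manager, &fs)) {
		fprintf(stderr, "cannot open %s\n", dev);
		return 1;
	}
	if (ext2fs_open_inode_scan(fs, 0, &scan)) {
		ext2fs_close(fs);
		return 1;
	}
	/* ext2fs_get_next_inode() reports end-of-scan by returning ino == 0. */
	while (!ext2fs_get_next_inode(scan, &ino, &inode) && ino) {
		if (ino < EXT2_FIRST_INODE(fs->super))
			continue;            /* skip reserved inodes */
		if (inode.i_links_count == 0 || !LINUX_S_ISREG(inode.i_mode))
			continue;            /* deleted or not a regular file */
		if (!(inode.i_flags & EXT4_EXTENTS_FL))
			continue;            /* not extent-mapped */
		clear_uninit(fs, ino);
	}
	ext2fs_close_inode_scan(scan);
	return ext2fs_close(fs) ? 1 : 0;     /* flushes remaining dirty metadata */
}
```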

If you just need it for a specialized benchmarking exercise, and all of the files are the same size, you might be able to experiment with this mke2fs feature. Put the following in the [fs_types] section of mke2fs.conf:

hugefiles = {
        features = extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize,^resize_inode,sparse_super2
        hash_alg = half_md4
        reserved_ratio = 0.0
        num_backup_sb = 0
        packed_meta_blocks = 1
        make_hugefiles = 1
        inode_ratio = 4194304
        hugefiles_dir = /storage
        hugefiles_name = chunk-
        hugefiles_digits = 5
        hugefiles_size = 4G
        hugefiles_align = 256M
        hugefiles_align_disk = true
        zero_hugefiles = false
        flex_bg_size = 262144
    }

Then, when you run the command mke2fs -T hugefiles /dev/sdc, this will create a file system on /dev/sdc which is filled with files of the form /storage/chunk-00001, /storage/chunk-00002, etc. Each of these files will be 4 GiB in size, and they will be aligned to a 256 MiB boundary.

For your use case, you might want to set hugefiles_size to 176k and drop the hugefiles_align and hugefiles_align_disk lines. All of the details of how this works can be found in the mke2fs.conf man page.

We've since found that the make_hugefiles feature in mke2fs is sufficient for our needs, so we've actually stopped maintaining the FALLOC_FL_NO_HIDE_STALE out-of-tree patch in our kernels.

@pPanda-beta
Author

Thanks @tytso for the insights.

In my opinion, to mitigate the security risk we should approach this through the mount options: only whoever has full read-write access to the disk (or disk image file) should be able to grant this permission.

For container systems, although inside a container anyone can have their own user id and group id (one can even easily become root), to perform such a fallocate (e.g. fallocate(fd, FALLOC_FL_NO_HIDE_STALE, 0, size)) the file system has to be mounted with certain options, and that capability mostly stays with the host OS. If a container is mounting a fs from a disk, that means the container has read/write access to the disk. In short, only whoever owns the disk can initiate such an operation.

Even with my current brute-force solution (walk all extents and mark them written), I was only able to do it because I had full read/write access to the underlying disk.

hugefiles: thanks for the suggestion, but in my case it may not work.
I want to benchmark a system where a certain disk setup is present, an ext4 fs sits on top of that, and another component reads the fs. So regular ext4 is a mandatory parameter for me.
I had a hunch that on a fragmented filesystem this entire system struggles and gives very low throughput, so I used fallocate to synthesise file fragmentation. But because of the kernel ext4 driver, unwritten extents were never read from the underlying media, and the speed was even better than in regular use cases. Yes, I got fooled.

After some investigation I stumbled upon this behaviour via the filefrag utility.

Also, I would like to mention another problem that I faced (although it is out of the scope of this issue). When I applied the brute-force approach, I used a multi-threaded (beyond hardware concurrency) approach, since different extent headers are written to different locations on the disk. Unfortunately, I found that it corrupts the file system, so I had to implement some kind of synchronization to prevent parallel write operations to the fs.
For that, I tried creating a new io_manager overriding the write functions of unix_io_manager, but it didn't work, since open_channel uses its own write functions.
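
One possible workaround, sketched below under the assumption that libext2fs dispatches block writes through the channel's manager pointer: instead of passing a custom io_manager to ext2fs_open (whose open routine installs its own manager), clone the manager that the open installed and swap in mutex-wrapped write hooks. This only serializes the block writes themselves; it does nothing for the in-memory caches inside a shared ext2_filsys handle, which is the caveat raised in the next comment.

```c
/* Hedged sketch: serialize block writes issued through a shared ext2_filsys
 * handle by wrapping the write hooks of whatever io_manager got installed. */
#include <ext2fs/ext2fs.h>
#include <pthread.h>

static pthread_mutex_t io_lock = PTHREAD_MUTEX_INITIALIZER;
static struct struct_io_manager locked_mgr;
static errcode_t (*real_write_blk)(io_channel, unsigned long, int, const void *);
static errcode_t (*real_write_blk64)(io_channel, unsigned long long, int, const void *);

static errcode_t locked_write_blk(io_channel ch, unsigned long blk,
				  int count, const void *data)
{
	pthread_mutex_lock(&io_lock);
	errcode_t err = real_write_blk(ch, blk, count, data);
	pthread_mutex_unlock(&io_lock);
	return err;
}

static errcode_t locked_write_blk64(io_channel ch, unsigned long long blk,
				    int count, const void *data)
{
	pthread_mutex_lock(&io_lock);
	errcode_t err = real_write_blk64(ch, blk, count, data);
	pthread_mutex_unlock(&io_lock);
	return err;
}

/* Call once, right after ext2fs_open() succeeds. */
void install_locked_writes(ext2_filsys fs)
{
	locked_mgr = *fs->io->manager;       /* copy the manager open() installed */
	real_write_blk = locked_mgr.write_blk;
	real_write_blk64 = locked_mgr.write_blk64;
	locked_mgr.write_blk = locked_write_blk;
	locked_mgr.write_blk64 = locked_write_blk64;
	fs->io->manager = &locked_mgr;       /* dispatch now hits the wrappers */
}
```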

On GCE VMs, PDs (both SSD and Extreme PD) are worse at sequential I/O operations; they work best when multiple I/O operations are issued in bulk. Marking extents one by one is a really slow process (multi-threading helped with reading but not writing). But I must admit this is still way faster than filling each file with data.

Btw, after a perfect case of file fragmentation, the above-mentioned system is struggling a lot (so I got some of my reputation back 😜).

@tytso
Owner

tytso commented Mar 29, 2023

In terms of upstream support for FALLOC_FL_NO_HIDE, it's not so much a security risk as a political issue. Fundamentally, the enterprise Linux distributions don't trust their customers, and don't think they can withstand the customer demand for enabling such a feature, even if it were under an #ifdef; and if it were misused by people who don't know what they are doing, it could lead to stale data exposure and bad press when people's social security numbers are leaked to the world. Ultimately, it's about the enterprise distro folks thinking they know better than their users. (And having worked at a large Big Blue company that catered to this part of the market segment, I can understand their concern.) So the best I could do was to reserve the codepoint for FALLOC_FL_NO_HIDE, and the patches to enable it were left out of tree. To be frank, I haven't updated that patchset in a while, since we no longer need it at $WORK, as we now use a raw block device for our cluster file system instead of using ext4 as our back-end object store.

As far as support for parallelism in libext2fs goes, there is the beginning of some parallelism support in unix_io, but it's optimized for reads (the use case is parallel fsck, and most of the time that can be done as a read-only workload). If you are only updating the extent headers in place, and not making any other changes, it should be possible to do that safely, since you would be updating each block in place. If you do anything else, there is a lot of caching that takes place in the ext2_filsys file handle, and that's going to be problematic. But if you have multiple threads sharing the ext2_filsys file handle, and each thread only operates on a single inode and only updates the extent header in place, it should be OK. It's definitely tricky, though, since there is no locking on the various shared data structures. We do read the block bitmaps in parallel (see lib/ext2fs/rw_bitmap.c in the latest e2fsprogs sources), but that's really about it for now.
