fallocate: How to quickly create / convert unwritten extents into written extents? #136
Comments
There are some out-of-tree patches floating around that add support for FALLOC_FL_NO_HIDE_STALE, which allows you to fallocate extents with the initialized flag set, but without clearing the blocks first. It is very handy if you are doing benchmarks, or when you implicitly trust the userspace application. For example, if you have an object storage and retrieval system which is used as part of a cluster file system, and you know that data is always stored at rest encrypted with a per-user key, then stale data might not be an issue, and being able to fallocate space that is marked as initialized might be really handy.

Unfortunately, such a scheme can be an "attractive nuisance", and the file system engineers who worked at large distributions were very much against such a patch being accepted upstream, because essentially, they didn't trust their customers not to abuse such a feature, since it can make such a huge difference from a performance perspective. But then if there were a massive security exposure, it would be a support nightmare for their help desk, and it might damage their (and Linux's) reputation. So in the face of very strong opposition, this patch never went upstream. However, some hyperscale computing companies have essentially re-invented the patch. For example, here is one from TaoBao (the Chinese equivalent of Ebay)[1].

[1] https://lore.kernel.org/all/[email protected]/

The patch that we had been using internally at Google had a mount option which allowed us to specify a group-id which a process had to be a member of in order to use the FALLOC_FL_NO_HIDE flag. So this was more secure than the TaoBao patch, but admittedly, specifying a group id on the mount option was a bit of a hack. This also pre-dated userid and groupid namespaces, so the scheme isn't compatible with modern container systems like Docker. It worked just fine for our internal Borg (an earlier version of what evolved to become Kubernetes) system, though.

But, yes, if you are willing to unmount the file system, it wouldn't be hard to write a program using libext2fs to iterate through all of the extents of all inodes and clear the flag. If you just need it for a specialized benchmarking exercise, and all of the files are the same size, you might be able to experiment with this mke2fs feature. Put the following in the [fs_types] section:
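(A rough sketch of such a stanza, based on the hugefiles options documented in mke2fs.conf(5); the stanza name, directory, and size/alignment values below are illustrative, not prescriptive:)

```
[fs_types]
	hugefiles = {
		# Pre-create the files at mke2fs time.
		make_hugefiles = true
		hugefiles_dir = /storage
		hugefiles_name = chunk-
		hugefiles_digits = 5
		hugefiles_size = 4G
		hugefiles_align = 256M
		hugefiles_align_disk = true
		# 0 means: create as many files as will fit.
		num_hugefiles = 0
		# Do not write zeros into the files while creating them.
		zero_hugefiles = false
	}
```

The key point is zero_hugefiles = false: at mkfs time there is no prior data on a freshly created file system to leak, so the files can be created with their extents already marked as written without writing any data.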
Then, when you run mke2fs and select that file system type (via the -T option), the files are created at format time. For your use case, you might want to set hugefiles_size to 176k, and drop the hugefiles_align and hugefiles_align_disk lines. All of the details of how this works can be found in the mke2fs.conf(5) man page.

We've since found that the make_hugefiles feature in mke2fs is sufficient for our needs, so we've actually stopped maintaining the FALLOC_FL_NO_HIDE_STALE out-of-tree patch in our kernels.
Thanks @tytso for the insights. In my opinion, to mitigate the security risk we should approach this from the mount options: only whoever has full read/write access to the disk (or disk image file) should be able to grant this permission. For container systems, although inside a container anyone can have their own user id and group id (one can even easily become root), performing such an fallocate is a different matter. Even with my current brute-force solution (going through all extents and marking them written), I was only able to do it because I had full read/write access to the underlying disk.
After some investigation I stumbled upon this from the ... Also, I would like to mention another problem that I faced (although it is out of the scope of this issue). When I applied the brute-force approach I used a multi-threaded (beyond hardware concurrency) approach, since different extent headers are written at different locations on the disk. Unfortunately I found that it corrupts the file system, so I had to implement some kind of synchronization to prevent parallel write operations to the fs.

On GCE VMs, PDs (both SSD and Extreme PD) are poor at sequential IO; they work best when multiple IO operations are issued in bulk. Marking extents one by one is a really slow process (multi-threading helped for reads but not for writes), but I must admit it is still way faster than filling each file with data. Btw, after a perfect case of file fragmentation, the above-mentioned system is struggling a lot (so I got some of my reputation back 😜).
In terms of upstream support for FALLOC_FL_NO_HIDE, it's not so much a security risk as a political issue. Fundamentally, the enterprise Linux distributions don't trust their customers, and don't think they could withstand the customer demand for enabling such a feature, even if it were under an #ifdef; and if it were misused by people who don't know what they are doing, it could lead to stale data exposure and bad press when people's social security numbers are leaked to the world. Ultimately, it's about the enterprise distro folks thinking they know better than their users. (And having worked at a large big blue company that catered to this part of the market segment, I can understand their concern.) So the best I could do was to reserve the codepoint for FALLOC_FL_NO_HIDE, and the patches to enable it were left as an out-of-tree patch. To be frank, I haven't updated that patchset in a while, since we no longer need it at $WORK: we now use a raw block device for our cluster file system, instead of using ext4 as our back-end object store.

As far as support for parallelism in libext2fs, there is the beginning of some parallelism support in unix_io, but it's optimized for reads (the use case is parallel fsck, and most of the time that can be done as a read-only workload). If you are only updating the extent headers in place, and not making any other changes, it should be possible to do that safely, since you would be updating each block in place. If you do anything else, there is a lot of caching that takes place in the ext2_filsys file handle, and that's going to be problematic. But if you have multiple threads sharing the ext2_filsys file handle, and each thread only operates on a single inode and only updates the extent header in place, it should be OK. It's definitely tricky, though, since there is no locking on the various shared data structures. We do read the block bitmaps in parallel --- see lib/ext2fs/rw_bitmaps.c in the latest e2fsprogs sources --- but that's really about it for now.
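For reference, a rough, untested single-threaded sketch of the offline approach discussed above: walk every inode with libext2fs and clear the uninitialized flag on each leaf extent. The file system must be unmounted, error handling is abbreviated, and the device path is a placeholder.

```c
/*
 * Sketch: offline conversion of unwritten extents to written ones using
 * libext2fs.  Run only on an unmounted file system.  Untested.
 */
#include <stdio.h>
#include <ext2fs/ext2fs.h>

int main(int argc, char **argv)
{
	const char *device = argc > 1 ? argv[1] : "/dev/sdXX"; /* placeholder */
	ext2_filsys fs;
	ext2_inode_scan scan;
	ext2_ino_t ino;
	struct ext2_inode inode;
	errcode_t err;

	err = ext2fs_open(device, EXT2_FLAG_RW, 0, 0, unix_io_manager, &fs);
	if (err) {
		fprintf(stderr, "ext2fs_open(%s) failed: %ld\n", device, (long) err);
		return 1;
	}

	err = ext2fs_open_inode_scan(fs, 0, &scan);
	if (err)
		goto out;

	while (ext2fs_get_next_inode(scan, &ino, &inode) == 0 && ino != 0) {
		ext2_extent_handle_t handle;
		struct ext2fs_extent extent;

		/* Only in-use, regular, extent-mapped inodes are interesting. */
		if (inode.i_links_count == 0 ||
		    !LINUX_S_ISREG(inode.i_mode) ||
		    !(inode.i_flags & EXT4_EXTENTS_FL))
			continue;

		if (ext2fs_extent_open(fs, ino, &handle))
			continue;

		/* Depth-first walk of the extent tree; only leaves carry data. */
		err = ext2fs_extent_get(handle, EXT2_EXTENT_ROOT, &extent);
		while (err == 0) {
			if ((extent.e_flags & EXT2_EXTENT_FLAGS_LEAF) &&
			    (extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT)) {
				/* Mark the extent as written, in place. */
				extent.e_flags &= ~EXT2_EXTENT_FLAGS_UNINIT;
				ext2fs_extent_replace(handle, 0, &extent);
			}
			err = ext2fs_extent_get(handle, EXT2_EXTENT_NEXT, &extent);
		}
		ext2fs_extent_free(handle);
	}

	ext2fs_close_inode_scan(scan);
out:
	ext2fs_close(fs);
	return 0;
}
```

This is essentially the single-inode, in-place update pattern described above; it stays single-threaded precisely because the shared ext2_filsys handle has no locking.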
Currently if we create files using
fallocate -l <size> <file_path>
it will create the file, but with unwritten extents. This is better than "holes", where no physical blocks are allocated at all, but it is still problematic in the sense that when the file is later read, the physical sectors are never actually read.

Currently I'm doing some benchmarks on a 302 TiB disk, with ~276 KiB sized files. I need to fill the disk quickly without writing any data to those micro files. The underlying disk plays a big role here: at the time of the benchmark, block IO read operations must actually happen. Unfortunately, the ext4 driver in the Linux kernel will return zeros instead of reading the physical sectors.
I found EXT2_FALLOCATE_FORCE_INIT, but the kernel's fallocate interface doesn't support that flag. Also, will it be fine if I traverse through all inodes, then through their extents, and remove the uninitialized flag from each extent (e_flags)?

P.S. In my use case security is not a concern; consider the user-space program to be running with sudo. In a regular environment this could be a little worrying, since previously written data (unlinked but not shredded) can be read using this trick.