-
Notifications
You must be signed in to change notification settings - Fork 305
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
When exporting, use hardlinks for duplicated files #3060
Conversation
Hi @owtaylor. Thanks for your PR. I'm waiting for a ostreedev member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! This looks sane to me.
} | ||
else | ||
{ | ||
g_hash_table_insert (seen_checksums, (char *)checksum, g_object_ref (path)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be worth a comment like:
/* The checksum string is owned by the repo file object */
or so; I paused for a few seconds here thinking about the memory management.
tests/libtest.sh
Outdated
@@ -249,6 +249,9 @@ setup_test_repository () { | |||
mkdir baz/another/ | |||
echo x > baz/another/y | |||
|
|||
echo "SAME_CONTENT" > baz/duplicate_a | |||
echo "SAME_CONTENT" > baz/duplicate_b |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also add a test case for hardlinks across directories?
/ok-to-test |
991555d
to
7c98aed
Compare
Pushed a new version with clang-format applied. [EDIT: or maybe not. Will push a new version in a few minutes with your requested changes too] |
The other alternative would be to walk the tree twice; the first time to find hardlinked objects. That may be a good CPU/memory tradeoff. |
You mean, the first time, keep a checksum => seen count hash table, then use that to save only selected OstreeRepoFile objects on the second pass? The maximum savings over the "save the paths" approach is <average length of path * number of files in tree> - for a very large tree (million files), that might be around 100 MiB? I'm less sure about the savings compared to the "save the OstreRepoFiles" approach. I think I'd rather avoid the complexity and the chance of getting it wrong, but if you feel strongly, I'm happy to do it that way too. |
Definitely let's keep what you've tested, it's in the merge queue! |
Head branch was pushed to by a user without write access
7c98aed
to
cef1a98
Compare
For ostree_repo_export_tree_to_archive(), and 'ostree export', when the exported tree contains multiple files with the same checksum, write an archive with hard links. Without this, importing a tree, then exporting it again breaks hardlinks. As an example of savings: this reduces the (compressed) size of the Fedora Flatpak Runtime image from 1345MiB to 712MiB. Resolves: ostreedev#2925
cef1a98
to
3b2fd6e
Compare
For ostree_repo_export_tree_to_archive(), and 'ostree export', when the exported tree contains multiple files with the same checksum, write an archive with hard links.
Without this, importing a tree, then exporting it again breaks hardlinks.
As an example of savings: this reduces the (compressed) size of the Fedora Flatpak Runtime image from 1345MiB to 712MiB.
Resolves: #2925
As noted in #2925, if this is considered insufficiently compatible, it could be put behind an option. If someone is untarring an 'ostree export' tarball and then using it read-write, then hardlinking coincidentally the same files (like all empty files) could be quite surprising. For typical usage of ostree, however, using hardlinks in the exported tar is less surprising.
There's a fair bit of memory usage during export from keeping all the OstreeRepoFile objects. Making the hash table be 'checksum => path string' might save some of that, though it would depend on the paths. I'm not sure which way is better.