
When exporting, use hardlinks for duplicated files #3060

Merged: 1 commit merged into ostreedev:main on Oct 3, 2023

Conversation

owtaylor (Contributor)

For ostree_repo_export_tree_to_archive() and 'ostree export', when the exported tree contains multiple files with the same checksum, write an archive with hard links.

Without this, importing a tree and then exporting it again breaks hardlinks.

As an example of savings: this reduces the (compressed) size of the Fedora Flatpak Runtime image from 1345MiB to 712MiB.
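For orientation, here is a minimal sketch of driving the affected API; the helper name, the libarchive format choice, and the trimmed error handling are illustrative assumptions, not code from this PR:

```c
#include <archive.h>
#include <gio/gio.h>
#include <ostree.h>

/* Hypothetical helper (not part of this PR): export a ref to a tar file via
 * ostree_repo_export_tree_to_archive().  With this change, files in the tree
 * that share a content checksum are emitted as tar hardlink entries rather
 * than as repeated copies of the same regular file. */
static gboolean
export_ref_to_tar (OstreeRepo *repo, const char *ref, const char *out_path,
                   GCancellable *cancellable, GError **error)
{
  g_autoptr (GFile) root = NULL;
  g_autofree char *commit = NULL;
  if (!ostree_repo_read_commit (repo, ref, &root, &commit, cancellable, error))
    return FALSE;

  struct archive *a = archive_write_new ();
  archive_write_set_format_gnutar (a);
  archive_write_open_filename (a, out_path);

  OstreeRepoExportArchiveOptions opts = { 0, };
  gboolean ret = ostree_repo_export_tree_to_archive (repo, &opts,
                                                     (OstreeRepoFile *) root,
                                                     a, cancellable, error);
  archive_write_close (a);
  archive_write_free (a);
  return ret;
}
```

With GNU tar, the duplicates then show up in `tar -tvf` listings as hardlink entries pointing at the first occurrence instead of as full copies.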

Resolves: #2925

As noted in #2925, if this is considered insufficiently compatible, it could be put behind an option. If someone untars an 'ostree export' tarball and then uses it read-write, hardlinking files that merely happen to have the same content (like all empty files) could be quite surprising. For typical usage of ostree, however, using hardlinks in the exported tar is the less surprising behavior.

There's a fair bit of memory usage during export from keeping all the OstreeRepoFile objects. Making the hash table be 'checksum => path string' might save some of that, though it would depend on the paths. I'm not sure which way is better.
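For illustration, a sketch of the two hash-table shapes being weighed; `seen_checksums` matches the name in the review excerpt quoted below, while the path-string variant is the hypothetical alternative described above:

```c
#include <glib-object.h>

/* Shape used by this PR (sketch): checksum => first OstreeRepoFile seen.
 * Keys are borrowed from the repo file objects (kept alive by the value
 * ref), so there is no key destroy function; values hold a strong ref. */
static GHashTable *
new_seen_checksums_table (void)
{
  return g_hash_table_new_full (g_str_hash, g_str_equal, NULL, g_object_unref);
}

/* Hypothetical alternative: checksum => archive path string.  Each entry is
 * two heap-allocated strings instead of a whole OstreeRepoFile, at the cost
 * of copying both the checksum and the path. */
static GHashTable *
new_seen_paths_table (void)
{
  return g_hash_table_new_full (g_str_hash, g_str_equal, g_free, g_free);
}
```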

openshift-ci bot commented Sep 29, 2023

Hi @owtaylor. Thanks for your PR.

I'm waiting for an ostreedev member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.


cgwalters (Member) left a comment

Thanks for working on this! This looks sane to me.

    }
  else
    {
      g_hash_table_insert (seen_checksums, (char *)checksum, g_object_ref (path));

This may be worth a comment like:

/* The checksum string is owned by the repo file object */ or so; I paused for a few seconds here thinking about the memory management.
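Applied to the excerpt above, that note might read something like this (a sketch of the suggestion, not the exact wording that landed):

```c
  else
    {
      /* The checksum string is owned by the repo file object (`path`); the
       * g_object_ref() below keeps that object, and therefore the key,
       * alive for the lifetime of the table, so no copy is needed. */
      g_hash_table_insert (seen_checksums, (char *)checksum, g_object_ref (path));
    }
```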

tests/libtest.sh (outdated)
@@ -249,6 +249,9 @@ setup_test_repository () {
     mkdir baz/another/
     echo x > baz/another/y
 
+    echo "SAME_CONTENT" > baz/duplicate_a
+    echo "SAME_CONTENT" > baz/duplicate_b

Let's also add a test case for hardlinks across directories?

@cgwalters (Member)

/ok-to-test

@owtaylor (Contributor, Author) commented Sep 29, 2023

Pushed a new version with clang-format applied. [EDIT: or maybe not. Will push a new version in a few minutes with your requested changes too]

@cgwalters (Member)

> There's a fair bit of memory usage during export from keeping all the OstreeRepoFile objects. Making the hash table be 'checksum => path string' might save some of that, though it would depend on the paths. I'm not sure which way is better.

The other alternative would be to walk the tree twice; the first time to find hardlinked objects. That may be a good CPU/memory tradeoff.

@cgwalters enabled auto-merge September 29, 2023 17:00
@owtaylor (Contributor, Author)

> There's a fair bit of memory usage during export from keeping all the OstreeRepoFile objects. Making the hash table be 'checksum => path string' might save some of that, though it would depend on the paths. I'm not sure which way is better.

> The other alternative would be to walk the tree twice; the first time to find hardlinked objects. That may be a good CPU/memory tradeoff.

You mean, the first time, keep a checksum => seen count hash table, then use that to save only selected OstreeRepoFile objects on the second pass?

The maximum savings over the "save the paths" approach is <average length of path * number of files in tree> - for a very large tree (a million files), that might be around 100 MiB? I'm less sure about the savings compared to the "save the OstreeRepoFiles" approach.

I think I'd rather avoid the complexity and the chance of getting it wrong, but if you feel strongly, I'm happy to do it that way too.
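For reference, a rough sketch of what the counting pass of that two-pass idea could look like (hypothetical names; not code from this PR):

```c
#include <glib.h>

/* Hypothetical first pass: walk the tree once and count how many times each
 * content checksum occurs, so the writing pass only needs to remember files
 * whose checksum appears more than once. */
static GHashTable *
new_checksum_counts (void)
{
  /* checksum (owned copy) => count packed into the value pointer */
  return g_hash_table_new_full (g_str_hash, g_str_equal, g_free, NULL);
}

static void
count_checksum (GHashTable *checksum_counts, const char *checksum)
{
  guint count = GPOINTER_TO_UINT (g_hash_table_lookup (checksum_counts, checksum));
  /* g_hash_table_insert() keeps the existing key (and frees the duplicate
   * we pass in) when the checksum is already present. */
  g_hash_table_insert (checksum_counts, g_strdup (checksum),
                       GUINT_TO_POINTER (count + 1));
}
```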

@cgwalters (Member)

Definitely let's keep what you've tested, it's in the merge queue!

auto-merge was automatically disabled September 29, 2023 17:27

Head branch was pushed to by a user without write access

@cgwalters merged commit befd844 into ostreedev:main Oct 3, 2023
21 checks passed