
When exporting, use hardlinks for duplicated files #3060

Merged: 1 commit merged into ostreedev:main on Oct 3, 2023

Conversation

owtaylor (Contributor)

For ostree_repo_export_tree_to_archive() and 'ostree export', when the exported tree contains multiple files with the same checksum, write an archive with hard links.

Without this, importing a tree and then exporting it again breaks hardlinks.

As an example of savings: this reduces the (compressed) size of the Fedora Flatpak Runtime image from 1345MiB to 712MiB.
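For orientation, here is a minimal sketch of driving the affected API; the helper name, the libarchive format choice, and the trimmed error handling are illustrative assumptions, not code from this PR:

```c
#include <archive.h>
#include <gio/gio.h>
#include <ostree.h>

/* Hypothetical helper (not part of this PR): export a ref to a tar file via
 * ostree_repo_export_tree_to_archive().  With this change, files in the tree
 * that share a content checksum are emitted as tar hardlink entries rather
 * than as repeated copies of the same regular file. */
static gboolean
export_ref_to_tar (OstreeRepo *repo, const char *ref, const char *out_path,
                   GCancellable *cancellable, GError **error)
{
  g_autoptr (GFile) root = NULL;
  g_autofree char *commit = NULL;
  if (!ostree_repo_read_commit (repo, ref, &root, &commit, cancellable, error))
    return FALSE;

  struct archive *a = archive_write_new ();
  archive_write_set_format_gnutar (a);
  archive_write_open_filename (a, out_path);

  OstreeRepoExportArchiveOptions opts = { 0, };
  gboolean ret = ostree_repo_export_tree_to_archive (repo, &opts,
                                                     (OstreeRepoFile *) root,
                                                     a, cancellable, error);
  archive_write_close (a);
  archive_write_free (a);
  return ret;
}
```

With GNU tar, the duplicates then show up in `tar -tvf` listings as hardlink entries pointing at the first occurrence instead of as full copies.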

Resolves: #2925

As noted in #2925, if this is considered insufficiently compatible, it could be put behind an option. If someone untars an 'ostree export' tarball and then uses it read-write, hardlinking files that merely happen to have the same content (like all empty files) could be quite surprising. For typical usage of ostree, however, using hardlinks in the exported tar is the less surprising behavior.

There's a fair bit of memory usage during export from keeping all the OstreeRepoFile objects. Making the hash table be 'checksum => path string' might save some of that, though it would depend on the paths. I'm not sure which way is better.
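For illustration, a sketch of the two hash-table shapes being weighed; `seen_checksums` matches the name in the review excerpt quoted below, while the path-string variant is the hypothetical alternative described above:

```c
#include <glib-object.h>

/* Shape used by this PR (sketch): checksum => first OstreeRepoFile seen.
 * Keys are borrowed from the repo file objects (kept alive by the value
 * ref), so there is no key destroy function; values hold a strong ref. */
static GHashTable *
new_seen_checksums_table (void)
{
  return g_hash_table_new_full (g_str_hash, g_str_equal, NULL, g_object_unref);
}

/* Hypothetical alternative: checksum => archive path string.  Each entry is
 * two heap-allocated strings instead of a whole OstreeRepoFile, at the cost
 * of copying both the checksum and the path. */
static GHashTable *
new_seen_paths_table (void)
{
  return g_hash_table_new_full (g_str_hash, g_str_equal, g_free, g_free);
}
```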

openshift-ci bot commented Sep 29, 2023

Hi @owtaylor. Thanks for your PR.

I'm waiting for an ostreedev member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.


cgwalters (Member) left a comment

Thanks for working on this! This looks sane to me.

    }
  else
    {
      g_hash_table_insert (seen_checksums, (char *)checksum, g_object_ref (path));

This may be worth a comment like:

/* The checksum string is owned by the repo file object */ or so; I paused for a few seconds here thinking about the memory management.
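Applied to the excerpt above, that note might read something like this (a sketch of the suggestion, not the exact wording that landed):

```c
  else
    {
      /* The checksum string is owned by the repo file object (`path`); the
       * g_object_ref() below keeps that object, and therefore the key,
       * alive for the lifetime of the table, so no copy is needed. */
      g_hash_table_insert (seen_checksums, (char *)checksum, g_object_ref (path));
    }
```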

tests/libtest.sh (outdated)
@@ -249,6 +249,9 @@ setup_test_repository () {
     mkdir baz/another/
     echo x > baz/another/y
 
+    echo "SAME_CONTENT" > baz/duplicate_a
+    echo "SAME_CONTENT" > baz/duplicate_b

Let's also add a test case for hardlinks across directories?

@cgwalters (Member)

/ok-to-test

@owtaylor (Contributor, Author) commented Sep 29, 2023

Pushed a new version with clang-format applied. [EDIT: or maybe not. Will push a new version in a few minutes with your requested changes too]

@cgwalters (Member)

> There's a fair bit of memory usage during export from keeping all the OstreeRepoFile objects. Making the hash table be 'checksum => path string' might save some of that, though it would depend on the paths. I'm not sure which way is better.

The other alternative would be to walk the tree twice; the first time to find hardlinked objects. That may be a good CPU/memory tradeoff.

@cgwalters enabled auto-merge September 29, 2023 17:00
@owtaylor (Contributor, Author)

> There's a fair bit of memory usage during export from keeping all the OstreeRepoFile objects. Making the hash table be 'checksum => path string' might save some of that, though it would depend on the paths. I'm not sure which way is better.

> The other alternative would be to walk the tree twice; the first time to find hardlinked objects. That may be a good CPU/memory tradeoff.

You mean, the first time, keep a checksum => seen count hash table, then use that to save only selected OstreeRepoFile objects on the second pass?

The maximum savings over the "save the paths" approach is <average length of path * number of files in tree> - for a very large tree (a million files), that might be around 100 MiB? I'm less sure about the savings compared to the "save the OstreeRepoFiles" approach.

I think I'd rather avoid the complexity and the chance of getting it wrong, but if you feel strongly, I'm happy to do it that way too.
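For reference, a rough sketch of what the counting pass of that two-pass idea could look like (hypothetical names; not code from this PR):

```c
#include <glib.h>

/* Hypothetical first pass: walk the tree once and count how many times each
 * content checksum occurs, so the writing pass only needs to remember files
 * whose checksum appears more than once. */
static GHashTable *
new_checksum_counts (void)
{
  /* checksum (owned copy) => count packed into the value pointer */
  return g_hash_table_new_full (g_str_hash, g_str_equal, g_free, NULL);
}

static void
count_checksum (GHashTable *checksum_counts, const char *checksum)
{
  guint count = GPOINTER_TO_UINT (g_hash_table_lookup (checksum_counts, checksum));
  /* g_hash_table_insert() keeps the existing key (and frees the duplicate
   * we pass in) when the checksum is already present. */
  g_hash_table_insert (checksum_counts, g_strdup (checksum),
                       GUINT_TO_POINTER (count + 1));
}
```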

@cgwalters (Member)

Definitely let's keep what you've tested, it's in the merge queue!

auto-merge was automatically disabled September 29, 2023 17:27

Head branch was pushed to by a user without write access

@cgwalters merged commit befd844 into ostreedev:main Oct 3, 2023
21 checks passed