
bats tests - parallelize #5552

Draft · wants to merge 4 commits into main from bats-parallel

Conversation

edsantiago
Member

All bats tests run with custom root/runroot, so it should be
possible to parallelize them.

(As of this initial commit, tests fail on my laptop, and I expect them to fail here. I just want to get a sense for how things go.)

Signed-off-by: Ed Santiago [email protected]


@edsantiago marked this pull request as draft May 29, 2024 17:45
Contributor

openshift-ci bot commented May 29, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: edsantiago
Once this PR has been reviewed and has the lgtm label, please assign rhatdan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Ephemeral COPR build failed. @containers/packit-build please check.

@edsantiago force-pushed the bats-parallel branch 2 times, most recently from 762e878 to a195865 on June 10, 2024
@edsantiago force-pushed the bats-parallel branch 2 times, most recently from 4d4e9c9 to e653a16 on June 27, 2024
Member

@Luap99 Luap99 left a comment


I assume once we enable parallel runs here we can port that over to the bud tests in podman? I'd like to get the speedup there too.

I haven't looked deeply into the prefetch logic changes but this looks much easier than podman so that is good.

```diff
@@ -19,4 +19,4 @@ function execute() {
 TESTS=${@:-.}
 
 # Run the tests.
-execute time bats --tap $TESTS
+execute time bats -j 4 --tap $TESTS
```
Member


I would use $(nproc) here instead of hard coding any count, should make it much faster when run locally.
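As a sketch, that suggestion would look like this (the fallback to 4, the previously hard-coded value, is my own hedge for systems without `nproc`; the actual `bats` invocation is commented out since it needs the test suite present):

```shell
# Pick the job count from the number of available CPUs; fall back to 4
# (the previously hard-coded value) if nproc is unavailable.
jobs=$(nproc 2>/dev/null || echo 4)
echo "using $jobs parallel bats jobs"
# execute time bats -j "$jobs" --tap $TESTS
```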


A friendly reminder that this PR had no activity for 30 days.

@mtrmac
Contributor

mtrmac commented Oct 10, 2024

(A brief look after a link from containers/skopeo#2437 )

zstd found even though push was without --force-compression: My first guess is that the `FROM alpine; _EOF`-built image (via build caching?) matches an image created in some other concurrently running test, and that causes BlobInfoCache to have a record of the zstd-compressed layer version.

For this purpose, it would be better to have the image’s layers be clearly independent from any other test — maybe a FROM scratch adding a file that contains the test name + timestamp.
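A minimal sketch of that idea (all names here are illustrative, and the `buildah` invocation is commented out since it needs a real environment):

```shell
# Build context whose single layer is guaranteed unique per test run, so no
# concurrently running test can share (and cross-pollute) its blobs via the
# blob info cache.
ctx=$(mktemp -d)
printf '%s %s %s\n' "my-test-name" "$(date +%s%N)" "$RANDOM" > "$ctx/uniquefile"
cat > "$ctx/Containerfile" <<'EOF'
FROM scratch
COPY uniquefile /uniquefile
EOF
# buildah bud -t "img-unique-$RANDOM" "$ctx"   # shown for shape only
echo "context ready at $ctx"
```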

@mtrmac
Contributor

mtrmac commented Oct 10, 2024

… oh, and: debug-level logs could help.

@edsantiago
Member Author

For this purpose, it would be better to have the image’s layers be clearly independent from any other test — maybe a FROM scratch adding a file that contains the test name + timestamp.

Thank you. I thought I had checked for conflicts, but must've missed something. I'll look into this again when time allows.

@edsantiago force-pushed the bats-parallel branch 5 times, most recently from d17cb1b to 68722ca on November 4, 2024 13:35
@edsantiago
Member Author

push test still flaking, and I'm giving up for the day. It is stumping me:

```
not ok 50 bud: build push with --force-compression

# # buildah build ...
...
# f0e6fc41f58b93d990cb24331c648bc84b066822d06782aec493d97bc2b7b263
# # buildah push [...] --compression-format gzip img-t50-nesrbrf5 docker://localhost:35311/img-t50-nesrbrf5
...
# # buildah push [...] --compression-format zstd --force-compression=false img-t50-nesrbrf5 docker://localhost:35311/img-t50-nesrbrf5
...
# # skopeo inspect img-t50-nesrbrf5
# {"schemaVersion":2,"mediaType":"application/vnd.oci.image.manifest.v1+json","config":{"mediaType":"application/vnd.oci.image.config.v1+json","digest":"sha256:f0e6fc41f58b93d990cb24331c648bc84b066822d06782aec493d97bc2b7b263","size":524},"layers":[{"mediaType":"application/vnd.oci.image.layer.v1.tar+zstd","digest":"sha256:05d36d22e1ad1534254c6965a3b43cf39f4dca9d5c95551eccf40108f076da2b","size":146}],"annotations":{"org.opencontainers.image.base.digest":"","org.opencontainers.image.base.name":""}}
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #|     FAIL: zstd found even though push was without --force-compression
# #| expected: !~ 'zstd'
```

Where is that 05d36 layer coming from?

@mtrmac
Contributor

mtrmac commented Nov 5, 2024

I can’t see any obvious reason. I’d suggest doing the pushes with --log-level=debug; that should include traces like "Trying to reuse blob".
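For illustration, the relevant decisions can be filtered out of a saved debug log like so (the sample log lines below are abbreviated stand-ins modeled on the ones quoted later in this thread, not real output):

```shell
# Filter blob-reuse decisions out of a debug-level push log.
cat > push.log <<'EOF'
time="..." level=debug msg="Trying to reuse blob sha256:05d36d22..."
time="..." level=debug msg="some unrelated debug line"
time="..." level=debug msg="Ignoring BlobInfoCache record of digest \"sha256:e4216d41...\" with unknown compression"
EOF
grep -E 'Trying to reuse blob|Ignoring BlobInfoCache' push.log
```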

@siteshwar

I understand this pull request is in draft state, but I got this warning on OpenScanHub:

Error: SHELLCHECK_WARNING ([CWE-398](https://cwe.mitre.org/data/definitions/398.html)): [#def1](https://openscanhub.fedoraproject.org/task/21458/log/added.html#def1)

/usr/share/buildah/test/system/helpers.bash:211:24: warning[[SC2115](https://github.com/koalaman/shellcheck/wiki/SC2115)]: Use "${var:?}" to ensure this never expands to /.

```
#  209|               else
#  210|                   # Failed. Clean up, so we don't leave incomplete remnants
#  211|->                 rm -fr $_BUILDAH_IMAGE_CACHEDIR/$fname
#  212|               fi
```

It should be fixed before merging this pull request.
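A sketch of the fix ShellCheck is asking for (the cache dir and filename here are temp stand-ins for the real `$_BUILDAH_IMAGE_CACHEDIR` and `$fname`):

```shell
# ${var:?} makes the expansion fatal when the variable is empty or unset,
# so `rm -fr` can never collapse to removing paths under / by accident.
_BUILDAH_IMAGE_CACHEDIR=$(mktemp -d)   # stand-in for the real cache dir
fname=incomplete-image.tar
touch "$_BUILDAH_IMAGE_CACHEDIR/$fname"
rm -fr "${_BUILDAH_IMAGE_CACHEDIR:?}/$fname"
echo "removed $fname; cache dir itself is untouched"
```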

@edsantiago
Member Author

Got it with debug. In-page search for 10b119 demonstrates the flake.

@edsantiago
Member Author

another one

@mtrmac
Contributor

mtrmac commented Nov 12, 2024

Got it with debug. In-page search for 10b119 demonstrates the flake.

Note to self:
First:

```
[+2477s] # time="2024-11-07T16:25:26Z" level=debug msg="Ignoring BlobInfoCache record of digest \"sha256:e4216d41a8ad46a3b75f9cdc2f40cab3cd56837931d4e73be4521a9537d7cde1\", uncompressed format does not match
```

OK. Then:

```
[+2477s] # time="2024-11-07T16:26:03Z" level=debug msg="Ignoring BlobInfoCache record of digest \"sha256:e4216d41a8ad46a3b75f9cdc2f40cab3cd56837931d4e73be4521a9537d7cde1\" with unknown compression"
```

How did we lose the compression format knowledge??

That’s not the immediate cause, but it suggests something unexpected is happening. Both have Using SQLite blob info cache at /var/lib/containers/cache/blob-info-cache-v1.sqlite.

@edsantiago
Member Author

The test image is created via

```
FROM scratch
COPY /therecanbeonly1 /uniquefile
```

(therecanbeonly1 contains date + random content)

It is the same image used for each test iteration (push --compression-format gzip, zstd --force-compression=false, zstd, and zstd --force-compression).

One approach I've considered but not tried: build a new image on each iteration. My gut tells me that this might get tests to pass, but is not necessarily the right thing to do. It depends on whether this issue is a real one that we might be sweeping under the rug.

edsantiago added a commit to edsantiago/buildah that referenced this pull request Nov 18, 2024
The _prefetch helper, introduced in containers#2036, is not parallel-safe: two
or more parallel jobs fetching the same image can step on each other
and produce garbage images.

Although we still can't run buildah tests in parallel (see containers#5552),
we can at least set up the scaffolding for that to happen. This
commit reworks _prefetch() such that the image work is wrapped
inside flock. It has been working fine for months in containers#5552,
and is IMO safe for production. This can then make it much
easier to flip the parallelization switch once the final zstd
bug is squashed.

Signed-off-by: Ed Santiago <[email protected]>
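A hedged sketch of the locking shape that commit message describes (function and lock-file names are illustrative, not buildah's actual helper code):

```shell
# Serialize the fetch of an image behind flock, so parallel bats jobs
# can't both write the same cache entry at once and produce garbage.
_prefetch_locked() {
    local img=$1
    local lockfile="${TMPDIR:-/tmp}/prefetch-demo.lock"
    (
        flock 9 || return 1
        # ...inside the lock: pull/copy $img into the shared cache...
        echo "fetched $img"
    ) 9>"$lockfile"
}

_prefetch_locked quay.io/libpod/testimage
```

The subshell holds fd 9 on the lock file for its whole duration, so the lock is released automatically even if the fetch fails mid-way.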
edsantiago added a commit to edsantiago/buildah that referenced this pull request Nov 19, 2024
Member Author

@edsantiago edsantiago left a comment


This is about as clean as I can leave this PR. Good luck!

```diff
@@ -165,6 +165,7 @@ _EOF
 # Helper function. push our image with the given options, and run skopeo inspect
 function _test_buildah_push() {
     run_buildah push \
+        --log-level=debug \
```
Member Author


FIXME! Remove this. It is only present in order to debug the zstd flake.

```diff
@@ -22,6 +22,7 @@ function mkcw_check_image() {
 
     # Decrypt, mount, and take a look around.
     uuid=$(cryptsetup luksUUID "$mountpoint"/disk.img)
+    echo "# uuid=$uuid" >&3
```
Member Author


This isn't needed either.
