ci: Refactor Dockerfile & entrypoint #8923

upbqdn · 2024-10-10T10:44:30Z

Motivation & Solution

The current Dockerfile and entrypoint.sh files contain a bunch of bugs. This PR contains the following changes:

- Create a non-privileged system user in the runtime Docker stage and switch to it.
  - Don't specify UID & GID.
  - Don't specify the home dir for the new system user.
- Don't use gosu.
- Remove all packages from the runtime stage.
- Remove some malfunctioning CI tests.
- Don't use the EXPOSE instruction in Docker.
- Bump the Rust version in Dockerfile.
- Change the location of the entrypoint in Docker images.
- Remove some redundant env vars.
- Refactor the structure of the entrypoint.
- Add docs and TODOs to the entrypoint.

Tests

Manually test that zebrad runs under the new zebra user:

Running

docker build --file docker/Dockerfile --target runtime --tag zebra:local .
docker run --detach --name zebra_local zebra:local
docker exec -it -u root zebra_local bash
apt-get update && apt-get install -y procps
ps aux

displays

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
zebra          1 86.3  2.7 6605720 2691512 ?     Ssl  09:49  31:03 zebrad -c /etc/zebrad/zebrad.toml
root         150  0.0  0.0   4188  3368 pts/0    Ss   10:23   0:00 bash
root         438  0.0  0.0   8088  4044 pts/0    R+   10:25   0:00 ps aux

PR Checklist

The PR name is suitable for the change log.
The solution is tested.
The PR has a priority label.


gustavovalverde · 2024-10-16T09:03:52Z

@upbqdn can the motivation be expanded/updated here? For future reference, it might be confusing if someone looks at the PR and understands that all these changes were required to fix the use of the root user.

gustavovalverde

I haven't fully reviewed entrypoint.sh, as some of these requested changes might have a slight impact there. Changes to CI files, unless they're related to the Docker changes, should be in a different PR that depends on this one.

gustavovalverde · 2024-10-16T11:32:12Z

docker/Dockerfile

-ARG UID=10001
-ENV UID=${UID}
-ARG GID=10001
-ENV GID=${GID}
-
-RUN addgroup --system --gid ${GID} ${USER} \
-    && adduser \
-    --system \
-    --disabled-login \
-    --shell /bin/bash \
-    --home ${APP_HOME} \
-    --uid "${UID}" \
-    --gid "${GID}" \
-    ${USER}


There might be instances where a user would like to (re-)build the image with a custom UID:GID, for example because they need to mount files from their host whose ownership doesn't match the UID:GID of the container user. It's also good Docker practice to specify the UID for the default user, to cover other edge cases.

Unless there's a good reason to remove this, I'd suggest to keep it.

The PR that added this has some references to the reasoning behind it: #8803 (comment)

Note that we're creating a system user. Those users should have UIDs in [1, 999].

In Debian, the docs for adduser say that the default dynamic range for system UIDs is defined by [FIRST_SYSTEM_UID, LAST_SYSTEM_UID], which is [100, 999] in /etc/adduser.conf: https://manpages.debian.org/bullseye/adduser/adduser.8.en.html

Moreover, the Linux Standard Base (LSB) spec says that system UIDs in [100, 499] should be reserved for dynamic allocation: https://refspecs.linuxfoundation.org/LSB_5.0.0/LSB-Core-generic/LSB-Core-generic/uidrange.html

I'm surprised the Docker docs don't mention this.

adduser automatically checks if a UID is available before assigning one from the right range. Since I wasn't sure what UID to pick, and no user brought it up, I removed it to avoid scope creep. I'd prefer to address this issue according to the spec once a user brings it up.
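To make the ranges above concrete, here is a minimal shell sketch that classifies a UID against them. The `uid_class` helper is hypothetical and not part of this PR. It assumes the Debian defaults quoted above (FIRST_SYSTEM_UID=100, LAST_SYSTEM_UID=999) are unchanged in the image.

```shell
#!/bin/sh
# Hypothetical helper, not part of the PR: classify a numeric UID using the
# ranges cited above. Debian's default FIRST_SYSTEM_UID=100 and
# LAST_SYSTEM_UID=999 are assumed.
uid_class() {
  uid="$1"
  if [ "$uid" -eq 0 ]; then
    echo "root"
  elif [ "$uid" -lt 100 ]; then
    echo "static-system"
  elif [ "$uid" -le 999 ]; then
    echo "dynamic-system"
  else
    echo "regular"
  fi
}

uid_class 10001   # the UID that was hard-coded in the Dockerfile
uid_class 150     # a UID that adduser --system could pick on its own
```

Under these assumptions, 10001 falls outside the system range, while a UID chosen by adduser --system lands inside it.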

Unfortunately, this is not clear in Docker's documentation. But some of their tools, docker init for example, output the recommended approach, which is the following (with Alpine Linux):

ARG UID=10001
RUN adduser \
    --disabled-password \
    --gecos "" \
    --home "/nonexistent" \
    --shell "/sbin/nologin" \
    --no-create-home \
    --uid "${UID}" \
    appuser
USER appuser

In any case, to match their recommendation on Debian, this would change to:

RUN addgroup --system --gid ${GID} ${USER} \
    && useradd \
    --uid "${UID}" \
    --no-create-home \
    --home-dir /nonexistent \
    --shell /usr/sbin/nologin \
    --comment "" \
    appuser

The main reasons to keep this instruction are:

Host and Container Alignment: When you mount host directories into the container, file permissions are based on UID and GID. If the container user's IDs match those on the host, you avoid permission issues.

I've experienced this myself when I was testing the docker-compose and mounting a cached state from my host into the container, which had different UID:GID, and the container was not able to write to it.

Predictable IDs: By specifying the UID and GID, you ensure that the user inside the container has the same IDs across different builds, deployments, and host systems.

Additionally, this is very common in Docker, and plenty of projects use this approach to deal with these and other use-cases: https://github.com/search?q=%22useradd%22+%22--uid%22++language%3ADockerfile&type=code
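As an illustration of the host and container alignment point, a user could pass their own IDs at build time. This is only a sketch: it assumes the Dockerfile still declares ARG UID and ARG GID (as in the lines removed by this PR), and `host_aligned_build_cmd` is a hypothetical helper that prints the command rather than running it.

```shell
#!/bin/sh
# Sketch only: print a build command that passes the host user's IDs as
# build arguments. This assumes the Dockerfile still declares ARG UID and
# ARG GID, as in the lines removed by this PR.
host_aligned_build_cmd() {
  printf 'docker build --file docker/Dockerfile --target runtime --build-arg UID=%s --build-arg GID=%s --tag zebra:local .\n' \
    "$(id -u)" "$(id -g)"
}

host_aligned_build_cmd
```

With matching IDs, files in a bind-mounted host directory would be owned by the same numeric user inside the container.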

gustavovalverde · 2024-10-16T12:18:13Z

docker/Dockerfile

-ARG APP_HOME
-ENV APP_HOME=${APP_HOME}
-WORKDIR ${APP_HOME}


We should set a WORKDIR; otherwise the user will end up in the / directory without the permissions they might need for testing or personal purposes.

The ARG + ENV combination also allows the user to set a custom directory, in case their host permissions do not allow the one we've chosen.

System users should have no home dir and should not even be able to log in to the machine. Our users should use -u root when they're logging in to the machine. The logic I had in mind is very simple and minimal:

The whole runtime target has its entrypoint executed under the non-privileged zebra system user which has no home dir and no login (just as in a regular Linux env).

The Dockerfile sets up the minimal requirements for the zebra user to execute the entrypoint.

When our user wants to do some tweaks, they explicitly use -u root, which they can always do, and which gives them the clarity that they have full privileges.

I wanted to describe this in our docs in a subsequent PR. Is it OK if we do it like that?

gustavovalverde · 2024-10-16T12:22:47Z

docker/Dockerfile

-ARG FEATURES
-ENV FEATURES=${FEATURES}


If the runtime stage is built with custom FEATURES, they are propagated by default as an ENV variable to entrypoint.sh. If we remove these instructions, we would always need to specify the FEATURES environment variable when running the image; otherwise it will be empty.

Yes, but that var is not used in any meaningful way in entrypoint.sh, so I removed it. I would like to refactor the way we configure Zebra inside Docker in follow-up PRs.

It's used here:
https://github.com/ZcashFoundation/zebra/pull/8923/files#diff-4f5cabe26761257a4d685a6edc7a43e0fe0f78762f50eeb48530f2bd3b3ee7caR81
and here:
https://github.com/ZcashFoundation/zebra/pull/8923/files#diff-4f5cabe26761257a4d685a6edc7a43e0fe0f78762f50eeb48530f2bd3b3ee7caR100

From our documentation, if we suggest:

docker build -f ./docker/Dockerfile --target runtime --build-arg FEATURES='default-release-binaries prometheus' --tag local/zebra.mining:latest .

The entrypoint will evaluate the $FEATURES var and it will be empty, as it was never defined as a variable. This would be confusing for the user as they built it with --build-arg FEATURES='default-release-binaries prometheus' but the configuration file is not adding the corresponding section, as that argument was not passed as an Environment variable (at build time) to the container.
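The effect described above can be reproduced without Docker: a value that was only a build argument is simply unset when the entrypoint runs. Here is a minimal sketch, assuming a hypothetical `effective_features` helper and a hypothetical fallback value. The real entrypoint may handle this differently.

```shell
#!/bin/sh
# Hypothetical helper: report the feature list the entrypoint would see.
# If FEATURES was never set in the container environment, it expands to
# the empty string unless a fallback is supplied.
effective_features() {
  fallback="$1"
  echo "${FEATURES:-$fallback}"
}

unset FEATURES
effective_features ""                          # prints an empty line
effective_features "default-release-binaries"  # prints the fallback

FEATURES='default-release-binaries prometheus'
effective_features "default-release-binaries"  # prints the value that was set
```

Keeping the ENV instruction in the Dockerfile corresponds to the last case, where the build-time value is visible at run time.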

gustavovalverde · 2024-10-16T12:26:14Z

docker/Dockerfile

-RUN mkdir -p ${ZEBRA_CONF_DIR} && chown ${UID}:${UID} ${ZEBRA_CONF_DIR} \
-    && chown ${UID}:${UID} ${APP_HOME}


Having a HOME directory for the application is good practice. It also starts the container in an empty directory that users can use as they see fit.

System users should have no home dir. Our users should go with -u root. Another approach would be to execute the entrypoint under root, and then run zebrad under the zebra system user. Our users could then log in implicitly as root without -u root, but that adds a bit of extra complexity to the entrypoint, which I wanted to keep simple for now.

If we're not allowing the user to log in with the user running Zebra, then we should document the approach they can use to get a bash/sh terminal in the container to troubleshoot in it.

gustavovalverde · 2024-10-16T12:34:31Z

docker/entrypoint.sh

+  prepare_env_vars
+
+  if [[ ! -f "${ZEBRA_CONF_PATH}" ]] && [[ -d "${ZEBRA_CONF_DIR}" ]]; then
+    ZEBRA_CONF_PATH="${ZEBRA_CONF_DIR}/zebrad.toml"


This default should be set in the Dockerfile, to keep it as the single source of truth for default variables.

Which var do you have in mind?

We can use the var you defined: ZEBRA_CONF_PATH. But have the default value defined in the Dockerfile, as we do here:

zebra/docker/Dockerfile

Line 189 in afeb05f

ENV ZEBRA_CONF_DIR=${ZEBRA_CONF_DIR}

gustavovalverde · 2024-10-16T12:39:24Z

.github/workflows/sub-ci-unit-tests-docker.yml

-  # Test reconfiguring the the docker image for tesnet.
-  test-configuration-file-testnet:
-    name: Test CI testnet Docker config file
-    # Make sure Zebra can sync the genesis block on testnet
-    uses: ./.github/workflows/sub-test-zebra-config.yml
-    with:
-      test_id: 'testnet-conf'
-      docker_image: ${{ vars.GAR_BASE }}/${{ vars.CI_IMAGE_NAME }}@${{ inputs.image_digest }}
-      grep_patterns: '-e "net.*=.*Test.*estimated progress to chain tip.*Genesis" -e "net.*=.*Test.*estimated progress to chain tip.*BeforeOverwinter"'
-      # TODO: improve the entrypoint to avoid using `ENTRYPOINT_FEATURES=""`
-      test_variables: '-e NETWORK -e ZEBRA_CONF_PATH="/etc/zebrad/zebrad.toml" -e ENTRYPOINT_FEATURES=""'
-      network: 'Testnet'
-


This test was created to confirm that any change we make in CI or in Docker doesn't affect the ability to read the proper $NETWORK environment variable. It has happened before that a change broke this behavior and the tests ran on Mainnet instead of Testnet, but we realized it too late or had to wait for some tests to run to confirm it.

This test is failing with this PR, similarly to the other one. I have a better approach in mind, which I didn't do in this PR. Let's add it back in a subsequent PR?

Moreover, there seem to be some parts that are bugs or hard to understand. For example, I couldn't figure out what -e NETWORK in test_variables is supposed to do. Also, setting ENTRYPOINT_FEATURES to the empty string to enable the test in the entrypoint makes it very hard to follow the execution path in the whole pipeline.

-e NETWORK tells docker to use whatever value the $NETWORK env variable is set to, to override any default that was set at build time, or in the Dockerfile. This happens here:

zebra/.github/workflows/sub-test-zebra-config.yml

Lines 90 to 91 in 46c6b6e

env:
  NETWORK: '${{ inputs.network }}'

and makes Zebra run with a Testnet configuration, and the test validates that Zebra is correctly running with it.

This has saved me (and others) from making mistakes multiple times while making CI refactors, so this is very important under those circumstances.

I do agree that setting ENTRYPOINT_FEATURES to an empty string is a dirty hack to make this work, but that's tech debt that wouldn't justify removing the whole test. In any case, we can remove the use of the ENTRYPOINT_FEATURES variable while keeping this test's behavior.

I'd suggest commenting this out and adding a TODO on top of it, or creating an issue, instead of removing the test. Just so we don't forget later on, as this is an important validation.

gustavovalverde · 2024-10-16T12:46:19Z

.github/workflows/cd-deploy-nodes-gcp.yml

-  # Test that Zebra works using $ZEBRA_CONF_PATH config
-  test-zebra-conf-path:
-    name: Test CD custom Docker config file
-    needs: build
-    uses: ./.github/workflows/sub-test-zebra-config.yml
-    with:
-      test_id: 'custom-conf'
-      docker_image: ${{ vars.GAR_BASE }}/zebrad@${{ needs.build.outputs.image_digest }}
-      grep_patterns: '-e "loaded zebrad config.*config_path.*=.*v1.0.0-rc.2.toml"'
-      test_variables: '-e NETWORK -e ZEBRA_CONF_PATH="zebrad/tests/common/configs/v1.0.0-rc.2.toml"'
-      network: ${{ inputs.network || vars.ZCASH_NETWORK }}
-


Although very simple, the objective of this test is to confirm that the entrypoint is able to handle a custom configuration path ($ZEBRA_CONF_PATH), run with it and confirm that path is being used.

This could be extended to validate that Zebra runs with a mounted file, using --mount type=bind,source="$(pwd)"/target,target=/app as part of the test_variables, but that's out of scope.

The test was using a custom config file set in test_variables.
However, the file was not included in the Docker image, and the
entrypoint script created a new, default one under the original file's
path. Zebra then loaded this new file, and the test passed because the
pattern in grep_patterns matched Zebra's output containing the
original path, even though the config file was different.

The test fails in this PR due to the fixes in the entrypoint.
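One way to avoid this kind of false positive is for the entrypoint to refuse to start when an explicitly requested config file is missing, instead of generating a default at that path. Here is a minimal sketch under that assumption. `resolve_conf_path` is a hypothetical name, and this is not necessarily how the entrypoint in this PR behaves.

```shell
#!/bin/sh
# Hypothetical sketch: only fall back to a default config when the user
# did not ask for a specific file.
resolve_conf_path() {
  requested="$1"   # value of ZEBRA_CONF_PATH, possibly empty
  default="$2"     # default location, e.g. ${ZEBRA_CONF_DIR}/zebrad.toml
  if [ -z "$requested" ]; then
    echo "$default"
  elif [ -f "$requested" ]; then
    echo "$requested"
  else
    echo "config file not found: $requested" >&2
    return 1
  fi
}

# No explicit request: the default path is used.
resolve_conf_path "" /etc/zebrad/zebrad.toml
```

With a check like this, a test that points ZEBRA_CONF_PATH at a file missing from the image would fail loudly rather than pass against a freshly generated default.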

I'd suggest commenting this out and adding a TODO on top of it, or creating an issue. Just so we don't forget later on.

Either way, I'd suggest noting something like:

We need to create a test that validates we can mount a configuration file to a different path than the default used by ZEBRA_CONF_PATH, and that Zebra runs using this new file and path.

upbqdn · 2024-10-17T12:02:30Z

@upbqdn can the motivation be expanded/updated here? For future reference, it might be confusing if someone looks at the PR and understands that all these changes were required to fix the use of the root user.

I updated the PR description.

upbqdn · 2024-10-18T09:13:49Z

My yml linter updated the formatting of the cd-deploy-nodes-gcp.yml file when merging main. I'm happy to revert those changes if they don't fit.

mpguerra · 2024-11-14T09:07:57Z

Do we want to do anything further here? Otherwise we should either merge or close without merging.

gustavovalverde · 2024-11-14T10:10:47Z

Do we want to do anything further here? Otherwise we should either merge or close without merging.

I added some review comments which are pending a reply. We should also consider the latest interaction we had with some users in Discord: changes related to permission handling, mounting a configuration file, the use of curl, etc. should be reviewed so we don't break those use cases.

upbqdn · 2024-11-14T10:25:06Z

I'm planning to address the comments and get the PR approved.

upbqdn added 10 commits October 2, 2024 11:55

Refactor formatting & docs

64be0ae

Refactor the runtime stage in Dockerfile

a248b14

Remove unused code from entrypoint.sh

ca1620c

Simplify entrypoint.sh setup

ea8b119

Revise docs & formatting

7cba8cf

Adjust default values for env vars

56c65e6

Bump Rust v from 1.79 to 1.81 in Dockerfile

be38132

Refactor entrypoint.sh

0492b7a

Refactor Dockerfile

6595740

Add TODOs for monitoring stage to Dockerfile

22dc738

upbqdn added the C-bug, A-devops, C-trivial, and P-Medium ⚡ labels Oct 10, 2024

upbqdn self-assigned this Oct 10, 2024

upbqdn requested a review from a team as a code owner October 10, 2024 10:44

upbqdn requested review from arya2 and removed request for a team October 10, 2024 10:44

upbqdn marked this pull request as draft October 10, 2024 10:45

upbqdn removed the request for review from arya2 October 10, 2024 10:45

upbqdn added 6 commits October 10, 2024 12:45

Merge branch 'main' into docker-refactor

962d4d3

Refactor Dockerfile

621754b

Add TODOs for monitoring stage to Dockerfile

6b68592

Merge branch 'docker-refactor' of github.com:ZcashFoundation/zebra into docker-refactor

c788ccc

Fix a typo

38837e4

Merge branch 'main' into docker-refactor

b58a602

oxarbitrage added the do-not-merge label Oct 10, 2024

upbqdn added 3 commits October 11, 2024 12:44

Allow running zebrad in test mode

2ada296

Merge branch 'docker-refactor' of github.com:ZcashFoundation/zebra into docker-refactor

99cd18f

Merge branch 'main' into docker-refactor

c718609

upbqdn force-pushed the docker-refactor branch 3 times, most recently from eda21a9 to 32ef2ef Compare October 11, 2024 14:29

upbqdn added 2 commits October 11, 2024 16:45

Allow custom config for zebrad in test mode

69b03d4

Remove curl from the runtime Docker image

6932d9a

upbqdn force-pushed the docker-refactor branch from 32ef2ef to 6932d9a Compare October 11, 2024 14:45

upbqdn added 2 commits October 12, 2024 16:48

Remove redundant echos

6fe460d

upbqdn force-pushed the docker-refactor branch from cdc54ba to c5010b8 Compare October 12, 2024 15:08

upbqdn added 2 commits October 12, 2024 17:59

Remove a redundant CI test

e05df78

Remove all packages from the runtime stage

e9f0479

upbqdn force-pushed the docker-refactor branch from 485423a to e9f0479 Compare October 14, 2024 09:30

Merge branch 'main' into docker-refactor

7422ecf

upbqdn requested a review from gustavovalverde October 14, 2024 11:29

upbqdn removed the do-not-merge label Oct 14, 2024

upbqdn marked this pull request as ready for review October 14, 2024 11:34

Docs cosmetics

4fa064c

gustavovalverde requested changes Oct 16, 2024

Merge branch 'main' into docker-refactor

afeb05f

arya2 added the do-not-merge label Oct 26, 2024

mpguerra removed the do-not-merge label Nov 21, 2024

arya2 added and removed the do-not-merge label Dec 5, 2024
