
build and publish ARM images for kubeflow pipelines #10309

Open · thesuperzapper opened this issue Dec 12, 2023 · 20 comments

thesuperzapper (Member) commented Dec 12, 2023

Description

Currently, Kubeflow Pipelines only publishes amd64 container images, while most other Kubeflow components now publish both amd64 and arm64.

Here is the list of images that need to be updated:
(this was the list for 2.0.0-alpha.7, more may have been added for 2.0.0+)

  • gcr.io/ml-pipeline/cache-server
  • gcr.io/ml-pipeline/metadata-envoy
  • gcr.io/ml-pipeline/metadata-writer
  • gcr.io/ml-pipeline/api-server
  • gcr.io/ml-pipeline/persistenceagent
  • gcr.io/ml-pipeline/scheduledworkflow
  • gcr.io/ml-pipeline/frontend
  • gcr.io/ml-pipeline/viewer-crd-controller
  • gcr.io/ml-pipeline/visualization-server
  • gcr.io/tfx-oss-public/ml_metadata_store_server
  • gcr.io/google-containers/busybox

While most of these can run under Rosetta (on Apple Silicon Macs only), they run much slower and so are really only useful for testing.
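
For reference, you can check which platforms a given image currently publishes by inspecting its manifest list (the tag here is just the alpha release mentioned above):

```shell
# Print the platform entries of the image's manifest list;
# an amd64-only image will show a single amd64 entry (or no list at all).
docker manifest inspect gcr.io/ml-pipeline/api-server:2.0.0-alpha.7 \
  | grep -A 2 '"platform"'
```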

Furthermore, the gcr.io/tfx-oss-public/ml_metadata_store_server image straight up does not work (even under emulation); I have made a separate issue to track this one, as it is not controlled by KFP and is part of google/ml-metadata:


Love this idea? Give it a 👍.

thesuperzapper (Member, Author)

@chensun @zijianjoy I think this is a very important issue, as ARM64 machines (especially Apple Silicon MacBooks) are now very common.

thesuperzapper (Member, Author)

I can see that there was a merged PR to make some builds succeed on ARM64 (from 2019):

But another one got closed due to inactivity:

I will tag the author of those PRs so they can comment on this @MrXinWang.

rimolive (Member)

@thesuperzapper Let me know how I can help with this.

Talador12

+1 on this issue. Each quarter, more people are switching to Apple Silicon from older Intel Macs.

thesuperzapper (Member, Author)

Another image is gcr.io/google-containers/busybox, which is used in place of the real image for cached pipeline steps (to run an echo command saying the step is cached).

thesuperzapper changed the title from "build and publish arm64 images for kubeflow pipelines" to "build and publish ARM images for kubeflow pipelines" on Mar 22, 2024
thesuperzapper (Member, Author)

In my testing of building the images for linux/arm64, the only hard blockers are actually Python packages in the following images:

The problematic pip packages are:

There are already upstream issues for some of them, but they mostly relate to Apple Silicon (slightly different from Linux ARM64); I imagine that solving one will make it much easier to solve the other:

We either need to get those packages working so they can be pip-installed on Linux ARM64, or remove our dependency on them.

rimolive (Member)

@thesuperzapper metadata-writer and visualization-server are deprecated KFP v1 components, so they're not required for KFP v2.

AndersBennedsgaard commented May 15, 2024

We run a small ARM-based cluster that we want to run Kubeflow on, so I have started building the components for ARM. I've been successful at building the cache-server, persistence agent, scheduled workflow agent, viewer-crd-controller, and frontend. I only had to set --platform=$BUILDPLATFORM as an argument in the first Dockerfile stage and, for all the Go-based components, add GOOS=$TARGETOS GOARCH=$TARGETARCH to the go build step. However, building the API server seems to need a little more work.
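
For the Go-based components, the change is roughly this shape (a minimal sketch; the real KFP Dockerfiles have more stages, and the paths/base images here are illustrative):

```dockerfile
# The builder stage always runs natively on the build host and
# cross-compiles for the requested target platform.
FROM --platform=$BUILDPLATFORM golang:1.21 AS builder
ARG TARGETOS
ARG TARGETARCH
WORKDIR /src
COPY . .
# Pure-Go components cross-compile with GOOS/GOARCH alone (CGO off).
RUN CGO_ENABLED=0 GOOS=$TARGETOS GOARCH=$TARGETARCH \
    go build -o /bin/persistence_agent ./backend/src/agent/persistence

# The final stage is pulled for the target platform, matching the binary.
FROM gcr.io/distroless/static:nonroot
COPY --from=builder /bin/persistence_agent /bin/persistence_agent
ENTRYPOINT ["/bin/persistence_agent"]
```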

The API server is harder mainly because https://github.com/mattn/go-sqlite3/ now needs to be compiled with a cross-compiler, so I have to run apt-get install -y gcc-aarch64-linux-gnu g++-aarch64-linux-gnu and set the CC=aarch64-linux-gnu-gcc CXX=aarch64-linux-gnu-g++ CGO_ENABLED=1 environment variables during go build, which works!
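
Concretely, that means adding something like this to the builder stage (a sketch; it assumes an amd64 build host targeting arm64, and the apiserver path is illustrative):

```dockerfile
# Extra steps needed only for the CGO-based api-server build
# (cross-compiling go-sqlite3 from an amd64 host to arm64).
RUN apt-get update && \
    apt-get install -y gcc-aarch64-linux-gnu g++-aarch64-linux-gnu
RUN CC=aarch64-linux-gnu-gcc CXX=aarch64-linux-gnu-g++ CGO_ENABLED=1 \
    GOOS=linux GOARCH=arm64 \
    go build -o /bin/apiserver ./backend/src/apiserver
```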

However, this seems very fragile to changes in the build environment, new CPU architectures, etc., so I looked into why we even include SQLite, and the answer seems to be that we only use SQLite for integration testing?
So perhaps it would make sense to exclude it from the production image?

One way to do this is to move the SQLite references into a separate db_sqlite.go file guarded by a // +build integration tag, and to run integration tests with go test --tags=integration.
That would make it possible to build the API server without additional C/C++ cross-compilers.
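
Something like this (a sketch; the package, helper name, and exact imports in the KFP codebase will differ):

```go
//go:build integration
// +build integration

// db_sqlite.go: only compiled when tests run with `go test --tags=integration`,
// so the production binary carries no CGO/SQLite dependency.
package storage

import (
	"github.com/jinzhu/gorm"
	_ "github.com/mattn/go-sqlite3" // registers the "sqlite3" driver
)

// newSQLiteDB is a hypothetical helper used only by integration tests;
// the real code would keep its existing function names.
func newSQLiteDB(path string) (*gorm.DB, error) {
	return gorm.Open("sqlite3", path)
}
```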

In fact, I have done this on our custom build, and now I can build the binary and Docker container without SQLite, using the same configuration changes as the other components mentioned above.

AndersBennedsgaard

I am considering contributing some of my changes here, but I can't really figure out how the images are built. I expect it has something to do with https://github.com/kubeflow/pipelines/blob/master/.cloudbuild.yaml? Perhaps @rimolive can give some pointers?

Also, what do you think of my proposal to remove SQLite from the final Go binary and only enable it for integration tests using build flags?

thesuperzapper (Member, Author) commented Jun 13, 2024

@AndersBennedsgaard if you want a quick way to build all the images for testing, you can use the same approach as the deployKF fork of Kubeflow Pipelines (deployKF/kubeflow-pipelines), which uses GitHub Actions (GHA) to build the images.

You can just take the same GHA configs that we added in this commit: deployKF@d800253. Even if you don't use the GHA configs directly, you can use them to figure out the full list of images that make up Kubeflow Pipelines and where each Dockerfile is.

NOTE: these workflows have build_platforms set to linux/amd64, but once you fix the ARM build issues you could update it to linux/amd64 linux/arm64 (whitespace-separated), and the images will then be built for both architectures; a sketch of the underlying GHA steps is below.
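
For orientation, the moving parts of such a multi-arch GHA build look roughly like this (a sketch using the standard docker/setup-qemu-action, setup-buildx-action, and build-push-action; the deployKF workflow's actual input names, Dockerfile path, and tag here are assumptions):

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-qemu-action@v3
      - uses: docker/setup-buildx-action@v3
      - name: Build and push persistenceagent
        uses: docker/build-push-action@v5
        with:
          context: .
          file: backend/Dockerfile.persistenceagent
          # build-push-action takes a comma-separated list here; the
          # deployKF `build_platforms` input is whitespace-separated.
          platforms: linux/amd64,linux/arm64
          push: true
          tags: ghcr.io/example/persistenceagent:latest
```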

NOTE 2: this excludes the gcr.io/tfx-oss-public/ml_metadata_store_server image, which is managed upstream (google/ml-metadata). I made a PR to allow building it on ARM (google/ml-metadata#188), but even if that were merged, Google doesn't know how to build ARM images (or something like that), so we have a fork for that too (deployKF/ml-metadata). Alternatively, you can just use the following image, which is cross-compiled for ARM/x86: ghcr.io/deploykf/ml_metadata_store_server:1.14.0-deploykf.0

AndersBennedsgaard commented Jun 13, 2024

@thesuperzapper as I mentioned in #10309 (comment), we already have KFP fully running on an ARM-only cluster, so I have already cross-compiled the images using BuildX + QEMU in our own fork (see the sketch below).
I was talking about contributing the changes back upstream, but if you say that "Google doesn't know how to build ARM images", it might be hard for me to do. Alternatively, we could consider switching the CI pipeline to GitHub Actions, since most (all?) other Kubeflow components already use this.
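
For anyone wanting to reproduce that locally, the cross-build boils down to something like this (a sketch; the Dockerfile path and registry/tag are placeholders):

```shell
# One-time setup: register QEMU emulators and a buildx builder,
# so an amd64 host can build arm64 layers (and vice versa).
docker run --privileged --rm tonistiigi/binfmt --install all
docker buildx create --name multiarch --use

# Cross-build one component for both architectures and push it.
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --file backend/Dockerfile.persistenceagent \
  --tag <your-registry>/persistenceagent:dev \
  --push .
```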

rimolive (Member) commented Jun 13, 2024

> Alternatively, we could consider switching the CI pipeline to GitHub Actions, since most (all?) other Kubeflow components already use this.

We are already working on migrating the CI pipelines to GitHub Actions. See #10744

AndersBennedsgaard commented Jun 17, 2024

@rimolive #10744 does not mention moving the release workflow logic to GH Actions. Should we include this in that issue?

@thesuperzapper would you mind adding all the relevant -license-compliance images built for KFP, such as gcr.io/ml-pipeline/workflow-controller?

rimolive (Member)

> @rimolive #10744 does not mention moving the release workflow logic to GH Actions. Should we include this in that issue?

Our priority is fixing the tests; we can figure out moving the release workflow to GHA later.

github-actions bot commented Aug 17, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the lifecycle/stale label on Aug 17, 2024
github-actions bot commented Sep 8, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions bot closed this as completed on Sep 8, 2024
github-project-automation bot moved this from Blocked to Closed in KFP Runtime Triage on Sep 8, 2024
thesuperzapper (Member, Author)

/reopen

google-oss-prow bot reopened this on Sep 8, 2024

google-oss-prow bot commented Sep 8, 2024

@thesuperzapper: Reopened this issue.

In response to this:

> /reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

github-project-automation bot moved this from Closed to Needs triage in KFP Runtime Triage on Sep 8, 2024
github-actions bot removed the lifecycle/stale label on Sep 9, 2024
github-actions bot commented Nov 9, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the lifecycle/stale label on Nov 9, 2024
tarilabs (Member) commented Nov 9, 2024

(still relevant, bumping comment to avoid stale status)

stale bot removed the lifecycle/stale label on Nov 9, 2024