
Umbrella Issue: Porting Kubeflow to IBM Power (ppc64le) #781

Open
15 of 54 tasks
lehrig opened this issue Oct 25, 2022 · 19 comments
@lehrig

lehrig commented Oct 25, 2022

/kind feature

Enable builds & releases for IBM Power (ppc64le architecture). This proposal was presented with these slides at the 2022-10-25 Kubeflow community call, with positive community feedback. We also created this design document: https://docs.google.com/document/d/1nGUvLonahoLogfWCHsoUOZl-s77YtPEiCjWBVlZjJHo/edit?usp=sharing

Why you need this feature:

  • Widen scope of possible on-premises deployments (vanilla Kubernetes & OpenShift on Power)
  • More general independence regarding processor architecture (x86, ppc64le, arm, …)
  • Unified container builds

Describe the solution you'd like:

  • Upstreaming changes that allow building Dockerfiles for multiple architectures (starting with x86 & ppc64le)
  • Upstreaming CI integration for multi-arch builds (starting with x86 & ppc64le)

We currently plan to divide our efforts into multiple phases:

  1. low-hanging "easy" integrations where no or minor code changes are needed; excluding KFP; Kubeflow 1.7 release scope (✅ done),
  2. same as 1. but now including additional KServe components for model serving; Kubeflow 1.8 release scope,
  3. same as 1. but now including KFP; Kubeflow 1.9 release scope,
  4. more complex integrations where external dependencies to python wheels exist.

Below is a detailed overview of each required integration, including links to associated PRs if those already exist.

Phase 1 Integrations (Kubeflow 1.7 scope)

Phase 2 Integrations (Kubeflow 1.9 scope)

  • KServe: PMML Server
  • KServe: AIX
  • KServe: Alibi
  • KServe: Art
  • Triton Inference Server (external)
  • Seldon: ML Server (external)
  • PyTorch: TorchServe (external)

Phase 3 Integrations (Kubeflow 1.10 scope)

Note: KFP is currently blocked by kubeflow/pipelines#8660 / GoogleCloudPlatform/oss-test-infra#1972

Phase 4 Integrations (Post Kubeflow 1.11 scope)

@kimwnasptd
Member

Thanks for creating this tracking issue @lehrig!

I'm on board with adding support for ppc64le, since this will greatly help KF adoption. The proposed plan makes sense.

My initial question at this time is whether we need to build different executables for this platform, which would mean we need a new set of images. I see in the PRs that the only needed change is to not set a specific platform, but I might be missing something.

Could you provide some more context on this one?

@lehrig
Author

lehrig commented Nov 3, 2022

@kimwnasptd, thanks for your support!

There are essentially 2 options for publishing images:

  1. Multi-arch images, where we publish only 1 "virtual" image with support for multiple architectures. A pull command will then only fetch the concrete container image for the required platform. To do so, I'd recommend using buildx (e.g., see https://www.docker.com/blog/multi-arch-build-and-images-the-simple-way/) because it is easier to use / more automated compared to manually creating a container manifest file for multiple architectures.
  2. Separate images per architecture.

IMO 1. should be the preferred solution. A challenge here will be that builds across the Kubeflow components are quite inconsistent. For example, some projects already use buildx while others don't. I'd opt for implementing more consistency within the scope of this endeavor, e.g., by migrating builds towards buildx where feasible. My team would be willing to drive this if this sounds good.
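
For illustration, a minimal sketch of how option 1 could look with buildx (the builder name and image reference below are placeholders, not actual Kubeflow build targets):

    # one-time setup: register QEMU emulators and create a multi-arch capable builder
    docker run --privileged --rm tonistiigi/binfmt --install all
    docker buildx create --name kf-multiarch --use

    # build one "virtual" image for both architectures and push its manifest list
    docker buildx build \
      --platform linux/amd64,linux/ppc64le \
      -t example.io/kubeflow/notebook-controller:dev \
      --push .

    # confirm the tag now resolves to one image per architecture
    docker buildx imagetools inspect example.io/kubeflow/notebook-controller:dev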

@kimwnasptd
Member

@lehrig I agree with the first approach as well, if it's viable to avoid having multiple manifests.

Docker's buildx seems promising. I hadn't used it in the past, but it seems quite straightforward. I don't have a hard preference on using buildx, as long as we don't lock ourselves in and end up with Dockerfiles that need specific Docker features and can only be built with Docker.

@lehrig
Author

lehrig commented Nov 23, 2022

As requested by @kimwnasptd, I'm quoting myself from kubeflow/kubeflow#6650 (comment) to clarify how we envision multi-arch builds:

Yes, it's good to let Go determine the arch, so we don't have to maintain an explicit arch list here or wrap boilerplate code in arch-specific if/else statements.

Instead, we now shift the control over which arch is actually built to the build system. If we do nothing special, the arch of the build machine is simply used (and as Kubeflow is currently built on amd64, we stay backwards-compatible with the current behavior).

In further PRs, we will additionally modify Docker-based builds to use buildx, where you can, for instance, do something like this:
docker buildx build --platform linux/amd64,linux/ppc64le ...

Here, Docker will actually run 2 builds: one for amd64 and one for ppc64le. When it reaches the Go code above, Go will pick up the external platform configuration and build for it correctly. In case no native hardware is available for the given platform, Docker will emulate the architecture using QEMU, so you can also build for other archs on amd64.

The final outcome is a single multi-arch image with support for all archs listed in Docker's platform statement.
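
To make this concrete, a small sketch assuming a hypothetical Go package path (./cmd/controller); buildx applies the same idea for each --platform entry:

    # with no GOOS/GOARCH set, the Go toolchain targets the build machine
    # (amd64 today), which preserves the current behavior
    go build -o bin/controller ./cmd/controller

    # the build system can override the target explicitly, e.g. cross-compile
    # for ppc64le on an amd64 machine
    GOOS=linux GOARCH=ppc64le go build -o bin/controller-ppc64le ./cmd/controller

    # buildx does the equivalent per --platform entry, falling back to QEMU
    # emulation when no native hardware for that platform is available
    docker buildx build --platform linux/amd64,linux/ppc64le -t my-image:multi --push .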

@kimwnasptd
Member

@lehrig @adilhusain-s: the first image in this repo with support for ppc64le is up! 🎉

https://hub.docker.com/layers/kubeflownotebookswg/notebook-controller/80f695e/images/sha256-2870219816f6be1153ca97eb604b4f20393c34afdb4eade83f0966ccf90f8018?context=explore

@kimwnasptd
Member

@lehrig @pranavpandit1 @adilhusain-s I realized that right now we've implemented the logic for using docker buildx only in the Actions that run when a PR is merged, and not when a PR is opened.

Realized that the Centraldashboard image was not getting built, even though the PR checks were green:
28a24ffb170769a228d46a19892f7420b22a0816
74f020e0d9c3f58712a3b466f9d1bb86c4607beb
65e41bf28b8e79be4e1f822afe56e218c69db8a1

We fixed the issue for this in kubeflow/kubeflow#6960, but we should be able to catch errors for the multi-arch build when a PR is opened as well.

The fix should be straightforward. We'll just need to use the same build command in both types of actions. Referencing the relevant parts for one component (the central dashboard):
https://github.com/kubeflow/kubeflow/blob/master/.github/workflows/centraldb_intergration_test.yaml#L24
https://github.com/kubeflow/kubeflow/blob/master/.github/workflows/centraldb_docker_publish.yaml#L37-L41
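
A hedged sketch of that idea, reduced to the shell command the two workflows would share (image name and tag variable are placeholders; the actual workflows are GitHub Actions YAML):

    # shared multi-arch build step for both workflow types
    PLATFORMS="linux/amd64,linux/ppc64le"
    IMG="docker.io/kubeflownotebookswg/centraldashboard:${GIT_SHA}"   # placeholder tag

    # PR check: run the exact same multi-arch build so errors surface before
    # merge, but don't push anything
    docker buildx build --platform "${PLATFORMS}" -t "${IMG}" .

    # publish workflow (on merge to master): identical command plus --push
    docker buildx build --platform "${PLATFORMS}" -t "${IMG}" --push .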

@kimwnasptd
Member

Do you have cycles to help with this effort?

@lehrig
Author

lehrig commented Feb 15, 2023

@kimwnasptd let me confirm with the team but I think we can handle it

@pranavpandit1

@kimwnasptd let me confirm with the team but I think we can handle it

@kimwnasptd: Thanks for all the input. We have started looking into the required changes and will keep everyone updated once we start raising PRs for them.

@lehrig
Author

lehrig commented Feb 17, 2023

Note: I updated the main description by adding a phase for the Kubeflow 1.8 scope and linking to this new design document: https://docs.google.com/document/d/1nGUvLonahoLogfWCHsoUOZl-s77YtPEiCjWBVlZjJHo/edit?usp=sharing

We use this document to discuss Phase 2 with the KFP community (related to kubeflow/pipelines#8660).

@kimwnasptd
Member

Thanks @pranavpandit1! I also took a look at how to get these to work while fixing the workflows for the CentralDashboard. You can take a look at this PR and some comments: kubeflow/kubeflow#6961

@kimwnasptd
Member

kimwnasptd commented Mar 24, 2023

@lehrig I think we bumped into a side effect that I hadn't thought about initially. Building the images in GH Actions (which emulates ppc64le via QEMU) is actually slow.

Looking at an open PR kubeflow/kubeflow#7060 that touches some web apps I see the following:

  1. The workflow that builds for both platforms (VWA) takes 61 minutes (!)
  2. The workflow that builds with the old way (TWA) takes 9 minutes

The difference is huge, so I want to re-evaluate the approach of building for ppc64le when PRs are opened.

From your experience, in which cases is it most probable for the x86 build to succeed but the ppc64le build to fail?

@lehrig
Author

lehrig commented Mar 27, 2023

@kimwnasptd yeah, I agree that this is suboptimal. The answer is obviously "it depends"; however, I think we have some hard evidence here that we should not proceed as originally planned. I see the following options for builds when a PR is opened.

  1. Exclude non-x86 archs as long as native hardware is unavailable (example: Docker Fixes: Ensure multiple architecture build is disabled for PRs DSpace/dspace-angular#1667).
  2. Exclude non-x86 archs only if building them takes too long.
  3. Wait for native ppc64le out-of-the-box support in GHA, which hopefully comes this year (this will not slow down builds, as emulation is not used).
  4. Integrate a SSH-based connection to native hardware we can provide into the workflow (see this example: https://github.com/adilhusain-s/multi-arch-docker-example/blob/main/.github/workflows/native_docker_builder.yaml#L31).
  5. Integrate a GitHub app that connects GHA builds to native hardware when needed (experimental).
  6. (not sure this is technically possible) Start the ppc64le QEMU build asynchronously & don't let the PR wait for its completion, so it doesn't block.

Note: Options 1, 2 & 6 are based on my observation that ppc64le typically builds error-free when x86 builds without errors. Hence, we can typically accept PRs based on the x86 build alone. Rare corner cases are then discovered on PR merge.

If exclusion (option 1 or 2) is OK, I'd go for that & later migrate to option 3 once native ppc64le GHA support becomes available later this year. Option 4 is possible but would require some additional effort and organization on our side, so I see it only as a backup option. Same for option 5. Option 6 has not been tested thus far, so I would not go for it at the moment.
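
For options 1/2, a rough sketch of what the conditional could look like in a build script, assuming the workflow exposes GitHub's standard GITHUB_EVENT_NAME variable and using a placeholder image name:

    # GITHUB_EVENT_NAME is set by GitHub Actions; the image name is a placeholder
    IMG="docker.io/example/centraldashboard:test"
    EXTRA_ARGS=""

    if [ "${GITHUB_EVENT_NAME}" = "pull_request" ]; then
      PLATFORMS="linux/amd64"                 # fast, native-only check on PRs
    else
      PLATFORMS="linux/amd64,linux/ppc64le"   # full multi-arch build on merge
      EXTRA_ARGS="--push"                     # publish only from the merge workflow
    fi

    docker buildx build --platform "${PLATFORMS}" ${EXTRA_ARGS} -t "${IMG}" .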

@lehrig
Author

lehrig commented Mar 28, 2023

Here are some stats that help to get a feeling for those options:

Notebook-controller build

  • Native ppc64le: 2.2 min
  • QEMU ppc64le: 12 min 47 sec
  • Native x86: 2 min 6 sec

Volume-web-app build

  • Native ppc64le: 8.21 min
  • QEMU ppc64le: 36 min 35 sec
  • Native x86: 6 min 56 sec

Central-dashboard build

  • Native ppc64le: 3.22 min
  • QEMU ppc64le: 10 min 26 sec
  • Native x86: 1 min 55 sec

@lehrig
Author

lehrig commented Mar 28, 2023

I discussed the options with the team. Here is our proposal:

  • On PR opened, we recommend option 2: build ppc64le in PRs only if the build doesn't take too long, and otherwise disable ppc64le builds as showcased in Docker Fixes: Ensure multiple architecture build is disabled for PRs DSpace/dspace-angular#1667.
  • Looking at the stats above, we believe "not too long" holds for all builds <= 30 min.
  • As soon as option 3 becomes available, migrate all workflows to it: run everything natively and enable it for all opened PRs.
  • On PR merged, we recommend always building all supported architectures.
  • We also recommend generally improving build performance by enabling caching during builds, which should lower build times by 30-40%: Improve build performance via caching #779 (a rough sketch of the cache flags is at the end of this comment). With caching enabled, more components would come under the 30 min. threshold.

@kimwnasptd, does that sound good? What do you think about the 30 min. threshold?
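
To illustrate the caching point (#779): buildx can persist its layer cache between workflow runs via the gha cache backend; a hedged sketch with a placeholder image name (this backend only works inside GitHub Actions):

    # export/import the layer cache through GitHub's Actions cache service so
    # repeated builds skip unchanged layers for both amd64 and ppc64le
    docker buildx build \
      --platform linux/amd64,linux/ppc64le \
      --cache-from type=gha \
      --cache-to type=gha,mode=max \
      -t docker.io/example/notebook-controller:latest \
      --push .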

@lehrig
Author

lehrig commented Mar 28, 2023

Still have to answer this:

From your experience, in which cases is it most probable for the x86 build to succeed but the ppc64le build to fail?

This seldom happens once ppc64le support is there. The only harder cases are new 3rd-party dependencies, for example additional Python wheels that are unavailable on ppc64le (wheels are thus in Phase 3 of this endeavor). With Go/Java/JS code we typically don't see these kinds of issues, as they are more architecture-independent than the Python ecosystem.

@kimwnasptd
Member

@lehrig thanks for the detailed explanation! I agree with your proposal and rationale. So my current understanding is the following, but please tell me if I'm missing something:

  1. We can skip building multi-platform images when testing PRs, since we don't expect any issues
  2. Build for all architectures when a PR is merged, and have GHA build and publish the images
  3. Once we have native ppc64le support for GHA out-of-the-box, we can migrate the workflows to use it

At the same time we can also work on caching in parallel (#779). Also, if in the future we see a lot of issues when building/pushing the images across architectures, we can come back to evaluating multi-arch builds on opened PRs.

@lehrig
Author

lehrig commented Aug 10, 2023

Updated the list of integrations by expanding the phases & adding some smaller images for KServe + Katib. KFP is still moving slowly as it builds in another CI system, so we will first focus more on KServe and Katib.

@andreyvelich
Member

Let's continue this discussion in the community repo.
/transfer community

@google-oss-prow google-oss-prow bot transferred this issue from kubeflow/kubeflow Oct 17, 2024