Umbrella Issue: Porting Kubeflow to IBM Power (ppc64le) #781
Thanks for creating this tracking issue @lehrig! I'm on board with adding support for ppc64le, since this will greatly help KF adoption. The proposed plan makes sense. My initial question at this time is whether we need to build different executables for this platform, which would mean we need a new set of images. I see in the PRs that the only needed change is to actually not set a specific platform, but I might be missing something. Could you provide some more context on this one?
@kimwnasptd, thanks for your support! There are essentially 2 options for publishing images:
1. Publish a single multi-arch image per component, whose manifest list covers both amd64 and ppc64le.
2. Publish separate, architecture-specific images (and therefore multiple manifests) per component.
IMO 1. should be the preferred solution. A challenge here will be that builds across Kubeflow components are quite inconsistent. For example, some projects already use multi-arch tooling like `docker buildx`, while others do not.
@lehrig I agree with the first approach as well, if it's viable to avoid having multiple manifests. Docker's `buildx` should make it possible to build and push a single manifest list covering both architectures.
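For reference, a minimal sketch of what such a single-manifest, multi-arch build could look like with `docker buildx` (the image name and tag are placeholders, not an actual Kubeflow build command):

```sh
# One-time setup: register QEMU emulators so non-native platforms can be built.
docker run --privileged --rm tonistiigi/binfmt --install all

# Create and select a builder instance that supports multi-platform builds.
docker buildx create --name multiarch --use

# Build for both architectures and push a single manifest list to the registry.
docker buildx build \
  --platform linux/amd64,linux/ppc64le \
  --tag example/app:latest \
  --push .
```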
As requested by @kimwnasptd, quoting myself from kubeflow/kubeflow#6650 (comment) to clarify how we envision multi-arch builds:
@lehrig @adilhusain-s! The first image in this repo with support for ppc64le is up! 🎉
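As an aside, one way to verify that a published tag actually carries a ppc64le entry in its manifest list (the tag shown is illustrative):

```sh
# Inspect the manifest list; the output should include a linux/ppc64le platform.
docker buildx imagetools inspect kubeflownotebookswg/poddefaults-webhook:latest
```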
@lehrig @pranavpandit1 @adilhusain-s I realized that right now we've only implemented the multi-arch build logic in the workflows that push images. Because of this, the CentralDashboard image was not getting built, even though the PR checks were green. We fixed the issue in kubeflow/kubeflow#6960, but we should be able to catch errors for the multi-arch build when a PR is opened as well. The fix should be straightforward: we'll just need to use the same build command in both types of actions.
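To illustrate the idea of reusing the same build command, a rough sketch (not the exact workflow code; image names are placeholders): the PR check can run the identical multi-arch build and simply omit `--push`, so cross-arch build failures surface before merge:

```sh
# PR check: build both platforms to catch errors, but don't publish anything.
docker buildx build --platform linux/amd64,linux/ppc64le --tag example/app:pr-check .

# Push workflow: the same command, plus --push to publish the manifest list.
docker buildx build --platform linux/amd64,linux/ppc64le --tag example/app:latest --push .
```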
Do you have cycles to help with this effort?
@kimwnasptd let me confirm with the team, but I think we can handle it.
@kimwnasptd: Thanks for all the inputs!
Note: I updated the main description by adding a phase for Kubeflow 1.8 scope and linking to this new design document: https://docs.google.com/document/d/1nGUvLonahoLogfWCHsoUOZl-s77YtPEiCjWBVlZjJHo/edit?usp=sharing We use this document to discuss Phase 2 with the KFP community (related to kubeflow/pipelines#8660).
Thanks @pranavpandit1! I also took a look at how to get these to work while fixing the workflows for the CentralDashboard. You can take a look at this PR and some comments: kubeflow/kubeflow#6961
@lehrig I think we bumped into a side-effect that I hadn't thought about initially. Building the images in GH Actions (which emulates ppc64le via QEMU) is actually slow. Looking at an open PR, kubeflow/kubeflow#7060, that touches some web apps, I see the following:
The difference is huge, so I want to re-evaluate the approach of building for ppc64le when PRs are opened. From your experience, in which cases is it most likely for the x86 build to succeed but the ppc64le build to fail?
@kimwnasptd yeah, I agree that this is suboptimal. The answer is obviously "it depends"; however, I think we have some hard evidence here that we should not proceed as originally planned. I see the following options for builds when a PR is opened:
Note: Options 1, 2 & 6 are based on my observation that ppc64le typically builds error-free when x86 builds without errors. Hence, we can typically accept PRs based on the x86 build alone; rare corner cases are then discovered on PR merge. If exclusion (options 1 or 2) is OK, I'd go for that option and later migrate to option 3 once native ppc64le GHA support becomes available later this year. Option 4 is possible but would require some additional effort and organization on our side, so I see it only as a backup option. The same goes for option 5. Option 6 has not been tested thus far, so I would not go for it at the moment.
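To make the exclusion options concrete, a hypothetical sketch of a workflow step that picks the platform set based on the triggering event (`GITHUB_EVENT_NAME` is set automatically by GitHub Actions; the build command is a placeholder):

```sh
# Build only x86 on pull requests; build the full platform set on merge/push.
if [ "$GITHUB_EVENT_NAME" = "pull_request" ]; then
  PLATFORMS="linux/amd64"
else
  PLATFORMS="linux/amd64,linux/ppc64le"
fi
docker buildx build --platform "$PLATFORMS" --tag example/app:latest .
```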
Here are some stats that help to get a feeling for those options:
- Notebook-controller build
- Volume-web-app build
- Central-dashboard build
I discussed the options with the team. Here is our proposal:
@kimwnasptd, does that sound good? What do you think about the 30-minute threshold?
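One hypothetical way to enforce such a ceiling at the command level is coreutils `timeout` (GitHub Actions also offers a built-in per-step `timeout-minutes` setting; image name is a placeholder):

```sh
# Fail the build step if the emulated ppc64le build exceeds 30 minutes.
timeout 30m docker buildx build --platform linux/ppc64le --tag example/app:latest .
```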
Still have to answer this:
This seldom happens once ppc64le support is there. The only case that is a bit harder is new 3rd-party dependencies, for example, additional Python wheels unavailable on ppc64le (wheels are thus in Phase 3 of this endeavor). With Go/Java/JS code we typically don't see these kinds of issues, as they are more architecture-independent than the Python ecosystem.
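As a quick illustration of the wheel problem, `pip download` can check ahead of time whether a dependency ships a prebuilt wheel for ppc64le (the package and versions here are only examples):

```sh
# Fails if PyPI has no prebuilt ppc64le wheel for this package/Python version.
pip download numpy --only-binary=:all: \
  --platform manylinux2014_ppc64le --python-version 3.10 -d /tmp/wheels
```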
@lehrig thanks for the detailed explanation! I agree with your proposal and rationale. So my current understanding is the following, but please tell me if I'm missing something:
At the same time, we can also work on caching in parallel (#779). Also, if in the future we see that there are a lot of issues when building/pushing the images across architectures, we can come back to evaluating building multi-arch images during open PRs.
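On the caching point, `docker buildx` supports external cache backends, which can offset some of the QEMU overhead; a sketch using the GitHub Actions cache backend (flags are illustrative, not the project's actual configuration):

```sh
# Reuse image layers across workflow runs via the GitHub Actions cache backend.
docker buildx build \
  --platform linux/amd64,linux/ppc64le \
  --cache-from type=gha \
  --cache-to type=gha,mode=max \
  --tag example/app:latest .
```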
Updated the list of integrations by expanding phases & adding some smaller images for KServe + Katib. KFP is still moving slowly as it builds in another CI system, so we will focus more on KServe and Katib first.
Let's continue this discussion in the community repo.
/kind feature
Enable builds & releases for IBM Power (ppc64le architecture). This proposal was presented with these slides at the 2022-10-25 Kubeflow community call with positive community feedback. We also created this design documentation: https://docs.google.com/document/d/1nGUvLonahoLogfWCHsoUOZl-s77YtPEiCjWBVlZjJHo/edit?usp=sharing
Why you need this feature:
Describe the solution you'd like:
We currently plan to divide our efforts into multiple phases:
Below is a detailed overview of each required integration, including links to associated PRs where they already exist.
Phase 1 Integrations (Kubeflow 1.7 scope)
🚀 https://hub.docker.com/r/kubeflownotebookswg/poddefaults-webhook/tags
🚀 https://hub.docker.com/r/kubeflownotebookswg/centraldashboard/tags
🚀 https://hub.docker.com/r/kubeflownotebookswg/jupyter-web-app/tags
🚀 https://hub.docker.com/r/kserve/agent/tags
🚀 https://hub.docker.com/r/kserve/kserve-controller/tags
🚀 https://hub.docker.com/r/kserve/models-web-app/tags
🚀 https://hub.docker.com/r/kserve/qpext/tags
🚀 https://hub.docker.com/r/kserve/router/tags
🚀 https://hub.docker.com/r/mpioperator/mpi-operator/tags
🚀 https://hub.docker.com/r/kubeflownotebookswg/notebook-controller/tags
🚀 https://hub.docker.com/r/kubeflownotebookswg/profile-controller/tags
🚀 https://hub.docker.com/r/kubeflownotebookswg/kfam/tags
🚀 https://hub.docker.com/r/kubeflownotebookswg/tensorboards-web-app/tags
🚀 https://hub.docker.com/r/kubeflow/training-operator/tags
🚀 https://hub.docker.com/r/kubeflownotebookswg/volumes-web-app/tags
Phase 2 Integrations (Kubeflow 1.9 scope)
Phase 3 Integrations (Kubeflow 1.10 scope)
Note: KFP is currently blocked by kubeflow/pipelines#8660 / GoogleCloudPlatform/oss-test-infra#1972
Phase 4 Integrations (Post Kubeflow 1.11 scope)
OIDC Auth (external): Enable oidc-authservice repository CI for the Power (ppc64le) architecture. arrikto/oidc-authservice#104; on hold as potentially irrelevant as of Kubeflow v1.8 (Move away from AuthService, manifests#2469).