Split AI Lab Recipes from RHEL AI Images #771

cooktheryan · 2024-08-28T17:59:02Z

Currently it is very difficult to understand how to contribute new recipes to this repository as it has grown to include additional things outside the scope of the podman extension recipes. The idea would be to somehow split the repositories so that the various stakeholders still have what is needed while making it easy for the community or RH contributors to add content to the individual pieces important to them.

/cc @sallyom @MichaelClifford @rhatdan @Gregory-Pereira

rhatdan · 2024-08-28T18:29:08Z

There is talk about splitting the training section out into its own repository or repositories. The question is whether there should be one or three: ai-training-amd or ai-training/amd?

fabiendupont · 2024-08-29T07:36:58Z

Yes, I have started the exercise to move the bootc images outside of ai-lab-recipes.

Here is an example GitHub org showing what it could look like: https://github.com/smgglrs-ai/.
The images are pushed to quay.io/smgglrs-ai and a sample of CentOS Stream images is already present. I have confirmed that the RHEL images build, but I have not pushed them.
Fedora tends to be a bit more complicated, because none of the vendors provide RPMs for Fedora 40. So, I shelved Fedora for now.

These images are meant to be used as base images to install AI Lab recipes, so they only have the hardware enablement components and no prebaked application container images or cloud specific tools.

In my opinion, the application container images should be added as we specialize the image for a given recipe. And if we want to ship to a specific cloud, we should add the relevant packages during the final image (AMI, VHD, etc.) build, probably as an image builder feature.

slemeur · 2024-08-29T07:56:44Z

cc @jeffmaury @benoitf

fabiendupont · 2024-08-29T13:02:14Z

Here is a proposal for creating new repositories under https://github.com/containers:

driver-toolkit

This container image can be used by any stack to build out-of-tree drivers for a given kernel.
The images will be tagged with the kernel version, so it's easy to know which kernel it can be used for.
To build images, one would have to pass a build argument with the kernel version. This can be found via skopeo inspect in the Makefile.
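
As a rough illustration of that flow, the sketch below reads the kernel version out of `skopeo inspect` JSON output and composes the matching build command. The `ostree.linux` label, the `KERNEL_VERSION` build argument name, and the `quay.io/example/driver-toolkit` repository are assumptions made for the example, not names confirmed in this thread.

```python
import json


def kernel_version_from_inspect(inspect_json: str, label: str = "ostree.linux") -> str:
    """Return the kernel version stored in the labels of `skopeo inspect` JSON output.

    The label name is an assumption; adjust it to whatever the base image exposes.
    """
    labels = json.loads(inspect_json).get("Labels") or {}
    if label not in labels:
        raise ValueError(f"label {label!r} not found in image metadata")
    return labels[label]


def driver_toolkit_build_command(repo: str, kernel_version: str) -> list[str]:
    """Compose a podman build command that passes the kernel version as a
    build argument and tags the resulting image with the same version.

    KERNEL_VERSION is a hypothetical build argument name.
    """
    return [
        "podman", "build",
        "--build-arg", f"KERNEL_VERSION={kernel_version}",
        "--tag", f"{repo}:{kernel_version}",
        ".",
    ]


if __name__ == "__main__":
    # Trimmed-down stand-in for real `skopeo inspect docker://<base image>` output.
    sample = '{"Labels": {"ostree.linux": "5.14.0-427.el9.x86_64"}}'
    version = kernel_version_from_inspect(sample)
    print(" ".join(driver_toolkit_build_command("quay.io/example/driver-toolkit", version)))
```

In a Makefile the same lookup would typically be a one-line shell call; the Python form is only meant to make the two steps (look up the version, then pass it as both build argument and tag) explicit.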

bootc-amd-rocm, bootc-intel-gaudi, bootc-nvidia-cuda

The bootc images are derived from the {fedora,centos,rhel}-bootc images. They enable the hardware accelerator for a given stack, up to the container runtime configuration.
The naming convention includes both the vendor and the stack, in order to allow multiple stacks per vendor. For example, the Intel Gaudi and Falcon Shores stacks will coexist.
The output of this repository is base images without any pre-loaded container image, letting users layer them in a separate flow, keeping the bootc-<vendor> images generic. We would keep the additional storage configuration, so that users only have to use podman pull --root /usr/lib/containers/storage.
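
A minimal sketch of that separate layering flow, assuming a downstream build step that preloads application images into the shared storage path. The image name is hypothetical; the `--root` path and command form are the ones quoted above.

```python
BOOTC_STORAGE_ROOT = "/usr/lib/containers/storage"


def preload_command(image: str, root: str = BOOTC_STORAGE_ROOT) -> list[str]:
    """Compose the podman pull command that stores an image under the
    additional storage location instead of the default graph root."""
    return ["podman", "pull", "--root", root, image]


def preload_commands(images: list[str], root: str = BOOTC_STORAGE_ROOT) -> list[list[str]]:
    """Compose one pull command per image, preserving the input order."""
    return [preload_command(image, root) for image in images]


if __name__ == "__main__":
    # Hypothetical application image layered on top of a generic bootc-<vendor> base.
    for command in preload_commands(["quay.io/example/instructlab:latest"]):
        print(" ".join(command))
```

Keeping this step out of the base image build is what lets the bootc-<vendor> images stay generic: only the downstream, recipe-specific image runs these pulls.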

Cleanup

The other folders under training could be removed at this stage.
The deepspeed, instructlab, model and vllm images have been combined in instructlab, which is built from https://github.com/instructlab/instructlab, with its own lifecycle.
The ilab wrapper could be contributed to the InstructLab project as a way to hide the complexity of the podman/docker command. It is useful in general.
The upgrade-informer logic could become a standalone RPM that is used in all bootc images, if we think it's valuable. It doesn't really belong to AI Lab.
The tests should also be split into the new repositories to provide stack specific test suites.

If we need more images for specific recipes, we can create new repositories or add them to the recipes folder, based on the level of dependency of their lifecycles.
However, I think it is better to contribute to the upstream projects, including build recipes. We can contribute Containerfiles based on Fedora for bleeding edge, as well as CentOS Stream for Enterprise Linux incubation.

rhatdan · 2024-09-04T12:36:00Z

Why such a huge proliferation of repos? Why not keep them under a bootc-ai repo, or something similarly named?
ai-containers?

fabiendupont · 2024-09-06T07:57:08Z

They have different lifecycles and require different expertise. An AMD stack contributor may not be relevant for NVIDIA code reviews. And we're currently talking about splitting the repository because of the proliferation of subfolders, which complicates the whole structure.

lmilbaum · 2024-09-06T09:45:28Z

Another reason would be CI complexity. The more artifacts, the more complex the CI.

rhatdan · 2024-09-09T21:07:44Z

But there is also interaction between these repos: in some cases we want to share content and not force people to open the same change in three different repositories. Finally, these repos are going to be fairly tiny, just a couple of Containerfiles?

fabiendupont · 2024-09-13T12:43:59Z

These repositories have a similar structure, but they don't really share much. The only thing that is identical is the update service, which could become an RPM to be shipped independently.

jeffmaury · 2024-09-13T15:44:48Z

So my understanding is that this repo will keep model-servers and recipes, so this is good for us (Podman AI Lab team).

rhatdan · 2024-09-16T18:46:53Z

Yes, just training is moving out.

fabiendupont · 2024-09-17T15:44:02Z

Actually, we could keep the training folder for training recipes, but we would move out most of the current artifacts, because they are not AI recipes.

rhatdan · 2024-09-18T13:57:41Z

There are no "recipes" for training; this was just thrown there so that we could start the process of building an AI Training project. It can be moved out without affecting other uses of ai-lab-recipes.
