
feat(containers): experimentation with hugging face models #84

Open · wants to merge 11 commits into base: main
Conversation

@redanrd redanrd commented Apr 22, 2024

Summary

In this example, we experiment with deploying lightweight Hugging Face models (Phi, Llama, and Mistral) from a public HF model repository. The models are deployed with Terraform and benchmarked with a Python script.

Checklist

  • I have reviewed this myself.
  • I have attached a README to my example. You can use this template as reference.
  • I have updated the project README to link my example.

@redanrd redanrd changed the title feat(containers): inference with hugging face models feat(containers): experimentation with hugging face models May 7, 2024
@redanrd redanrd marked this pull request as ready for review May 7, 2024 10:53
@redanrd redanrd requested review from Shillaker and cyclimse May 7, 2024 14:01
"file": "llama-2-7b.Q2_K.gguf",
"source" : "https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q2_K.gguf",
"size_gb": "2.83",
"ctn_endpoint": "paste container endpoint here"

issue(improvement): we can template this file with Terraform as part of the deployment. You can see an example of templating container URLs in this example where we template a shell script.
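As a sketch of what that templating could look like (the resource and attribute names here are assumptions, not taken from this PR), Terraform's built-in `templatefile` function can render the JSON with the deployed endpoint filled in:

```hcl
# Sketch only: "scaleway_container.model" and the template filename are
# hypothetical; adapt them to the actual resources in this example.
resource "local_file" "hf_models_config" {
  filename = "${path.module}/hf-models.json"
  content = templatefile("${path.module}/hf-models.json.tftpl", {
    ctn_endpoint = scaleway_container.model.domain_name
  })
}
```

The `.tftpl` file would then contain `"ctn_endpoint": "${ctn_endpoint}"` instead of the paste-here placeholder.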

```python
for _ in range(num_samples):
    try:
        print(
            "Calling model {model} on endpoint {endpoint} with message {message}".format(
```

issue(syntax): use f-strings by default.
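For instance, the `.format()` call above could be written as an f-string (the variable values below are placeholders, assumed from the snippet):

```python
model = "llama-2-7b"
endpoint = "https://example.com"  # hypothetical endpoint
message = "Hello"

# f-string: interpolates the variables directly, no .format() needed
line = f"Calling model {model} on endpoint {endpoint} with message {message}"
print(line)
```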

@@ -0,0 +1,33 @@
# Hugging Face Models


issue(docs): this README has a lot of good technical detail, but no high-level explanation of what the example does. We need to explain what the example does, what SCW resources it uses, and link to Hugging Face and the models used (and any interesting Python libraries we use too).

@@ -0,0 +1,33 @@
# Hugging Face Models

### Deploy models in Serverless Containers

issue(structure): our examples should all use the standard README format included in the top-level of the repo.

- Export these variables:

```bash
export SCW_ACCESS_KEY="access-key" SCW_SECRET_KEY="secret-key" SCW_PROJECT_ID="project-id" REGION="fr-par"
```

suggestion(tfvars): make these Terraform variables and give region a default of fr-par.
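A minimal sketch of what those variables could look like (names are assumptions; the provider can read the access and secret keys from the standard `SCW_*` environment variables, so they need not be variables at all):

```hcl
variable "region" {
  type    = string
  default = "fr-par"
}

variable "project_id" {
  type = string
}
```

Values can then be passed via a `terraform.tfvars` file or `TF_VAR_project_id` in the environment.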

```bash
cd terraform && bash terraform.sh -a
```


issue(docs): can you add a section on how to call one of the inference endpoints?

If you add the endpoints as a Terraform output, you can write a command that users can copy-paste using `terraform output`. See the other examples for how to do this.

There should be a command to call the "hello" endpoint to check they are working, then ideally a command for how to get an inference decision.
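Such a section could sketch the calls like this (the `endpoint` output name and the inference payload shape are assumptions, not taken from this PR):

```bash
ENDPOINT=$(terraform output -raw endpoint)

# health check against the "hello" route
curl "https://$ENDPOINT/hello"

# hypothetical inference request; adjust route and payload to the actual API
curl -X POST "https://$ENDPOINT/generate" \
  -H "Content-Type: application/json" \
  -d '{"message": "Hello"}'
```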

@@ -0,0 +1,56 @@
#!/bin/bash


issue(scripting): These kinds of commands should be managed in a Makefile.
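A minimal Makefile sketch for this kind of workflow (target and variable names are assumptions):

```makefile
# Sketch only: IMAGE and target names are hypothetical.
.PHONY: build push deploy destroy

build:
	docker build -t $(IMAGE) .

push:
	docker push $(IMAGE)

deploy:
	cd terraform && terraform init && terraform apply -auto-approve

destroy:
	cd terraform && terraform destroy -auto-approve
```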

```bash
eval "$(jq -r '.[]|.[]|"hf_models[\(.file)]=\(.source)"' hf-models.json)"

# Login to docker Scaleway's registry
docker login "rg.$REGION.scw.cloud" -u nologin --password-stdin <<< "$SCW_SECRET_KEY"
```

suggestion(simplify): We don't need to log into the repo every time, this can be a one-off step at the start (and listed in the README).

```bash
# Initialize, plan, and deploy each model in a Terraform workspace
apply() {
  terraform init
  for model_file_name in "${!hf_models[@]}";
```

question(terraform): Instead of using bash for-loops here, can we use for_each in the Terraform files to iterate over the list of models in a variable?
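A sketch of that approach (the variable shape and resource name are assumptions; only the `for_each` pattern itself is the point):

```hcl
# Hypothetical variable holding the models from hf-models.json
variable "hf_models" {
  type = map(object({
    file   = string
    source = string
  }))
}

# One container per model, keyed by model name
resource "scaleway_container" "model" {
  for_each = var.hf_models
  name     = each.key
  # ... image, resources, env vars referencing each.value.file / each.value.source
}
```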


```dockerfile
RUN pip install -r requirements.txt

RUN pip install llama-cpp-python==0.2.62 \
```

nit: you can also include it in the requirements.txt directly:

```
flask==3.0.5
...
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
llama-cpp-python==0.2.62
```

Comment on lines +2 to +5
```hcl
region     = var.region
access_key = var.access_key
secret_key = var.secret_key
project_id = var.project_id
```

nit: IMO this is unnecessary; by default the provider reads its configuration from your config file or the environment, so I would leave this blank.
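That is, the provider block can simply be left empty and the credentials picked up from `~/.config/scw/config.yaml` or the `SCW_*` environment variables:

```hcl
provider "scaleway" {}
```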


```python
app = FastAPI()

print("loading model starts", flush=True)
```

nit: an alternative to adding `flush=True` to every print statement is to set `PYTHONUNBUFFERED=1` as an env var (or even set it in the Dockerfile) to avoid buffering stdout.
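In the Dockerfile that would be a single line:

```dockerfile
# Disable stdout buffering so container logs appear immediately
ENV PYTHONUNBUFFERED=1
```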
