feat(containers): experimentation with hugging face models #84
base: main
Conversation
"file": "llama-2-7b.Q2_K.gguf", | ||
"source" : "https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q2_K.gguf", | ||
"size_gb": "2.83", | ||
"ctn_endpoint": "paste container endpoint here" |
issue(improvement): we can template this file with Terraform as part of the deployment. You can see an example of templating container URLs in this example where we template a shell script.
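A minimal sketch of what that templating could look like (the resource, template file, and variable names here are hypothetical, not from this PR):

```hcl
# Hypothetical sketch: render hf-models.json at deploy time so the container
# endpoint is filled in by Terraform instead of pasted by hand.
resource "local_file" "hf_models" {
  filename = "${path.module}/hf-models.json"
  content = templatefile("${path.module}/hf-models.json.tftpl", {
    ctn_endpoint = scaleway_container.model.domain_name
  })
}
```

The `.tftpl` template would contain the same JSON as above, with `"ctn_endpoint": "${ctn_endpoint}"` in place of the placeholder string.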
```python
for _ in range(num_samples):
    try:
        print(
            "Calling model {model} on endpoint {endpoint} with message {message}".format(
```
issue(syntax): use f-strings by default.
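For illustration, the same log line with an f-string (the values here are hypothetical):

```python
# Hypothetical values for illustration
model = "llama-2-7b"
endpoint = "https://example.com/inference"
message = "hello"

# f-string: interpolates the variables directly, no .format() call needed
print(f"Calling model {model} on endpoint {endpoint} with message {message}")
# → Calling model llama-2-7b on endpoint https://example.com/inference with message hello
```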
```diff
@@ -0,0 +1,33 @@
# Hugging Face Models
```
issue(docs): this README has a lot of good technical detail, but no high-level explanation of what the example does. We need to explain what the example does, what SCW resources it uses, and link to Hugging Face and the models used (and any interesting Python libraries we use too).
```diff
@@ -0,0 +1,33 @@
# Hugging Face Models

### Deploy models in Serverless Containers
```
issue(structure): our examples should all use the standard README format included in the top-level of the repo.
- Export these variables:

```bash
export SCW_ACCESS_KEY="access-key" SCW_SECRET_KEY="secret-key" SCW_PROJECT_ID="project-id" REGION="fr-par"
```
suggestion(tfvars): make these Terraform variables and give `region` a default of `fr-par`.
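A sketch of the suggested variable declarations (names chosen to match the environment variables above; adjust to the project's conventions):

```hcl
# Hypothetical sketch: declare credentials as Terraform variables,
# with fr-par as the default region.
variable "access_key" {
  type      = string
  sensitive = true
}

variable "secret_key" {
  type      = string
  sensitive = true
}

variable "project_id" {
  type = string
}

variable "region" {
  type    = string
  default = "fr-par"
}
```

These can then be supplied via a `terraform.tfvars` file or `TF_VAR_*` environment variables instead of exported shell variables.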
```bash
cd terraform && bash terraform.sh -a
```
issue(docs): can you add a section on how to call one of the inference endpoints?
If you add the endpoints as a Terraform output, you can write a command that readers can copy-paste using `terraform output`. See the other examples for how to do this.
There should be a command to call the "hello" endpoint to check the containers are working, then ideally a command showing how to get an inference result.
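A sketch of such an output (the resource and output names here are hypothetical; `domain_name` is the endpoint attribute exposed by the Scaleway container resource):

```hcl
# Hypothetical sketch: expose each deployed container's endpoint as an output,
# keyed by model name, so readers can copy-paste it from `terraform output`.
output "endpoints" {
  value = { for name, ctn in scaleway_container.model : name => ctn.domain_name }
}
```

A README section could then show something like `curl "https://$(terraform output -raw endpoints)/hello"` for the health check (exact command depends on how the output is structured).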
```diff
@@ -0,0 +1,56 @@
#!/bin/bash
```
issue(scripting): these kinds of commands should be managed in a Makefile.
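A minimal sketch of the suggested Makefile (target names are hypothetical; only the `-a` flag appears in this PR):

```make
# Hypothetical sketch: wrap the deployment command in a Makefile target
# so it can be run as `make apply` from the repo root.
.PHONY: apply

apply:
	cd terraform && bash terraform.sh -a
```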
eval "$(jq -r '.[]|.[]|"hf_models[\(.file)]=\(.source)"' hf-models.json)" | ||
|
||
# Login to docker Scaleway's registry | ||
docker login "rg.$REGION.scw.cloud" -u nologin --password-stdin <<< "$SCW_SECRET_KEY" |
suggestion(simplify): We don't need to log into the repo every time, this can be a one-off step at the start (and listed in the README).
```bash
# Initialize, plan, and deploy each model in a Terraform workspace
apply() {
  terraform init
  for model_file_name in "${!hf_models[@]}";
```
question(terraform): instead of using bash for-loops here, can we use `for_each` in the Terraform files to iterate over the list of models in a variable?
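A sketch of the `for_each` approach (variable shape, resource names, and image path are hypothetical, not from this PR):

```hcl
# Hypothetical sketch: one container per model driven by for_each,
# instead of looping in bash over Terraform workspaces.
variable "hf_models" {
  type = map(string) # model file name => download URL
}

resource "scaleway_container" "model" {
  for_each       = var.hf_models
  name           = replace(each.key, ".", "-")
  namespace_id   = scaleway_container_namespace.main.id
  registry_image = "rg.fr-par.scw.cloud/hf-models/${each.key}:latest"
}
```

This keeps the whole deployment in a single `terraform apply` and makes the set of models visible in the plan.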
```dockerfile
RUN pip install -r requirements.txt

RUN pip install llama-cpp-python==0.2.62 \
```
nit: you can also include it in the requirements.txt directly:

```
flask==3.0.5
...
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
llama-cpp-python==0.2.62
```
```hcl
region     = var.region
access_key = var.access_key
secret_key = var.secret_key
project_id = var.project_id
```
nit: IMO this is unnecessary, the default behavior of the provider is to use your config file or the environment to get its configuration, so I would leave it blank
```python
app = FastAPI()

print("loading model starts", flush=True)
```
nit: an alternative to adding `flush=True` to every print statement is to set `PYTHONUNBUFFERED=1` as an env var (or even set it in the Dockerfile) to avoid buffering stdout.
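For example, a one-line Dockerfile addition (a config fragment, placed wherever fits the image's layout):

```dockerfile
# Disable Python's stdout buffering for the whole container,
# so logs appear immediately without flush=True on every print.
ENV PYTHONUNBUFFERED=1
```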
Summary
In this example, we experiment with deploying some lightweight Hugging Face models (Phi, Llama, and Mistral) from a public HF model repository. The models are deployed using Terraform and benchmarked using a Python script.
Checklist