This repo containerizes FLAN-T5 into a serving container using FastAPI.
Since it uses Hugging Face's transformers library, specifically the `AutoModelForSeq2SeqLM`
class, it should be able to load other seq2seq models supported by the library.
The model license can be found here.
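The serving code itself isn't reproduced in this README. As a rough sketch only (not the repo's actual app): a FastAPI app that loads the model with `AutoModelForSeq2SeqLM` and exposes the health and predict routes pointed at by the `AIP_*` environment variables might look like the following; the `MODEL_NAME` variable and the request/response payload shape are assumptions.

```python
# Hypothetical sketch of the serving app -- not the repo's actual code.
import os

import torch
from fastapi import FastAPI, Request
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumption: the model_name build arg is surfaced to the app as an env var.
MODEL_NAME = os.environ.get("MODEL_NAME", "google/flan-t5-large")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME).to(DEVICE)

app = FastAPI()


# Vertex AI injects AIP_HEALTH_ROUTE / AIP_PREDICT_ROUTE (also set in the
# docker run step below).
@app.get(os.environ.get("AIP_HEALTH_ROUTE", "/health"))
def health():
    return {"status": "ok"}


@app.post(os.environ.get("AIP_PREDICT_ROUTE", "/predict"))
async def predict(request: Request):
    # Vertex AI custom prediction containers exchange
    # {"instances": [...]} / {"predictions": [...]} JSON payloads.
    body = await request.json()
    prompts = body["instances"]
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(DEVICE)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    texts = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    return {"predictions": texts}
```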
Features:
- Text generation.
- Language translation.
- Sentiment analysis.
- Text classification.

Steps:
- Clone the repo if you haven't, then navigate to the `serving-flant5` folder.
- Build the container. Don't forget to change the `project_id` to yours.
  ```
  docker build --build-arg model_name=google/flan-t5-large . -t gcr.io/{project_id}/serving-t5:latest
  ```
- Run the container. You need NVIDIA Docker and a GPU.
  ```
  docker run -d -p 80:8080 --gpus all -e AIP_HEALTH_ROUTE=/health -e AIP_HTTP_PORT=8080 -e AIP_PREDICT_ROUTE=/predict gcr.io/{project_id}/serving-t5:latest
  ```
- Make predictions against the running container (a hypothetical request sketch appears after this list).
  ```
  python test_container.py
  ```
  For the remaining steps, you'll need Vertex AI enabled and to have authenticated with a service account that has the Vertex AI Admin or Editor role.
- Push the image.
  ```
  gcloud auth configure-docker
  docker push gcr.io/{project_id}/serving-t5:latest
  ```
- Deploy in Vertex AI Endpoints (a sketch of what this script might do appears after this list).
  ```
  python ../gcp_deploy.py --image-uri gcr.io/<project_id>/serving-t5:latest --machine-type n1-standard-8 --model-name flant5 --endpoint-name flant5-endpoint --endpoint-deployed-name flant5-deployed-name
  ```
- Test the endpoint (a sketch appears after this list).
  ```
  python generate_request_vertex.py
  ```
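
`test_container.py` is the repo's actual test script. As a hypothetical stand-in, a minimal request against the locally running container might look like this, assuming the Vertex-style `{"instances": [...]}` payload and the port mapping from the run step:

```python
# Hypothetical stand-in for test_container.py: POST a prompt to the
# container started above (host port 80 maps to the app's port 8080).
import requests

resp = requests.post(
    "http://localhost:80/predict",
    json={"instances": ["Translate to German: My name is Arthur."]},
    timeout=60,
)
# Expects a {"predictions": [...]} response.
print(resp.json())
```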
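`gcp_deploy.py` isn't shown here either. A rough sketch of what such a deploy script might do with the `google-cloud-aiplatform` SDK follows; the flag values come from the command above, while the region, GPU attachment, and route/port values are assumptions.

```python
# Rough sketch of a deploy script like gcp_deploy.py -- an assumption,
# not the repo's actual code.
from google.cloud import aiplatform

# Region is an assumption; replace {project_id} with yours.
aiplatform.init(project="{project_id}", location="us-central1")

# Upload the serving container as a Vertex AI Model resource.
model = aiplatform.Model.upload(
    display_name="flant5",
    serving_container_image_uri="gcr.io/{project_id}/serving-t5:latest",
    serving_container_predict_route="/predict",
    serving_container_health_route="/health",
    serving_container_ports=[8080],
)

# Create an endpoint and deploy the model to it. The GPU choice is an
# assumption; the container expects a GPU at runtime.
endpoint = aiplatform.Endpoint.create(display_name="flant5-endpoint")
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name="flant5-deployed-name",
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)
```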
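`generate_request_vertex.py` is the repo's script for calling the deployed endpoint. A minimal hypothetical equivalent using the Vertex AI SDK, assuming the endpoint display name from the deploy step:

```python
# Hypothetical stand-in for generate_request_vertex.py.
from google.cloud import aiplatform

aiplatform.init(project="{project_id}", location="us-central1")

# Find the endpoint created by the deploy step via its display name.
endpoint = aiplatform.Endpoint.list(filter='display_name="flant5-endpoint"')[0]

response = endpoint.predict(instances=["Translate to German: My name is Arthur."])
print(response.predictions)
```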