This is a local LLM (Large Language Model) service running in a Docker container. It uses the google-gemma-2-2b model and expects CUDA-compatible hardware for optimal performance. The application uses Gunicorn to serve a Flask API for generating text from prompts, loads the model with Hugging Face Transformers, and is packaged for simple deployment.
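The application code itself is not reproduced in this README. For orientation, here is a minimal sketch of what an app.py along these lines could look like; the model_path value, dtype, generation settings, and endpoint body are illustrative assumptions, not the project's actual code:

```python
# Hypothetical sketch only: the real app.py may differ. Assumes Flask,
# PyTorch, and Hugging Face Transformers (with accelerate for device_map).
import os

import torch
from flask import Flask, jsonify
from transformers import AutoModelForCausalLM, AutoTokenizer

app = Flask(__name__)

# Local model directory as mounted by the docker run command below.
model_path = "/app/model/models--google--gemma-2-2b"

tokenizer = AutoTokenizer.from_pretrained(model_path, token=os.getenv("LLAMA_TOKEN"))
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",              # place layers on the GPU where possible
    offload_folder="/app/offload",  # spill layers to disk when VRAM runs out
    torch_dtype=torch.bfloat16,
)

@app.route("/test", methods=["GET"])
def test():
    # Minimal smoke test: generate a short completion for a fixed prompt.
    inputs = tokenizer("Hello, world", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=20)
    return jsonify({"response": tokenizer.decode(outputs[0], skip_special_tokens=True)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=6000)
```

Under Gunicorn, this module would be launched as something like gunicorn --bind 0.0.0.0:6000 app:app rather than via app.run().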
my_docker_project/
|— model/your-model-name/
|   |— safetensors (model weights)
|   |— model config
|   |— other files from the trained model
|— offload/ (volume for offloading during inference)
|— Dockerfile
|— app.py
|— requirements.txt
|— README.md
|— .dockerignore
|— .env (not included)
- Docker installed on your system. See the Docker installation documentation for instructions.
- NVIDIA drivers and CUDA installed for GPU support (Docker also needs the NVIDIA Container Toolkit to pass the GPU through to containers). Refer to the NVIDIA CUDA Toolkit Documentation for more information.
- Two local directories: one for offloading model sections during inference and one for storing the trained model.
- A .env file with an access token for the required model from Hugging Face (see the example below).
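For example, a minimal .env might look like this (the variable name matches the one the run command below passes into the container; the token value is a placeholder):

LLAMA_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx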
To build the Docker image, run the following command from the root of the project directory:
docker build -t localgemma .
The Dockerfile will try to build using the NVIDIA runtime PyTorch image first. If it fails, it will fall back to a base CUDA image and create the environment from scratch.
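The run command below assumes three shell variables: MODEL_DIR and OFFLOAD_DIR pointing at the two local directories from the prerequisites, and LLAMA_TOKEN holding your Hugging Face token. For example (paths are placeholders):

export MODEL_DIR="$HOME/models/gemma-2-2b"
export OFFLOAD_DIR="$HOME/offload"
export LLAMA_TOKEN="hf_xxxxxxxxxxxxxxxxxxxx"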
To run the Docker container, use the following command:
docker run --gpus all -p 6000:6000 --name localgemma \
-v "$MODEL_DIR":/app/model/models--google--gemma-2-2b \
-v "$OFFLOAD_DIR":/app/offload \
-e LLAMA_TOKEN="$LLAMA_TOKEN" \
localgemma
The Flask application will be accessible at http://localhost:6000 (matching the -p 6000:6000 mapping above).
/test
(GET): verifies that the LLM responds to a prompt. More endpoints coming soon.
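For example, with the container running and the port mapping above:

curl http://localhost:6000/test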
- Make sure your machine has a compatible GPU and drivers if you want to take advantage of CUDA for model inference.
- CUDA and the toolkit can take some time to set up on each machine; see the quick check after this list.
- Update the model_path in app.py to point to the correct model location, or adjust it to use a model available from Hugging Face's hub.
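A quick way to confirm that Docker can see your GPU before starting the service (this uses NVIDIA's public CUDA base image; any recent tag should work):

docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi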
- Docker: Container platform for building and running applications.
- NVIDIA CUDA Toolkit: Toolkit for developing GPU-accelerated applications.
- Hugging Face: Platform for sharing machine learning models and datasets.
- LangChain: Framework for building applications with LLMs.
- Ollama: A solution for deploying LLMs locally or in cloud environments.
These resources provide further information on the technologies used in this project and can help with expanding the current implementation.