
Onboard LLM: google-gemma-2-2b

This is a local LLM (Large Language Model) service running in a Docker container. It uses the google-gemma-2-2b model and expects CUDA-compatible hardware for best performance. The application loads the model with Hugging Face Transformers, exposes a simple API for generating text from prompts, serves it with Gunicorn, and is packaged for straightforward deployment.
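
Below is a minimal sketch of what app.py might look like. The /test route, the model directory, and the container port come from the sections further down; the device handling, prompt, and response format are assumptions and may differ from the actual implementation.

# app.py -- minimal sketch of the service (assumed layout, not the actual file)
import torch
from flask import Flask, jsonify
from transformers import AutoModelForCausalLM, AutoTokenizer

app = Flask(__name__)

# Local model directory mounted into the container (see the docker run command below).
MODEL_PATH = "/app/model/models--google--gemma-2-2b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",            # place layers on the GPU, offload the rest
    offload_folder="/app/offload",
)

@app.route("/test", methods=["GET"])
def test():
    # Send a fixed prompt through the model to confirm it responds.
    prompt = "Hello, how are you?"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50)
    return jsonify({"response": tokenizer.decode(outputs[0], skip_special_tokens=True)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=6000)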

Project Structure

my_docker_project/
|— model/your-model-name/
|   |— safetensors
|   |— model config
|   |— other files from the trained model
|— offload/  (volume for offloading during inference)
|— Dockerfile
|— app.py
|— requirements.txt
|— README.md
|— .dockerignore
|— .env (not included)

Setup and Running Instructions

Prerequisites

  • Docker installed on your system (see the official Docker installation documentation).
  • NVIDIA drivers and CUDA installed for GPU support. Refer to the NVIDIA CUDA Toolkit Documentation for more information.
  • Two local directories: one for offloading model sections during inference and one for storing the trained model.
  • A .env file with a Hugging Face access token for the required model (see the sketch after this list).
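
The docker run command below passes the token from the .env file into the container as LLAMA_TOKEN. As a sketch of how it might be consumed, the service could authenticate the Hugging Face client before any gated files are fetched; the variable name comes from the run command, but how app.py actually uses it is an assumption.

# Sketch: authenticate with Hugging Face using the token from the .env file.
import os

from huggingface_hub import login

token = os.environ.get("LLAMA_TOKEN")
if token:
    # Registers the token so gated models such as google/gemma-2-2b can be downloaded.
    login(token=token)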

Build the Docker Image

To build the Docker image, run the following command from the root of the project directory:

docker build -t localgemma .

The Dockerfile will try to build using the NVIDIA runtime PyTorch image first. If it fails, it will fall back to a base CUDA image and create the environment from scratch.

Run the Docker Container

To run the Docker container, use the following command:

docker run --gpus all -p 6000:6000 --name localgemma \
  -v "$MODEL_DIR":/app/model/models--google--gemma-2-2b  \
  -v "$OFFLOAD_DIR":/app/offload \
  -e LLAMA_TOKEN="$LLAMA_TOKEN" \
  localgemma

The Flask application will be accessible at http://localhost:6000, matching the port mapping in the command above.

Endpoints

  • /test (GET): verifies that the LLM responds to a prompt (see the example call after this list)
  • more coming soon
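
A quick way to exercise the endpoint once the container is running, assuming the port mapping shown above. The shape of the response body is not specified here, so this sketch simply prints whatever comes back.

# Sketch: call the /test endpoint of the running container.
import requests

resp = requests.get("http://localhost:6000/test", timeout=120)
resp.raise_for_status()
print(resp.text)  # response format depends on app.py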

Notes

  • Make sure your machine has a compatible GPU and drivers if you want to take advantage of CUDA for model inference.
  • CUDA and the toolkit can take some time to set up for each machine.
  • Update the model_path in app.py to point to the correct model location, or adjust it to use a model available from the Hugging Face Hub (see the sketch below).
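
For example, switching between the locally mounted copy and a model pulled from the Hub is a one-line change; the paths and identifier below are illustrative, and the GPU check mirrors the note above about CUDA support.

# Sketch: point model_path at the mounted local copy, or use the Hub id instead.
import torch
from transformers import AutoModelForCausalLM

# model_path = "/app/model/models--google--gemma-2-2b"  # local copy mounted into the container
model_path = "google/gemma-2-2b"                        # downloaded from the Hugging Face Hub

print("CUDA available:", torch.cuda.is_available())     # confirm the GPU is visible first
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")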

Additional Resources

  • Docker: Container platform for building and running applications.
  • NVIDIA CUDA Toolkit: Toolkit for developing GPU-accelerated applications.
  • Hugging Face: Platform for sharing machine learning models and datasets.
  • LangChain: Framework for building applications with LLMs.
  • Ollama: A solution for deploying LLMs locally or in cloud environments.

These resources provide further information on the technologies used in this project and can help with expanding the current implementation.
