
Onboard LLM: google-gemma-2-2b

This is a local LLM (Large Language Model) service running in a Docker container. It uses the google-gemma-2-2b model and expects CUDA-compatible hardware for best performance. The application loads the model with Hugging Face Transformers, exposes a simple API for generating text from prompts, serves it with Gunicorn, and is packaged for straightforward deployment.
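
Below is a minimal sketch of what app.py might look like. The /test route, the model directory, and the container port come from the sections further down; the device handling, prompt, and response format are assumptions and may differ from the actual implementation.

# app.py -- minimal sketch of the service (assumed layout, not the actual file)
import torch
from flask import Flask, jsonify
from transformers import AutoModelForCausalLM, AutoTokenizer

app = Flask(__name__)

# Local model directory mounted into the container (see the docker run command below).
MODEL_PATH = "/app/model/models--google--gemma-2-2b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",            # place layers on the GPU, offload the rest
    offload_folder="/app/offload",
)

@app.route("/test", methods=["GET"])
def test():
    # Send a fixed prompt through the model to confirm it responds.
    prompt = "Hello, how are you?"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50)
    return jsonify({"response": tokenizer.decode(outputs[0], skip_special_tokens=True)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=6000)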

Project Structure

my_docker_project/
|— model/your-model-name/
|   |— safetensors
|   |— model config
|   |— other files from the trained model
|— offload/  (volume for offloading during inference)
|— Dockerfile
|— app.py
|— requirements.txt
|— README.md
|— .dockerignore
|— .env (not included)

Setup and Running Instructions

Prerequisites

  • Docker installed on your system (see the official Docker installation documentation).
  • NVIDIA drivers and CUDA installed for GPU support. Refer to the NVIDIA CUDA Toolkit Documentation for more information.
  • Two local directories: one for offloading model sections during inference and one for storing the trained model.
  • A .env file with a Hugging Face access token for the required model (see the sketch after this list).
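
The docker run command below passes the token from the .env file into the container as LLAMA_TOKEN. As a sketch of how it might be consumed, the service could authenticate the Hugging Face client before any gated files are fetched; the variable name comes from the run command, but how app.py actually uses it is an assumption.

# Sketch: authenticate with Hugging Face using the token from the .env file.
import os

from huggingface_hub import login

token = os.environ.get("LLAMA_TOKEN")
if token:
    # Registers the token so gated models such as google/gemma-2-2b can be downloaded.
    login(token=token)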

Build the Docker Image

To build the Docker image, run the following command from the root of the project directory:

docker build -t localgemma .

The Dockerfile will try to build using the NVIDIA runtime PyTorch image first. If it fails, it will fall back to a base CUDA image and create the environment from scratch.

Run the Docker Container

To run the Docker container, use the following command:

docker run --gpus all -p 6000:6000 --name localgemma \
  -v "$MODEL_DIR":/app/model/models--google--gemma-2-2b  \
  -v "$OFFLOAD_DIR":/app/offload \
  -e LLAMA_TOKEN="$LLAMA_TOKEN" \
  localgemma

The Flask application will be accessible at http://localhost:6000, matching the port mapping in the command above.

Endpoints

  • /test (GET): verifies that the LLM responds to a prompt (see the example call after this list)
  • more coming soon
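
A quick way to exercise the endpoint once the container is running, assuming the port mapping shown above. The shape of the response body is not specified here, so this sketch simply prints whatever comes back.

# Sketch: call the /test endpoint of the running container.
import requests

resp = requests.get("http://localhost:6000/test", timeout=120)
resp.raise_for_status()
print(resp.text)  # response format depends on app.py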

Notes

  • Make sure your machine has a compatible GPU and drivers if you want to take advantage of CUDA for model inference.
  • CUDA and the toolkit can take some time to set up for each machine.
  • Update the model_path in app.py to point to the correct model location, or adjust it to use a model available from the Hugging Face Hub (see the sketch below).
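
For example, switching between the locally mounted copy and a model pulled from the Hub is a one-line change; the paths and identifier below are illustrative, and the GPU check mirrors the note above about CUDA support.

# Sketch: point model_path at the mounted local copy, or use the Hub id instead.
import torch
from transformers import AutoModelForCausalLM

# model_path = "/app/model/models--google--gemma-2-2b"  # local copy mounted into the container
model_path = "google/gemma-2-2b"                        # downloaded from the Hugging Face Hub

print("CUDA available:", torch.cuda.is_available())     # confirm the GPU is visible first
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")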

Additional Resources

  • Docker: Container platform for building and running applications.
  • NVIDIA CUDA Toolkit: Toolkit for developing GPU-accelerated applications.
  • Hugging Face: Platform for sharing machine learning models and datasets.
  • LangChain: Framework for building applications with LLMs.
  • Ollama: A solution for deploying LLMs locally or in cloud environments.

These resources provide further information on the technologies used in this project and can help with expanding the current implementation.
