This repository demonstrates how to use PEFT (Parameter-Efficient Fine-Tuning) and distribution techniques to fine-tune open-source LLMs (Large Language Models) for downstream language tasks.
In this AMP we show you how to implement LLM fine-tuning jobs that make use of the QLoRA and Accelerate implementations available in the PEFT open-source library from Huggingface, along with an example application that swaps the fine-tuned adapters in real time to run inference targeting different tasks.
The fine-tuning examples for 3 different tasks are created as CML Jobs that can be run to reproduce the sample model adapters included in this AMP repo under ./amp_adapters_prebuilt.
NUM_GPU_WORKERS: Configurable project environment variable set up for this AMP. This is the TOTAL number of distributed GPUs that the fine-tuning jobs will make use of at runtime (each individual fine-tuning worker uses a single GPU). If set to 1, fine-tuning happens in a single session without distribution.
In this AMP we show how you can use cutting-edge fine-tuning techniques in CML to efficiently produce adapters fine-tuned for language tasks.
PEFT (Parameter-Efficient Fine-Tuning) refers to a class of fine-tuning techniques which allow for the efficient adaptation of LLMs to downstream tasks by training only a small number of additional model parameters.
Full fine-tuning is effective and has been the default way to apply base LLMs to different tasks, but it is now seen as inefficient due to the ballooning size of LLMs and datasets. PEFT techniques promise time- and cost-efficient fine-tuning pipelines and, in the case of LoRA, adapters that can be merged back into the base model to create a new model or used as portable model adapters at runtime.
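For example, folding a trained LoRA adapter back into the base weights takes only a few lines with the PEFT library. The sketch below is illustrative; the adapter path is a placeholder, not a file shipped with this AMP:

```python
# Illustrative sketch: merge a trained LoRA adapter back into the base model.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b1")
adapted = PeftModel.from_pretrained(base, "amp_adapters_custom/my-adapter")  # placeholder path
merged = adapted.merge_and_unload()          # folds the adapter weights into the base weights
merged.save_pretrained("bloom-1b1-merged")   # now a standalone, full-weight model
```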
One of the PEFT techniques officially supported in the Huggingface library is QLoRA (Quantized Low-Rank Adaptation). This fine-tuning technique is the result of two papers: the original LoRA paper and the follow-up QLoRA paper.
LoRA fine-tuning freezes the original model parameters and trains a small set of new parameters on a dataset. This lowers the memory footprint and training time, and therefore the cost, while still learning very effectively. Furthermore, the adapter matrices are a fraction of the size of the base model weights and can be swapped out at inference time without changing the base model weights. As described in the LoRA paper, the base weight matrix W is frozen during fine-tuning while the low-rank matrices A and B contain the trainable parameters.
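Concretely, the LoRA paper expresses the adapted forward pass as the frozen base projection plus a trainable low-rank update:

```math
h = W_0 x + \Delta W x = W_0 x + B A x, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
```

Only A and B receive gradient updates; W_0 is never modified, which is why the trained adapter is a small, portable artifact that can be stored and swapped independently of the base weights.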
QLoRA further increases efficiency by using a new quantized data type and additional quantization and memory-optimization techniques to drive time and cost down even further.
This allows us to use lower-cost GPUs compared to full-parameter fine-tuning, while still matching the performance of the more intensive and costly full fine-tuning.
All of the libraries required for configuring and launching QLoRA fine-tuning are available from Huggingface (see requirements.txt). Implementation examples from Huggingface can be found here.
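As a rough sketch of what that configuration looks like in code (the hyperparameter values below are illustrative assumptions, not the exact settings used by the AMP's job scripts):

```python
# Illustrative QLoRA setup: a 4-bit quantized base model with LoRA adapters attached.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit NF4, the quantized data type introduced by the QLoRA paper
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b1", quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-1b1")

# Attach small trainable LoRA matrices; the quantized base weights stay frozen
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],  # the attention projection module name in BLOOM
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of the 1.1B base parameters
```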
Using the PEFT and TRL open-source libraries from Huggingface means we also have easy access to accelerate, another Huggingface library which abstracts away the use of multiple GPUs and machines for fine-tuning jobs. As with many other kinds of distributed workloads, this cuts down the time to fine-tune dramatically.
As the name implies, distributed fine-tuning in accelerate means performing training operations on distributed setups. This could be multi-host, with GPUs on different hosts, or multi-GPU, with all GPUs on the same host.
Since TRL is already integrated with accelerate, it takes care of the data distribution and model duplication. See this blog for a breakdown of the abstraction layers Huggingface has implemented that make this convenient for us.
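In practice that means the training loop itself looks like an ordinary single-GPU TRL script. Continuing from the QLoRA sketch above (which defined model, tokenizer, and lora_config), a minimal trainer might look like the following; the dataset column and hyperparameters are assumptions, and the argument names follow the TRL API as pinned around the time this AMP was written:

```python
# Illustrative TRL trainer; accelerate transparently handles device placement
# and gradient synchronization when this script runs across multiple processes.
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

dataset = load_dataset("teknium/GPTeacher-General-Instruct", split="train")

trainer = SFTTrainer(
    model=model,                      # 4-bit base model with LoRA attached (see sketch above)
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=lora_config,
    dataset_text_field="response",    # assumed column; in practice map instruction/input/response into one prompt
    max_seq_length=512,
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=50,
    ),
)
trainer.train()
trainer.model.save_pretrained("amp_adapters_custom/my-adapter")  # saves adapter weights only
```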
CML is able to run accelerate distributed fine-tuning in multi-host mode out of the box using the CML Workers API. Each CML Worker is launched in Kubernetes and treated as a separate host with its own GPU; trl and accelerate do the rest!
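At a high level the pattern looks like the sketch below: a coordinating session launches one CML worker per GPU, and each worker runs the same training script with the environment variables that accelerate/torch.distributed use for rendezvous. The script name is a placeholder and the worker-launch arguments should be checked against the CML Workers API documentation for your CML version; the AMP's actual orchestration lives in distributed_peft_scripts/.

```python
# Simplified sketch of launching distributed fine-tuning workers in CML.
import os
import cdsw  # the CML Workers API is exposed through the cdsw module

NUM_GPU_WORKERS = int(os.environ.get("NUM_GPU_WORKERS", "1"))

# Each worker is treated by accelerate as a separate host. The processes find
# each other through the standard torch.distributed environment variables:
#   MASTER_ADDR / MASTER_PORT - address of the rank-0 process
#   WORLD_SIZE                - total number of processes (one per GPU here)
#   RANK / LOCAL_RANK         - identity of each process
workers = cdsw.launch_workers(
    n=NUM_GPU_WORKERS,
    cpu=2,
    memory=16,
    nvidia_gpu=1,
    script="distributed_peft_scripts/fine_tune_worker.py",  # placeholder script name
)
```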
For each of the following fine-tuning tasks we start with the small LLM bigscience/bloom-1b1. This model was chosen for its tiny size and permissive license for commercial and research uses. The small size of this base model results in very short fine-tuning times and portable adapters that are simple to run for anyone looking to try this AMP (at the cost of poorer general performance).
A larger base model or a base model from another LLM family could also be used with the same techniques shown in the scripts and sample notebook in this repository.
An update to make this easier to do within this AMP is coming soon!
Each included sample adapter is fine-tuned on portions of publicly available datasets that have been mapped to fit the desired inference patterns. While none of the included trained adapters are capable enough for production applications (* see Improving on the Sample Adapters), each demonstrates a clear task-performance improvement over the base model with minimal training time by making use of the PEFT techniques described above.
The use of separate adapters means that the Task Explorer application loads only a single copy of the base model, allowing for on-demand swapping of task fine-tuned adapters.
- Training Time/Cost: (8m19s / $0.82) distributed on 2x P3.2xlarge AWS instances
- Dataset: https://huggingface.co/datasets/teknium/GPTeacher-General-Instruct (mit)
- Contains 89k examples of instruction-input-response text
- Training Time/Cost: (22m11s / $2.21) distributed on 2x P3.2xlarge AWS instances
- Dataset: https://huggingface.co/datasets/philschmid/sql-create-context-copy (cc-by-4.0)
- Contains 80k examples of question + table context to SQL (text-to-SQL) strings
- Training Time/Cost: (8m50s / $0.86) distributed on 2x P3.2xlarge AWS instances
- Dataset: https://huggingface.co/datasets/s-nlp/paradetox (afl-3.0)
- Contains 19k examples of toxic-to-neutral wording conversions in English
To see the training jobs running in CML firsthand, just start any of the fine-tuning jobs in the Jobs tab of your project. These will launch fine-tuning with accelerate distributed across multiple CML workers.
Adapters created via these jobs will appear in the Task Explorer Application after a restart and will be prefixed with a Custom: label.
If you would like to attempt fine-tuning with a different data set or a different data mapping function for the prompt:
- Adapt the Jupyter Notebook Example for a simple fine-tuning script, OR the Implementation Details for launching distributed fine-tuning with accelerate (a sketch of a prompt-mapping function follows this list).
- Ensure your fine-tuned adapter is placed in amp_adapters_custom
- Restart the Task Explorer Application
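For reference, a data mapping function typically just renders each dataset record into the single prompt string the adapter should learn from. The column names and template below are illustrative assumptions, not the AMP's exact prompt format:

```python
# Illustrative prompt mapping for an instruction-style dataset.
# Column names ("instruction", "input", "response") are assumptions; adjust
# them to whichever dataset you load.
from datasets import load_dataset

def to_prompt(example):
    return {
        "text": (
            f"<Instruction>: {example['instruction']}\n"
            f"<Input>: {example['input']}\n"
            f"<Response>: {example['response']}"
        )
    }

dataset = load_dataset("teknium/GPTeacher-General-Instruct", split="train")
dataset = dataset.map(to_prompt)
```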
Note: The Task Explorer Application assumes the use of bloom-1b1 as the base model; a future update will make it simpler to use a different base model for fine-tuning and inference.
The task explorer application in 4_app-task-explorer loads all of the adapters found in amp_adapters_prebuilt and amp_adapters_custom. These are loaded on top of the base model for inference comparison.
Within the application, we take advantage of a key benefit of LoRA adapters: the fine-tuned adapter can be swapped at runtime, making inference on multiple bespoke tasks seamless, all on a single GPU.
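In code, this swap amounts to loading one PeftModel on top of the shared base and registering each adapter directory under a name. The adapter paths and names below are placeholders for whatever the application discovers in amp_adapters_prebuilt and amp_adapters_custom:

```python
# Sketch of runtime adapter swapping over a single shared base model.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b1", device_map="auto")

# The first adapter creates the PeftModel; further adapters are registered by name
model = PeftModel.from_pretrained(base, "amp_adapters_prebuilt/instruct", adapter_name="instruct")
model.load_adapter("amp_adapters_prebuilt/sql", adapter_name="sql")

model.set_adapter("sql")       # switch tasks without reloading the base weights
# ... run generation for the SQL task ...
model.set_adapter("instruct")  # switch back; still only one copy of the base model on the GPU
```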
See detailed implementation descriptions in distributed_peft_scripts/README.md
- CML CPU workloads with resource profiles up to (2 vCPU / 18 GiB Memory) will be provisioned
- A minimum of an NVIDIA V100 with 16 GB VRAM is required (AWS p3.2xlarge)
- 1+ CML GPU workloads with resource profile (2 vCPU / 16 GiB Memory / 1 GPU) will be provisioned
- Fine-tuning Examples (Optional)
- A single GPU will run fine-tuning examples in non-distributed mode only
- Multiple GPUs will be required to run fine-tuning examples distributed across multiple CML sessions.
- Application Inference
- The task explorer application will require 1 GPU to perform inference
- Fine-tuning Examples (Optional)
PBJ Workbench - Python 3.9 - Nvidia GPU - 2023.05
We have included notebook examples that demonstrate the code in the rest of the AMP project in a more digestible manner.
A notebook example is provided to demonstrate what the fine-tuning techniques and libraries look like in a single script.
PBJ Workbench - Python 3.9 - Nvidia GPU - 2023.05
2 vCPU / 16 GiB Memory / 1 GPU
A notebook example is provided to demonstrate what accelerate distributed training (fine-tuning) looks like in CML using the CML Workers API.
JupyterLab - Python 3.9 - Nvidia GPU - 2023.05
2 vCPU / 16 GiB Memory / 1 GPU
- 1 additional GPU for distributed worker