Flover is a temporal fusion framework for accelerating autoregressive model inference. It exploits the temporal property of generative inference tasks (also known as in-flight batching or iteration batching) to deliver finer-grained parallelism than existing solutions. The code is now available; please refer to arXiv for details. Our paper appears in the proceedings of the 30th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 23).
Flover is built on FasterTransformer, so its setup largely follows FasterTransformer's. We provide the option to build the project from source; make sure MPI and NCCL are installed on your system.
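Before configuring, you can quickly verify that the toolchain is in place; the NCCL header path below is an assumption and may differ on your system:

```bash
# Quick sanity checks before building (the NCCL path is an assumption; adjust to your install):
mpirun --version        # MPI runtime should be on PATH
nvcc --version          # CUDA toolkit used by the build
ls /usr/include/nccl.h  # NCCL headers; may instead live under e.g. $CUDA_HOME/include
```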
First, clone this repo:
```bash
git clone https://github.com/YJHMITWEB/Flover.git
cd Flover
mkdir build
cd build
```
The project uses Intel oneTBB, so make sure it is enabled when configuring:
```bash
cmake -DSM={80, your hardware SM architecture} \
      -DCMAKE_BUILD_TYPE=Release \
      -DBUILD_MULTI_GPU={ON, if you want to use tensor parallelism} \
      -DTBB_INCLUDE_DIRS=oneTBB/include \
      -DTBB_LIBRARIES=oneTBB/lib64/libtbb.so \
      -DCMAKE_EXE_LINKER_FLAGS="-ltbb" ..
make -j
```
The build may take about 30 minutes depending on your system.
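As a concrete example, a configuration for an A100 GPU (SM 80) with tensor parallelism enabled might look as follows; the /opt/oneTBB paths are assumptions and should point at your local oneTBB install:

```bash
# Example configure for SM 80 (A100) with multi-GPU support; oneTBB paths are assumptions
cmake -DSM=80 \
      -DCMAKE_BUILD_TYPE=Release \
      -DBUILD_MULTI_GPU=ON \
      -DTBB_INCLUDE_DIRS=/opt/oneTBB/include \
      -DTBB_LIBRARIES=/opt/oneTBB/lib64/libtbb.so \
      -DCMAKE_EXE_LINKER_FLAGS="-ltbb" ..
make -j
```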
Model initialization is driven by a config file; currently, we support GPT-J and LLaMA models. Config files are located in `Flover/src/flover/models/flover/configs`.
```ini
[model_specification]
model_type=llama_13b    ; gptj_6b, llama_13b, llama_33b
data_type=fp16          ; fp32, bf16
max_concurrency=32      ; maximum number of requests fused concurrently; bound it to avoid OOM
per_batch_size=1        ; number of samples in each request
max_seq_len=512         ; maximum generation length of each request
fixed_input_len=8
fixed_prompt_len=0
beam_width=1
tensor_para_size=1
pipeline_para_size=1    ; pipeline parallelism is not supported

[runtime_hyperparameter]
top_k=1                 ; k value for top-k sampling
top_p=0.0               ; p value for top-p sampling
temperature=1.0         ; used for sampling
repetition_penalty=1.0  ; used for sampling
presence_penalty=0.0    ; only one of repetition_penalty and presence_penalty is allowed
len_penalty=0.0
beam_search_diversity_rate=0.0
data_type=fp16
enable_custom_all_reduce=0
interval=1000           ; interval between request arrivals (ms); Poisson arrival patterns can be enabled in the code
use_mem_shuffle=1       ; enable memory shuffle (reordering); assumes requests have different generation lengths

[request]
request_output_len=504  ; with use_mem_shuffle=1, the upper bound of a request's generation length; otherwise, the uniform generation length of all requests
```
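In the sample above, fixed_input_len + request_output_len = 8 + 504 = 512 = max_seq_len. For illustration, a variant config for GPT-J with two-way tensor parallelism might look like the following; all values are illustrative assumptions, not shipped defaults:

```ini
[model_specification]
model_type=gptj_6b
data_type=fp16
max_concurrency=16
per_batch_size=1
max_seq_len=256
fixed_input_len=8
fixed_prompt_len=0
beam_width=1
tensor_para_size=2      ; requires a build with BUILD_MULTI_GPU=ON and two GPUs
pipeline_para_size=1    ; pipeline parallelism is not supported

[request]
request_output_len=248  ; 8 + 248 = 256 = max_seq_len
```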
Then use the following command to run Flover inference:
```bash
cd Flover/build
mpirun -n 1 [node name] bin/test_flover
```
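If you built with BUILD_MULTI_GPU=ON and set tensor_para_size=2, launch one MPI rank per GPU; the host list below is an assumption for a single node exposing two slots:

```bash
# Hypothetical two-GPU launch for tensor_para_size=2 (host list is an assumption)
mpirun -n 2 -H node0:2 bin/test_flover
```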
We currently provide a visualized progress display in the terminal. The following shows the inference progress of 32 requests, each with a random generation length, with a 500 ms interval between requests.
We compare our work against the latest TensorRT-LLM on parallel inference of multiple requests.
Comparison with FT under a constant request time interval
Comparison with FT under random request time intervals
Comparison with FT under random request output lengths (sampled from a uniform distribution)
Here we introduce memory reordering to ensure that the memory of early-finished requests is freed.
Hardware profiling with Nsight
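For reference, a typical Nsight Systems invocation for a single-rank run is shown below; the nsys flags are standard, and the report name is arbitrary:

```bash
# Profile one rank with Nsight Systems; output goes to flover_report.nsys-rep
nsys profile -o flover_report --stats=true mpirun -n 1 bin/test_flover
```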