Modify small-batched weight only quantization #2213

Open. Wants to merge 37 commits into main.
Conversation

dasistwo

I've found that small-batch weight-only GEMV suffers from global memory load stalls in some inefficient cases.

In these cases, this PR uses shared memory to

  1. Remove duplicate memory requests when loading the scale factors/zero points
  2. Use memcpy_async to build double buffering between the load and the MMA, which reduces global memory stalls

It had little or no effect on small GEMVs, but had some effect on larger GEMVs (a sketch of the double-buffering pattern follows the results table below).

Below is the percentage reduction in GEMV computation time, summed over the five GEMV kernel types in the first decoding stage.
Tested on a single A100 40GB and a single H100 80GB, with batch size 4 and context length <= 512.

| Model       | A100   | H100  |
| ----------- | ------ | ----- |
| Gemma 1-7B  | 7.93%  | 3.83% |
| Llama 2-7B  | 0.84%  | 7.52% |
| Llama 2-13B | -0.09% | 6.05% |
| Llama 3-8B  | 1.62%  | 9.77% |
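
For readers of the thread, here is a minimal sketch of the double-buffering pattern described above, assuming a packed int4 weight layout, per-group fp16 scales, a fixed zero point of 8, and K a multiple of the tile size. The kernel name, tile sizes, and layouts are illustrative only and not the actual TensorRT-LLM weight-only GEMV kernel; it stages the weight tile for iteration t+1 into shared memory with cp.async (`__pipeline_memcpy_async`) while the dequantize + FMA work consumes the tile for iteration t.

```cuda
// Hedged sketch only: not the TRT-LLM kernel. One output row per block.
// Assumes: K % TILE_K == 0, group divides TILE_K, Wq is 16-byte aligned.
#include <cstdint>
#include <cuda_fp16.h>
#include <cuda_pipeline.h>

constexpr int THREADS = 128;
constexpr int TILE_K  = 512;          // K elements staged per pipeline stage
constexpr int TILE_B  = TILE_K / 2;   // packed bytes per stage (two int4 per byte)

__global__ void w4a16_gemv_double_buffer(const uint8_t* __restrict__ Wq,    // [rows, K/2] packed int4
                                         const half*    __restrict__ scale, // [rows, K/group]
                                         const half*    __restrict__ x,     // [K]
                                         half*          __restrict__ y,     // [rows]
                                         int K, int group)
{
    __shared__ alignas(16) uint8_t smem[2][TILE_B];   // two pipeline stages
    __shared__ float partial[THREADS];

    const int row = blockIdx.x;
    const uint8_t* w_row = Wq + (size_t)row * (K / 2);
    const int num_tiles = K / TILE_K;

    // Prologue: issue the async copy of tile 0 into stage 0 (16-byte chunks).
    for (int c = threadIdx.x * 16; c < TILE_B; c += THREADS * 16)
        __pipeline_memcpy_async(&smem[0][c], &w_row[c], 16);
    __pipeline_commit();

    float acc = 0.f;
    for (int t = 0; t < num_tiles; ++t) {
        const int cur = t & 1, nxt = cur ^ 1;

        // Prefetch tile t+1 into the other stage while tile t is consumed below.
        if (t + 1 < num_tiles) {
            const uint8_t* src = w_row + (t + 1) * TILE_B;
            for (int c = threadIdx.x * 16; c < TILE_B; c += THREADS * 16)
                __pipeline_memcpy_async(&smem[nxt][c], &src[c], 16);
        }
        __pipeline_commit();

        // Wait for the copy feeding the *current* stage; the newest copy stays in flight.
        __pipeline_wait_prior(1);
        __syncthreads();

        // Dequantize + FMA on tile t. A real kernel would also stage the
        // per-group scales/zero points in shared memory (point 1 of the PR).
        for (int i = threadIdx.x; i < TILE_B; i += THREADS) {
            const int k = t * TILE_K + 2 * i;
            const float s = __half2float(scale[(size_t)row * (K / group) + k / group]);
            const uint8_t p = smem[cur][i];
            acc += (float(p & 0xF) - 8.f) * s * __half2float(x[k]);
            acc += (float(p >> 4)  - 8.f) * s * __half2float(x[k + 1]);
        }
        __syncthreads();   // everyone is done reading stage `cur` before it is reused
    }

    // Naive block reduction for brevity; a real kernel would use warp shuffles.
    partial[threadIdx.x] = acc;
    __syncthreads();
    if (threadIdx.x == 0) {
        float sum = 0.f;
        for (int i = 0; i < THREADS; ++i) sum += partial[i];
        y[row] = __float2half(sum);
    }
}
```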

dasistwo and others added 30 commits April 11, 2024 06:24
Update MLP branch with upstream
* Update TensorRT-LLM

---------

Co-authored-by: Shixiaowei02 <[email protected]>
Found some error cases in unit tests with small load chunks.
@lfr-0531 added the `triaged` (Issue has been triaged by maintainers) and `Low Precision` (lower-bit quantization, including int8, int4, fp8) labels on Sep 24, 2024
@MARD1NO commented Nov 13, 2024

Great work! I wonder what quantization type your GEMV benchmark uses: channel-wise or group-wise, and 4-bit or 8-bit?

@dasistwo (Author)

These are the results from W4A16_AWQ, so the config is 4-bit with group size 128. I've also checked channel-wise quantization, but the performance gain was slightly smaller than with the group-wise one.
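
For context on the channel-wise vs. group-wise comparison, here is a small hedged sketch of how the two scale layouts differ for a 4-bit weight; the names, packed layout, and symmetric zero point of 8 are assumptions for illustration, not TRT-LLM's actual code. Channel-wise is simply the degenerate case where the group spans all of K.

```cuda
// Illustrative only: unpack one int4 weight (two per byte) and dequantize it.
// group_size = 128 gives the group-wise (AWQ-style) layout benchmarked above;
// group_size = K collapses to one scale per output channel (channel-wise).
#include <cstdint>
#include <vector>

inline float dequant_w4(const std::vector<uint8_t>& packed,   // [rows * K/2]
                        const std::vector<float>&   scale,    // [rows * K/group_size]
                        int row, int k, int K, int group_size)
{
    const uint8_t byte = packed[(size_t)row * (K / 2) + k / 2];
    const int q = (k & 1) ? (byte >> 4) : (byte & 0xF);                    // 0..15
    const float s = scale[(size_t)row * (K / group_size) + k / group_size];
    return (q - 8) * s;                                                    // symmetric, zero point = 8
}
```

With group size 128, the same scale is re-requested for 128 consecutive k values, which is the duplicated scale/zero-point traffic the PR's shared-memory staging aims to remove.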

@Void1024

Thank you for your excellent work. I am the author of the batched GEMV kernel in TRT-LLM. My colleagues and I have reviewed and benchmarked your modifications in this PR. We had previously tried a similar approach, but it didn't yield significant benefits at that time.

We validated the kernel latency with your modifications for different shapes on the H100, but found a performance regression for some shapes. Since we have other optimization work in progress for this part of the code, we are unable to merge your changes at this time.

Could you please provide benchmark data comparing the kernel latency before and after your changes for different shapes (for example, m=1, 2, 3, 4 and n/k=2048, 4096, 8192, 12288, 16384) under the GPTQ/AWQ case on both A100 and H100?
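
For concreteness, here is a minimal sketch of the kind of sweep being requested, timing a kernel launch with CUDA events over the m and n/k grid above; `launch_gemv` is a hypothetical placeholder for the weight-only kernel entry point before/after this PR, not a TensorRT-LLM API.

```cuda
// Hedged harness sketch only. The commented-out launch_gemv(...) call is a
// placeholder for whatever entry point launches the kernel under test.
#include <cstdio>
#include <functional>
#include <cuda_runtime.h>

static float time_ms(const std::function<void()>& launch, int iters = 100) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    for (int i = 0; i < 10; ++i) launch();   // warm-up
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;
}

int main() {
    const int batch_m[]  = {1, 2, 3, 4};
    const int nk_sizes[] = {2048, 4096, 8192, 12288, 16384};
    for (int m : batch_m) {
        for (int nk : nk_sizes) {
            auto launch = [&] { /* launch_gemv(m, nk, nk); */ };
            std::printf("m=%d n=k=%d : %.4f ms\n", m, nk, time_ms(launch));
        }
    }
    return 0;
}
```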

@MARD1NO commented Nov 21, 2024

Hi author, what do you think about the idea of using async copy in GEMV? GEMV is a memory-bound operation; will async copy really boost its performance?
In my experiments (I wrote a low-bit CUDA-core GEMV with async copy, not this weight-only GEMV), the async-copy version performs better only for some larger MNK cases, and only on certain devices.

@Void1024

Yes, in my previous experiments I came to a similar conclusion. If the tile MNK is not large enough, there may not be enough computation and shared-memory loads (LDS) to hide the latency of copy_async. Furthermore, in GEMV cases with small batch sizes, the data often fits within the registers.
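
To make that last point concrete, here is a hedged sketch of the register-resident alternative for very small batches: each thread pulls a 16-byte chunk of packed int4 weights with a plain vectorized load and keeps it in registers, with no shared-memory staging at all. Names, layouts, the zero point of 8, and the alignment assumptions are illustrative, not the TRT-LLM kernel.

```cuda
// Hedged sketch only. Assumes K is a multiple of 32, Wq is 16-byte aligned,
// and blockDim.x <= 256. One output row per block (GEMV, m = 1).
#include <cstdint>
#include <cuda_fp16.h>

__global__ void w4a16_gemv_registers(const uint8_t* __restrict__ Wq,    // [rows, K/2] packed int4
                                     const half*    __restrict__ scale, // [rows, K/group]
                                     const half*    __restrict__ x,     // [K]
                                     half*          __restrict__ y,     // [rows]
                                     int K, int group)
{
    const int row = blockIdx.x;
    float acc = 0.f;

    // Each thread strides over K in 16-byte chunks (32 int4 values) held in a uint4 register.
    for (int c = threadIdx.x * 16; c < K / 2; c += blockDim.x * 16) {
        const uint4 chunk = *reinterpret_cast<const uint4*>(Wq + (size_t)row * (K / 2) + c);
        const uint8_t* b = reinterpret_cast<const uint8_t*>(&chunk);
        for (int i = 0; i < 16; ++i) {
            const int k = 2 * (c + i);
            const float s = __half2float(scale[(size_t)row * (K / group) + k / group]);
            acc += (float(b[i] & 0xF) - 8.f) * s * __half2float(x[k]);
            acc += (float(b[i] >> 4)  - 8.f) * s * __half2float(x[k + 1]);
        }
    }

    // Naive block reduction, as in the earlier sketch.
    __shared__ float partial[256];
    partial[threadIdx.x] = acc;
    __syncthreads();
    if (threadIdx.x == 0) {
        float sum = 0.f;
        for (int i = 0; i < blockDim.x; ++i) sum += partial[i];
        y[row] = __float2half(sum);
    }
}
```

Whether the cp.async version wins then comes down to whether the compute per staged tile is enough to cover the copy latency, which matches the tile-size observation above.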
