Improve inference speed of multi-query attention model #3
Hello @harm-devries, one central point missing from the experiments done above is that the benefits of multi-query attention only become apparent at larger batch sizes. It would be worthwhile to try experiments using larger batch sizes.
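To make the batch-size argument concrete, here is a rough back-of-the-envelope sketch (not from the issue; the model dimensions below are illustrative assumptions, not the benchmarked model). Per decoding step the GPU roughly has to read the weights once plus the whole key/value cache for the batch; multi-query attention only shrinks the cache term, which is negligible at batch_size=1 but dominant at batch_size=1024:

```python
# Rough memory-traffic estimate per decoding step (illustrative numbers only).
BYTES = 2            # fp16
N_LAYERS = 24
N_HEADS = 16
HEAD_DIM = 64
D_MODEL = N_HEADS * HEAD_DIM

def weight_bytes():
    # ~12 * d_model^2 parameters per transformer layer (attention + MLP), a common estimate
    return 12 * N_LAYERS * D_MODEL * D_MODEL * BYTES

def kv_cache_bytes(batch, seq_len, n_kv_heads):
    # K and V caches together: 2 * batch * seq * layers * kv_heads * head_dim elements
    return 2 * batch * seq_len * N_LAYERS * n_kv_heads * HEAD_DIM * BYTES

for batch in (1, 8, 1024):
    w = weight_bytes()
    mha = w + kv_cache_bytes(batch, 1024, N_HEADS)  # multi-head: one K/V per head
    mqa = w + kv_cache_bytes(batch, 1024, 1)        # multi-query: one shared K/V head
    print(f"batch={batch:5d}  est. traffic/step: MHA={mha/2**30:6.2f} GiB  "
          f"MQA={mqa/2**30:6.2f} GiB  upper-bound speed-up={mha/mqa:4.1f}x")
```

Under these assumptions the potential speed-up is close to 1x at batch_size=1 and grows by more than an order of magnitude at batch_size=1024, which is the trend the experiments below are meant to confirm.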
To confirm my hypothesis, I quickly ran the experiments below on an A100 (80GB) GPU. In the first experiment, with batch_size=1024 and seq_length=128, you can see that MultiQuery1 is 6.3X faster (38/6) than Multihead.
python profile_hf_generate.py
/home/sourab/bigcode/transformers/src/transformers/__init__.py
NVIDIA A100-SXM4-80GB
-------------------- attention_type == AttentionType.MULTI_QUERY---------------------
{'get_test_batch': 2.193450927734375e-05, 'generate_text_batch': 10.881535291671753, 'input_batch_size': 1024, 'input_batch_length': 16, 'max_gen_length': 128, 'num_beams': 1, 'do_sample': False, 'pad_token_id': 50256, 'dtype': torch.int64, 'device': device(type='cuda'), 'cuda_device_name': 'NVIDIA A100-SXM4-80GB'}
-------------------- attention_type == AttentionType.MULTI_QUERY---------------------
{'get_test_batch': 2.193450927734375e-05, 'generate_text_batch': 10.306073904037476, 'input_batch_size': 1024, 'input_batch_length': 16, 'max_gen_length': 128, 'num_beams': 1, 'do_sample': False, 'pad_token_id': 50256, 'dtype': torch.int64, 'device': device(type='cuda'), 'cuda_device_name': 'NVIDIA A100-SXM4-80GB'}
-------------------- attention_type == AttentionType.MULTI_QUERY_1---------------------
{'get_test_batch': 2.09808349609375e-05, 'generate_text_batch': 6.453148603439331, 'input_batch_size': 1024, 'input_batch_length': 16, 'max_gen_length': 128, 'num_beams': 1, 'do_sample': False, 'pad_token_id': 50256, 'dtype': torch.int64, 'device': device(type='cuda'), 'cuda_device_name': 'NVIDIA A100-SXM4-80GB'}
-------------------- attention_type == AttentionType.MULTI_HEAD---------------------
{'get_test_batch': 2.288818359375e-05, 'generate_text_batch': 38.42392134666443, 'input_batch_size': 1024, 'input_batch_length': 16, 'max_gen_length': 128, 'num_beams': 1, 'do_sample': False, 'pad_token_id': 50256, 'dtype': torch.int64, 'device': device(type='cuda'), 'cuda_device_name': 'NVIDIA A100-SXM4-80GB'}
In the next experiment, with batch_size=1024 and seq_length=1024, Multi-head OOMs, whereas both MultiQuery variants run, with MultiQuery1 being 4.32X faster (340/80) than MultiQuery; both can handle large batches and sequences.
So, multi-query attention doesn't help reduce latency in adaptive inference, where we generate for a single prompt (batch_size=1, e.g., the model hub Inference API) or for small batch sizes. I hope this helps.

Appendix: replicating the main experiment results with batch_size=8 and seq_length=1024, for reference, to rule out the GPU as the cause of the behaviour above.
python profile_hf_generate.py
/home/sourab/bigcode/transformers/src/transformers/__init__.py
NVIDIA A100-SXM4-80GB
-------------------- attention_type == AttentionType.MULTI_QUERY---------------------
{'get_test_batch': 2.1696090698242188e-05, 'generate_text_batch': 18.797884225845337, 'input_batch_size': 8, 'input_batch_length': 16, 'max_gen_length': 1024, 'num_beams': 1, 'do_sample': False, 'pad_token_id': 50256, 'dtype': torch.int64, 'device': device(type='cuda'), 'cuda_device_name': 'NVIDIA A100-SXM4-80GB'}
-------------------- attention_type == AttentionType.MULTI_QUERY---------------------
{'get_test_batch': 2.193450927734375e-05, 'generate_text_batch': 18.270429134368896, 'input_batch_size': 8, 'input_batch_length': 16, 'max_gen_length': 1024, 'num_beams': 1, 'do_sample': False, 'pad_token_id': 50256, 'dtype': torch.int64, 'device': device(type='cuda'), 'cuda_device_name': 'NVIDIA A100-SXM4-80GB'}
-------------------- attention_type == AttentionType.MULTI_QUERY_1---------------------
{'get_test_batch': 2.288818359375e-05, 'generate_text_batch': 16.58125400543213, 'input_batch_size': 8, 'input_batch_length': 16, 'max_gen_length': 1024, 'num_beams': 1, 'do_sample': False, 'pad_token_id': 50256, 'dtype': torch.int64, 'device': device(type='cuda'), 'cuda_device_name': 'NVIDIA A100-SXM4-80GB'}
-------------------- attention_type == AttentionType.MULTI_HEAD---------------------
{'get_test_batch': 2.2172927856445312e-05, 'generate_text_batch': 19.13312315940857, 'input_batch_size': 8, 'input_batch_length': 16, 'max_gen_length': 1024, 'num_beams': 1, 'do_sample': False, 'pad_token_id': 50256, 'dtype': torch.int64, 'device': device(type='cuda'), 'cuda_device_name': 'NVIDIA A100-SXM4-80GB'}
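For context on what the generate_text_batch numbers above measure, here is a minimal, simplified stand-in for the kind of timing profile_hf_generate.py performs (the actual script is not reproduced here; the model name and exact generate() arguments below are assumptions based on the settings printed in the output):

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the issue benchmarks custom attention variants
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()

batch_size, input_len, max_gen_length = 1024, 16, 128
# Dummy prompts of 16 tokens each, mirroring input_batch_length=16 above.
input_ids = torch.randint(0, tokenizer.vocab_size, (batch_size, input_len), device=device)

if device == "cuda":
    torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    model.generate(
        input_ids,
        max_length=max_gen_length,   # greedy decoding, as in the runs above
        num_beams=1,
        do_sample=False,
        pad_token_id=50256,
    )
if device == "cuda":
    torch.cuda.synchronize()
print({"generate_text_batch": time.time() - start,
       "input_batch_size": batch_size,
       "max_gen_length": max_gen_length})
```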
The multi-query attention paper reports up to 10x speed-ups compared to incremental decoding with a multi-head attention model. We've implemented multi-query attention but only observed up to 25% speed-ups when it's fully integrated into the Transformers model. We did observe up to 2x speed-ups for a simplified version of the attention layer (without softmax and layer normalization). See more details here.
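For readers unfamiliar with the layer in question, here is a hedged sketch (not the repository's actual implementation; the output projection is omitted) of one incremental decoding step with multi-query attention, where all query heads share a single key/value head and the cache therefore has no per-head dimension:

```python
import torch

def multi_query_step(x, w_q, w_kv, k_cache, v_cache, n_heads):
    """One decoding step. x: (batch, d_model) hidden state of the new token."""
    batch, d_model = x.shape
    head_dim = d_model // n_heads

    q = (x @ w_q).view(batch, n_heads, head_dim)        # per-head queries
    k_new, v_new = (x @ w_kv).split(head_dim, dim=-1)   # single shared K/V head

    # Cache shape is (batch, seq, head_dim) instead of (batch, heads, seq, head_dim).
    k_cache = torch.cat([k_cache, k_new[:, None, :]], dim=1)
    v_cache = torch.cat([v_cache, v_new[:, None, :]], dim=1)

    # Every query head attends over the same keys and values.
    scores = torch.einsum("bhd,bsd->bhs", q, k_cache) / head_dim ** 0.5
    attn = torch.softmax(scores, dim=-1)
    out = torch.einsum("bhs,bsd->bhd", attn, v_cache).reshape(batch, d_model)
    return out, k_cache, v_cache

# Tiny usage example with made-up sizes:
batch, d_model, n_heads = 2, 64, 8
head_dim = d_model // n_heads
x = torch.randn(batch, d_model)
w_q = torch.randn(d_model, d_model)
w_kv = torch.randn(d_model, 2 * head_dim)
k_cache = torch.zeros(batch, 0, head_dim)
v_cache = torch.zeros(batch, 0, head_dim)
out, k_cache, v_cache = multi_query_step(x, w_q, w_kv, k_cache, v_cache, n_heads)
print(out.shape, k_cache.shape)  # torch.Size([2, 64]) torch.Size([2, 1, 8])
```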
Further inference gains are likely possible but require further investigation. For example, we would like to benchmark the difference in a more optimized inference environment like DeepSpeed-Inference. We are also happy to discuss other solutions and directions in the #wg-inference channel.