State of affairs for NestedTensor (NJT) inference? #4234

Open
vadimkantorov opened this issue Nov 2, 2024 · 5 comments
Labels
triaged Issue has been triaged by maintainers

Comments

@vadimkantorov

PyTorch now has some support for representing varlen sequences (NestedTensor / NJT), and it is supported to some extent by HF.

This is useful, e.g., for saving compute on padding tokens during BERT inference. Does TRT have kernels for such NJT SDPA ops, and can they be executed via CUDA graphs? If so, how can one benefit from them? Is there an example?
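For context, a minimal sketch of what this looks like on the PyTorch side (assuming a recent PyTorch with the jagged layout and a CUDA device for the flash path; names and shapes are illustrative, nothing TRT-specific):

```python
import torch
import torch.nn.functional as F

# Two sequences of different lengths batched without padding (jagged layout).
n_heads, head_dim = 8, 16
seqs = [torch.randn(s, n_heads, head_dim, device="cuda", dtype=torch.float16)
        for s in (5, 3)]
q = torch.nested.nested_tensor(seqs, layout=torch.jagged)  # (B, S*, H, D)

# SDPA expects (B, H, S*, D); on CUDA this can dispatch to a varlen
# flash-attention kernel without materializing any padding.
q = q.transpose(1, 2)
out = F.scaled_dot_product_attention(q, q, q)  # shared q/k/v for brevity
```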

Thank you!

@vadimkantorov vadimkantorov changed the title State of affairs for NestedTensor (NJT) inference State of affairs for NestedTensor (NJT) inference? Nov 2, 2024
@lix19937

lix19937 commented Nov 5, 2024

In my opinion, TRT has no such op (NJT); you can write a custom one.

If you want to use CUDA graphs, you need to set the max sequence length (use fixed addresses for the sequence buffers to build the static graph) and set the min, opt, and max shapes for this input.
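For reference, setting min/opt/max shapes for a dynamic sequence-length input looks roughly like this with the TensorRT Python API (a minimal sketch; the input name "input_ids" and the shape ranges are made up):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()

# Dynamic batch and sequence dimensions on the input.
network.add_input("input_ids", trt.int32, (-1, -1))

# One optimization profile covering batch 1..8 and sequence length 1..512.
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", min=(1, 1), opt=(4, 128), max=(8, 512))
config.add_optimization_profile(profile)
```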

@vadimkantorov
Author

vadimkantorov commented Nov 5, 2024

A key component of NJT support for SDPA is block-diagonal attention masks. Does TRT have support/examples for block-diagonal attn masks?

One would want proper FlashAttention kernels in this setup; otherwise the speedups are unlikely to be realized.
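To illustrate, a block-diagonal mask over sequences packed into one long row can be built like this (a naive dense sketch; a real varlen kernel would take cumulative sequence lengths instead of materializing the mask):

```python
import torch

def block_diag_attn_mask(seq_lens):
    """Boolean mask that only allows attention within each packed sequence."""
    total = sum(seq_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    offset = 0
    for n in seq_lens:
        mask[offset:offset + n, offset:offset + n] = True
        offset += n
    return mask

# Three sequences of lengths 3, 5 and 2 packed into one length-10 row.
mask = block_diag_attn_mask([3, 5, 2])  # usable as attn_mask in SDPA
```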

@poweiw
Collaborator

poweiw commented Nov 5, 2024

@zhenhuaw-me Can you take a look?

@poweiw poweiw added the triaged Issue has been triaged by maintainers label Nov 5, 2024
@vadimkantorov
Author

The relevant Triton Inference Server documentation on ragged batch support: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/ragged_batching.html

So it would be good to have end-to-end examples of optimized attention modules for transformer inference on such varlen sequences in TRT, starting from a PyTorch model, then exporting and configuring the TRT engine file, with trex visualizations.

@vadimkantorov
Author

vadimkantorov commented Nov 8, 2024

These kernels appear to be available in the older FasterTransformer (https://github.com/NVIDIA/FasterTransformer/blob/main/docs/bert_guide.md#model-architecture) or in https://github.com/bytedance/effective_transformer.

It would be good to upstream these "EffectiveTransformer kernels with TensorRT" given that FasterTransformer has reached end of life.
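For context, the core trick in those kernels is removing padding tokens before the transformer layers and restoring them afterwards; a rough PyTorch sketch of the idea (illustrative only, not the FasterTransformer API):

```python
import torch

def remove_padding(x, attn_mask):
    """x: (batch, max_len, hidden); attn_mask: (batch, max_len) bool.
    Returns packed non-padding tokens (total_tokens, hidden) and their flat indices."""
    idx = attn_mask.flatten().nonzero(as_tuple=True)[0]
    return x.reshape(-1, x.size(-1))[idx], idx

def restore_padding(packed, idx, batch, max_len):
    """Scatter packed tokens back into a zero-padded (batch, max_len, hidden) tensor."""
    out = packed.new_zeros(batch * max_len, packed.size(-1))
    out[idx] = packed
    return out.reshape(batch, max_len, -1)

# Example: batch of 2, max_len 4, with 3 and 2 valid tokens respectively.
x = torch.randn(2, 4, 8)
attn_mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]], dtype=torch.bool)
packed, idx = remove_padding(x, attn_mask)    # shape (5, 8)
restored = restore_padding(packed, idx, 2, 4)  # back to (2, 4, 8)
```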
