Replies: 1 comment 2 replies
-
FT does not have a converter for bringing a TF checkpoint into Triton; FT only provides the converter for the HF BERT checkpoint at the moment. FT's BERT also does not contain some of the optimizations that TensorRT has, such as the GEMM+GELU fusion, so there is a small performance gap between FT and TRT on the BERT model. For BERT inference, you can also ask how to run TRT BERT on Triton directly, which should be the suggested solution for the BERT model.
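
Since FT only ships an HF BERT converter, one possible path for a TensorFlow checkpoint is to first convert it into a Hugging Face (PyTorch) checkpoint and then feed that to FT's HF converter. Below is a minimal sketch using the transformers library; the paths are placeholders and this is not an official FasterTransformer workflow:

```python
# Minimal sketch: original TensorFlow BERT checkpoint -> Hugging Face layout.
# All paths below are placeholders, not taken from this discussion.
# Requires both torch and tensorflow installed (TF is used to read the checkpoint).
from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

tf_checkpoint = "path/to/model.ckpt"            # TF checkpoint prefix (placeholder)
bert_config_file = "path/to/bert_config.json"   # config shipped with the checkpoint

config = BertConfig.from_json_file(bert_config_file)
model = BertForPreTraining(config)

# Copies the TF variables into the PyTorch module in place.
load_tf_weights_in_bert(model, config, tf_checkpoint)

# Writes pytorch_model.bin + config.json in the usual HF directory layout.
model.save_pretrained("path/to/hf_bert")
```

The resulting directory would then be the input to FT's Hugging Face BERT conversion script; the exact script name and location in the FasterTransformer repo are not covered in this thread.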
-
Hi, we have a BERT model fine-tuned with TensorFlow and saved as a TensorFlow checkpoint file. Is there any example or instructions on how to run this BERT model with the FT backend on the latest Triton Inference Server?
Also, does the FT backend include all the optimizations described in https://developer.nvidia.com/blog/nlu-with-tensorrt-bert/ when used to serve this BERT model? We followed the examples in https://github.com/NVIDIA/TensorRT/tree/release/5.1/demo/BERT/python on Triton Inference Server 20.09 and were able to reproduce the optimized latency results. If we upgrade to a newer Triton server version with the FasterTransformer backend, should we expect similar (or even better) latency?
Thanks!
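
Regardless of whether the model ends up being served by the FT backend or as a TensorRT engine, querying it through Triton looks the same from the client side. Here is an illustrative sketch; the model name and tensor names ("bert", "input_ids", "segment_ids", "input_mask", "logits") are hypothetical and depend on how the model is actually configured in the repository:

```python
# Illustrative Triton HTTP client call with dummy inputs.
# "bert" and the tensor names below are hypothetical, not from this thread.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

seq_len = 128
inputs = []
for name in ("input_ids", "segment_ids", "input_mask"):
    t = httpclient.InferInput(name, [1, seq_len], "INT32")
    t.set_data_from_numpy(np.zeros((1, seq_len), dtype=np.int32))
    inputs.append(t)

result = client.infer(
    model_name="bert",
    inputs=inputs,
    outputs=[httpclient.InferRequestedOutput("logits")],
)
print(result.as_numpy("logits").shape)
```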