.github/workflows
.vscode
anyscale
bentoml
candle-vllm
common
ctranslate
exllama
hf-endpoint
hf
mistralrs
mlc
openai
powerinfer
sagemaker
tgi
tools/analytics
triton-tensorRT-quantized-awq-batch
triton-tensorRT-quantized-awq
triton-tensorRT-quantized
triton-tensorRT
triton-vllm-awq-8bit
triton-vllm-awq
triton-vllm
trt-bench
vllm-2
vllm
.flake8
.gitignore
.isort.cfg
.ruff.toml
README.md
Summary.ipynb
benchmark.ipynb


llama inference

Exploration of latency on various setups of inference with llama.

I didn't explore throughput. That is a deep rabbit hole; I was only exploring latency for a single request. You can trade off throughput against latency with various forms of request batching.
I tried my best to use tools based on the documentation provided.
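
To make "latency for a single request" concrete, here is a minimal sketch of how such a measurement could be taken. It is not the code used in this repo: `measure_latency` and `fake_generate` are hypothetical names, and `fake_generate` stands in for whatever call sends one prompt to a given backend. Requests are issued one at a time, with no batching or concurrency, so the numbers reflect latency only.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(
    request_fn: Callable[[], object], runs: int = 5, warmup: int = 1
) -> Dict[str, float]:
    """Time sequential single requests and return summary stats in seconds."""
    # Warm-up calls are not timed, so one-off costs (model load, caches)
    # do not skew the samples.
    for _ in range(warmup):
        request_fn()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        request_fn()  # exactly one request in flight at a time
        samples.append(time.perf_counter() - start)

    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def fake_generate() -> str:
    # Hypothetical stand-in for a real inference call (assumption).
    time.sleep(0.01)
    return "hello"


if __name__ == "__main__":
    print(measure_latency(fake_generate, runs=3))
```

Reporting the median alongside min and max keeps a single slow outlier from dominating the result, which matters when only a handful of requests are timed.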