Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix MoE details #298

Merged
merged 2 commits into from
Oct 29, 2024
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 4 additions & 4 deletions inference_rules.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -256,8 +256,8 @@ The Datacenter suite includes the following benchmarks:
|Speech |Speech-to-text |RNNT |Librispeech dev-clean (samples < 15 seconds) | 2513 | 99% of FP32 (1 - WER, where WER=7.452253714852645%) | 1000 ms
|Language |Language processing |BERT |SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 and 99.9% of FP32 (f1_score=90.874%) | 130 ms
|Language |Summarization |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=4016878)| 20 s
|Language |Question Answering |Llama2 |OpenOrca (GPT-4 split, max_seq_len=1024) | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45)| TTFT/TPOTfootnote:[For Llama2, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms
|Language |Text Generation (Question Answering, Math and Code Generation) |Mixtral-8x7B |OpenOrca (5k samples of the GPT-4 split, max_seq_len=2048), GSM8K (5k samples of the validation split, max_seq_len=2048), MBXP (5k samples of the validation split, max_seq_len=2048) | 15000 | 99% of FP32 and 99.9% of FP32 (rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12). Additionally, for both cases the tokens per sample should be between than 90% and 110% of the reference (tokens_per_sample=294.45)| TTFT/TPOTfootnote:[For Mixtral-8x7B, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms
|Language |Question Answering |Llama2 |OpenOrca (max_seq_len=1024) | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45)| TTFT/TPOTfootnote:[For Llama2, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms
|Language |Text Generation (Question Answering, Math and Code Generation) |Mixtral-8x7B |OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048) | 15000 | 99% of FP16 ((OpenOrca)rouge1=45.4911, (OpenOrca)rouge2=23.2829, (OpenOrca)rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12). Additionally, for both cases the tokens per sample should be between than 90% and 110% of the reference (tokens_per_sample=294.45)| TTFT/TPOTfootnote:[For Mixtral-8x7B, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms
|Commerce |Recommendation |DLRMv2 |Synthetic Multihot Criteo Dataset | 204800 |99% of FP32 and 99.9% of FP32 (AUC=80.31%) | 60 ms
|Generative |Text to image |SDXL |Subset of coco-2014 val | 5000 |FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s
|===
Expand Down Expand Up @@ -1036,8 +1036,8 @@ Datacenter systems must provide at least the following bandwidths from the netwo
|Speech |RNNT |Librispeech dev-clean (samples < 15 seconds) | __max_audio_duration*num_samples_per_sec*(bits_per_sample/8)__ | __15*16000*(16/8)__ | __throughput*480000__
|Language |BERT |SQuAD v1.1 (max_seq_len=384) | __num_inputs*max_seq_len*dtype_size__ | __3*384*dtype_size__ | __throughput*1152*dtype_size__
|Language |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | __num_inputs*max_seq_len*dtype_size__ | __2048*dtype_size__ | __throughput*2048*dtype_size__
|Language |Llama2 |OpenOrca (GPT-4 split, max_seq_len=1024) | __num_inputs*max_seq_len*dtype_size__ | __1024*dtype_size__ | __throughput*1024*dtype_size__
|Language |Mixtral-8x7B |OpenOrca (5k samples of the GPT-4 split, max_seq_len=2048), GSM8K (5k samples of the validation split, max_seq_len=2048), MBXP (5k samples of the validation split, max_seq_len=2048) | __num_inputs*max_seq_len*dtype_size__ | __2048*dtype_size__ | __throughput*2048*dtype_size__
|Language |Llama2 |OpenOrca (max_seq_len=1024) | __num_inputs*max_seq_len*dtype_size__ | __1024*dtype_size__ | __throughput*1024*dtype_size__
|Language |Mixtral-8x7B |OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048) | __num_inputs*max_seq_len*dtype_size__ | __2048*dtype_size__ | __throughput*2048*dtype_size__
|Commerce |DLRMv2 | 1TB Click Logs |__avg(num_pairs_per_sample)*(num_numerical_inputs*dtype_size~1~ +num_categorical_inputs*dtype_size~2~))__footnote:[Each DLRMv2 sample consists of up to 700 user-item pairs draw from the distribution specified in https://github.com/mlcommons/inference/blob/master/recommendation/dlrm/pytorch/tools/dist_quantile.txt[dist_quantile.txt].] |__270*(13*dtype_size~1~+26*dtype_size~2~)__ | __throughput*270*(13*dtype_size~1~+26*dtype_size~2~)__
|Generative |SDXL |Subset of coco-2014 val captions (max_prompt_len=77) | __num_inputs*max_prompt_len*dtype_size__ | __77*dtype_size__ | __throughput*77*dtype_size__
|===
Expand Down
Loading