Can you print aggregate_metrics['accept_counts'] and check whether it makes sense? accept_counts records how many token predictions from the draft model were accepted by the verifier model. If it is too low, you won't get much of a performance boost from speculative sampling.
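A rough way to sanity-check those counts is sketched below. It assumes (this layout is an assumption, not something confirmed in this thread) that accept_counts is a list of equal-length histograms, one per generation run, where entry i is the number of speculative steps in which exactly i draft tokens were accepted. The helper name summarize_accept_counts and the sample numbers are made up for illustration.

```python
def summarize_accept_counts(accept_counts):
    """Summarize draft-token acceptance.

    Assumed layout: accept_counts is a list of equal-length histograms,
    where histogram[i] is the number of speculative steps in which
    exactly i draft tokens were accepted by the verifier.
    """
    # Sum the histograms element-wise across all runs.
    totals = [sum(column) for column in zip(*accept_counts)]
    n_steps = sum(totals)
    if n_steps == 0:
        raise ValueError("no speculative steps recorded")
    # Fraction of steps that accepted 0, 1, 2, ... draft tokens.
    probs = [count / n_steps for count in totals]
    # Average number of draft tokens accepted per speculative step.
    mean_accepted = sum(i * count for i, count in enumerate(totals)) / n_steps
    return probs, mean_accepted


# Made-up numbers, only to show the shape of the data.
example = [[6, 2, 1, 1], [4, 3, 2, 1]]
probs, mean_accepted = summarize_accept_counts(example)
print("acceptance distribution:", probs)
print("mean accepted draft tokens per step:", mean_accepted)
```

If most of the mass sits at index 0 (no draft tokens accepted), each speculative step costs a draft pass plus a verifier pass while producing about one token, so little or no speedup should be expected.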
I tested speculative sampling with llama2-7b and llama2-70b on an A800, but the speedup was close to zero, and in most cases negative.
llama2-7b base 103.25 tokens/s
llama2-7b Speculative Sampling 104.52 tokens/s
llama2-70b base 14.55 tokens/s
llama2-70b Speculative Sampling 13.41 tokens/s
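To put the numbers above on a common scale, here is a small sketch that turns each pair of throughputs into a relative speedup. The throughput values are the ones reported above; the speedup helper and the dictionary layout are just illustrative.

```python
def speedup(base_tps, spec_tps):
    """Ratio of speculative-sampling throughput to baseline throughput."""
    return spec_tps / base_tps


# (baseline tokens/s, speculative sampling tokens/s), as reported above.
results = {
    "llama2-7b": (103.25, 104.52),
    "llama2-70b": (14.55, 13.41),
}

for name, (base, spec) in results.items():
    ratio = speedup(base, spec)
    # A ratio above 1.0 is a speedup; below 1.0 is a slowdown.
    print(f"{name}: {ratio:.3f}x ({(ratio - 1) * 100:+.1f}%)")
```

By this measure the 7b run is roughly 1% faster and the 70b run is roughly 8% slower than its baseline.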