Describe the issue

Thanks for the great work!

Based on my own implementation, I have some questions about the settings of vertical_size and slash_size. Larger values of vertical_size and slash_size do not reliably give better performance (i.e., PPL in my experiments). Intuitively, as the vertical and slash sizes increase, more weights of the attention matrix are preserved (along with the corresponding KV cache), so performance should improve; however, my experimental results sometimes contradict this. There also seems to be a trade-off between v_size and s_size; in my experiments, s_size has the larger impact on performance.

In your empirical experiments exploring different (v_size, s_size) settings (e.g., (30, 800), (500, 700), (1000, 6096), ...), does performance improve as v_size and s_size increase, or is there some other pattern?
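For concreteness, here is a minimal sketch of the vertical-slash mask construction I have in mind (a dense boolean-mask reconstruction for illustration only, not the optimized sparse kernel; attn, last_q, and the scoring heuristic are my assumptions):

```python
import torch

def vertical_slash_mask(attn: torch.Tensor, v_size: int, s_size: int,
                        last_q: int = 64) -> torch.Tensor:
    """attn: (seq_len, seq_len) post-softmax attention map for one head."""
    seq_len = attn.size(0)
    # Score columns (vertical lines) using only the last `last_q` queries.
    v_score = attn[-last_q:].sum(dim=0)                          # (seq_len,)
    # Score diagonals (slash lines) the same way; offset d = 0 is the main
    # diagonal, d > 0 are the diagonals below it.
    s_score = torch.stack(
        [attn.diagonal(-d)[-last_q:].sum() for d in range(seq_len)]
    )
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    mask[:, v_score.topk(min(v_size, seq_len)).indices] = True   # verticals
    for d in s_score.topk(min(s_size, seq_len)).indices.tolist():
        mask.diagonal(-d).fill_(True)                            # slashes
    return mask & torch.ones_like(mask).tril()                   # keep causal
```

Under a fixed total budget, shifting capacity between v_size and s_size changes which structure of the attention map is preserved, which is why I would expect some trade-off rather than monotonic gains.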
Looking forward to your reply!
Supplementary details: I ran the experiments on llama2-7B with a sequence length of 4K and last_q = 64 at inference time. The metric is PPL on PG19. These experiments only explore the impact of vertical_size and slash_size on performance, without considering efficiency for now.
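As a rough sketch of this evaluation setup (illustrative only; assumes HuggingFace transformers, and the model name and loop structure are assumptions rather than my exact script):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # llama2-7B, as above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).to("cuda").eval()

seq_len, last_q = 4096, 64                # score only the last 64 positions

@torch.no_grad()
def window_nll(input_ids: torch.Tensor) -> torch.Tensor:
    """Per-token NLL of the last `last_q` tokens of one (1, seq_len) window."""
    logits = model(input_ids.to(model.device)).logits
    # Shift so that logits[t] predicts input_ids[t + 1].
    logits, targets = logits[:, :-1], input_ids[:, 1:].to(model.device)
    return torch.nn.functional.cross_entropy(
        logits[:, -last_q:].transpose(1, 2),   # (1, vocab, last_q)
        targets[:, -last_q:],                  # (1, last_q)
        reduction="none",
    ).flatten()

# PPL = exp(mean NLL over all evaluated 4K windows of the PG19 stream).
```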
Generally speaking, different heads have varying sensitivity to vertical size and slash size. Some heads, such as the one with the config (3500, 100), require a larger vertical size rather than a larger slash size.
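For illustration, a per-head budget table might look like the sketch below; apart from the (3500, 100) head mentioned above, the numbers are hypothetical placeholders, not measured configs:

```python
# Per-head (vertical_size, slash_size) budgets under a shared total budget.
head_budgets = {
    ("layer_0", "head_5"):  (3500, 100),   # vertical-dominated head (as above)
    ("layer_0", "head_12"): (100, 3500),   # slash-dominated (local) head
    ("layer_1", "head_3"):  (1000, 2600),  # mixed head
}
```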
Second, PPL is not an effective indicator in long-context scenarios: it is dominated by, and almost exclusively reflects, the local window. This is why the StreamingLLM method scores so well on PPL tests. For downstream tasks, I would recommend KV retrieval or Needle In A Haystack (simple as it is, it reflects capability across different context lengths and depths).
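A minimal Needle-In-A-Haystack probe can be sketched as follows (my own illustration, not the official harness; the needle text and filler are arbitrary):

```python
def build_haystack(filler: str, needle: str, depth: float, n_chars: int) -> str:
    """Insert `needle` at relative `depth` (0 = start, 1 = end) of the context."""
    haystack = (filler * (n_chars // len(filler) + 1))[:n_chars]
    pos = int(depth * n_chars)
    return haystack[:pos] + " " + needle + " " + haystack[pos:]

needle = "The secret passphrase is 'blue-giraffe-42'."
prompt = (
    build_haystack("The grass is green. ", needle, depth=0.5, n_chars=12_000)
    + "\n\nQuestion: What is the secret passphrase?\nAnswer:"
)
# Scoring: sweep (context length, depth) pairs and check whether the
# generation contains "blue-giraffe-42" at each point of the grid.
```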
Thanks for your response! :) I am also wondering whether the y-axis in Figure 5 shows the log of perplexity (so actual PPL would be e^8 to e^10) or the raw perplexity values (8 to 10)?