[RISK] Trim Context Algorithm Issue #1

Open
toreleon opened this issue Feb 21, 2023 · 0 comments

Hi Kien, I am currently developing the same kind of chatbot as catesearch, and while reading your source I noticed a risk in your context-trimming code.
(screenshot of the context-trimming code)
You are checking the context length by splitting on whitespace, but the number of words is not the same as the number of tokens. Because GPT-3 encodes text with byte-pair encoding, a single word can map to many tokens, so the trimmed context can still exceed the model's token limit for some languages, such as Vietnamese, if such text appears in list_paragraph[:2]. I recommend using GPT2TokenizerFast (https://huggingface.co/docs/transformers/model_doc/gpt2) to measure the length in tokens instead of splitting on words.
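For example, a minimal sketch of what I mean (not your actual code; the names trim_context, list_paragraph, and MAX_CONTEXT_TOKENS are just illustrative, and the budget value is an assumption):

```python
from transformers import GPT2TokenizerFast

# GPT-2 and GPT-3 share the same BPE vocabulary, so this tokenizer gives a
# faithful token count for a GPT-3 prompt.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

MAX_CONTEXT_TOKENS = 2000  # assumed budget; set this to your model's actual limit


def trim_context(list_paragraph):
    """Keep whole paragraphs while the running token count stays under budget."""
    kept, total_tokens = [], 0
    for paragraph in list_paragraph:
        n_tokens = len(tokenizer.encode(paragraph))
        if total_tokens + n_tokens > MAX_CONTEXT_TOKENS:
            break
        kept.append(paragraph)
        total_tokens += n_tokens
    return "\n\n".join(kept)
```

Counting tokens this way works the same for English and Vietnamese, whereas a word count underestimates the token count badly for languages where BPE splits words into many pieces.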
