Commands
1. huggingface-cli delete-cache
2. docker build -t {image_name} .
3. docker run --gpus all -p 6006:6006 -itd --name "phr-chat-ai" c366b6937df4
4. Hugging Face TGI is a bit bugged; we need to change content.strip() to content in tokenizer_config.json (see the sketch after this list).
5. To create the dev container, we use "dockerfile-dev-env" to build an image with command 2 and run the container with command 3.
5.1 We have added requirements.txt to a GitHub gist and use it for the above.
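
A minimal sketch (Python) of the patch mentioned in command 4. The config path and the presence of a "chat_template" key are assumptions about the downloaded model files:

    import json

    # Assumed path to the downloaded model's tokenizer config
    CONFIG_PATH = "tokenizer_config.json"

    with open(CONFIG_PATH) as f:
        config = json.load(f)

    # The buggy call sits inside the Jinja chat template string
    if "chat_template" in config:
        config["chat_template"] = config["chat_template"].replace(
            "content.strip()", "content"
        )

    with open(CONFIG_PATH, "w") as f:
        json.dump(config, f, indent=2)
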
Observations:
1. LLaMA-2 doesn't predict spaces (" ") in the output ids generated by model.generate().
2. The LLaMA-2 tokenizer strips spaces while tokenizing, so most of the time no standalone space is present in the tokenized ids.
3. tokenizer.decode() adds the spaces back manually, taking care of special tokens and punctuation marks.
4. In an LLM we can control max_number_of_input_tokens and max_new_tokens (the maximum number of tokens to generate). The sum of the two is called the context length of the LLM.
5. In HF TGI, max-total-tokens (default 2048) specifies the maximum possible sum of input and output tokens.
6. In model.generate() we can specify max_new_tokens and max_length (the sum of the input prompt tokens and max_new_tokens). There is no max_number_of_input_tokens parameter, so we need to check the input length manually before calling model.generate() (see the sketch below the link).
https://huggingface.co/docs/transformers/en/main_classes/text_generation
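
A minimal sketch tying observations 1-3 and 6 together. The model id and the token budgets are assumptions; it shows the SentencePiece tokenizer folding spaces into tokens, decode() restoring them, and the manual input-length check before model.generate():

    from transformers import AutoTokenizer, AutoModelForCausalLM

    MODEL_ID = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    ids = tokenizer("Hello world!", return_tensors="pt").input_ids
    # Spaces are folded into the following token (the leading '▁'),
    # so no standalone " " id appears in the tokenized ids.
    print(tokenizer.convert_ids_to_tokens(ids[0]))  # e.g. ['<s>', '▁Hello', '▁world', '!']
    # decode() re-inserts the spaces around tokens and punctuation.
    print(tokenizer.decode(ids[0], skip_special_tokens=True))

    # generate() only knows max_new_tokens / max_length, so we cap the
    # input length ourselves before calling it.
    MAX_INPUT_TOKENS = 1024  # assumed budget
    MAX_NEW_TOKENS = 256
    if ids.shape[1] > MAX_INPUT_TOKENS:
        raise ValueError(f"prompt is {ids.shape[1]} tokens, limit {MAX_INPUT_TOKENS}")

    out = model.generate(ids, max_new_tokens=MAX_NEW_TOKENS)
    print(tokenizer.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
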
Training Observations
1. For contextual models like BERT, we should not do excessive preprocessing (removing stop words, stemming, lemmatization, etc.), as this can strip away the context of the sentence, which pre-trained models like BERT are good at capturing. We saw this in the suicide-prediction model: it performed better after training on the less-processed dataset, i.e., without stopword removal, punctuation removal, or lemmatization (see the sketch below).
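
A minimal sketch of why this happens. The checkpoint and the example sentences are illustrative assumptions; aggressive cleanup can delete the negation, so the fine-tuned model never sees it:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint

    raw = "I don't feel like myself anymore."
    # What stopword removal + lemmatization might leave behind (illustrative):
    over_processed = "feel like anymore"

    # Only the raw input keeps the negation ("don't"), which BERT's
    # self-attention needs to read the sentence correctly.
    print(tokenizer.tokenize(raw))             # keeps "don", "'", "t", ...
    print(tokenizer.tokenize(over_processed))  # negation is gone
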