Question about your paper #2
Hi, I've been reading your paper, and I find it fascinating that through grokking a transformer model can reach high accuracy even on evaluation data, which is quite different from what I knew about overfitting being bad. I have several questions about fine-tuning transformer models.

I've been fine-tuning Llama 3 on my dataset, which consists of UltraChat, ShareGPT4, etc. Will grokking make the model better at general tasks? Your paper only trains a model on a single type of task. The paper also says that the size of the data doesn't matter, but fine-tuning tends to involve very large datasets, and even lowering the loss on the training data is quite hard. Is grokking still a good idea, given that fine-tuning until grokking is reached takes a very long time? And what about fine-tuning with LoRA: does grokking still work with LoRA, or does it only work for full-parameter training?

Thank you for taking the time to read and answer my questions.

Comments

mark:

Thanks for the interest in our work! I think it's quite hard to give definitive answers to your questions, as it is difficult to define what "general tasks" means, and we also don't have an accurate way of characterizing the relation between the general tasks and the instruction-tuning data you use. Nevertheless, I will try to break it down and post my thoughts here:

Again, it is hard to have a definitive answer, but I hope our work and my thoughts here can open up some directions for you to think about what a better strategy would be for your case!

I'm sorry, but I don't quite understand the meaning of "reasoning over parametric knowledge". I'm fine-tuning an already pretrained model, Llama 3 7B, so I have no way of knowing whether the pretrained Llama 3 7B already contains the knowledge in my dataset. I also cannot do pretraining, because the model is quite large and pretraining requires really massive resources.

Did you mean that if, for example, the base model can handle a sentiment task (categorizing text as positive, neutral, or negative based on context) but lacks knowledge about a certain context, for example a person who is not famous enough for the model to know, I should train the model on pairs of text about that person and the corresponding sentiment, but not until grokking?
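For concreteness, here is a minimal sketch (not from the paper or this repository) of the kind of setup being asked about: LoRA fine-tuning of a pretrained causal LM on (text, sentiment) pairs with Hugging Face transformers, peft, and datasets, trained for a few epochs rather than all the way to grokking. The model name, example pairs, and hyperparameters are placeholders, not recommendations.

```python
# Sketch only: LoRA fine-tuning on (text, sentiment) pairs, stopped after a
# few epochs instead of training until grokking. All names below are
# placeholders chosen for illustration.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "meta-llama/Meta-Llama-3-8B"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(base)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # common LoRA targets for Llama-style models
    task_type="CAUSAL_LM",
))

# Hypothetical (text, sentiment) pairs formatted as instruction-style prompts.
pairs = [
    {"text": "The launch event went smoothly.", "label": "positive"},
    {"text": "The update broke my workflow.", "label": "negative"},
]

def to_example(row):
    prompt = f"Classify the sentiment of: {row['text']}\nSentiment: {row['label']}"
    return tokenizer(prompt, truncation=True, max_length=256)

ds = Dataset.from_list(pairs).map(to_example, remove_columns=["text", "label"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=2,
        num_train_epochs=3,      # deliberately short; no grokking-length run
        learning_rate=2e-4,
        logging_steps=10,
    ),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Whether a run like this would ever reach a grokking-style transition, with LoRA or with full-parameter training, is exactly the open question in this thread; the sketch only illustrates the mechanics being discussed.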