Question about your paper #2

Open
fahadh4ilyas opened this issue Jun 13, 2024 · 3 comments

Comments

@fahadh4ilyas

Hi, I've been reading your paper and I find it fascinating that, through grokking, a transformer model can reach high accuracy even on evaluation data. This is completely different from what I knew, namely that overfitting is bad. I have several questions about fine-tuning transformer models.

So, I've been fine-tuning Llama-3 on my dataset, which consists of UltraChat, ShareGPT4, etc. Will grokking make the model do general tasks better? I ask because your paper only trains a model on a single type of task. Also, your paper says that the size of the data doesn't matter, but fine-tuning tends to involve very large datasets, and lowering the loss is quite hard even on the training data. Is grokking still a good idea, given that fine-tuning until grokking is reached takes a very long time? And what about fine-tuning with LoRA? Does grokking still work with LoRA, or does it only work with full-parameter training?
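To be concrete, this is roughly the LoRA setup I have in mind (just a minimal sketch; the model name, target modules, and hyperparameters are placeholders, not something taken from your paper):

```python
# Minimal LoRA fine-tuning sketch with Hugging Face transformers + peft.
# The model name and hyperparameters below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Low-rank adapters on the attention projections; all other weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of parameters are trainable
```

My worry is whether grokking-like generalization can still happen when only this small fraction of parameters is updated.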

Thank you for taking the time to read and answer my questions.

@ThisIsSoMe

mark

@Boshi-Wang
Collaborator

Boshi-Wang commented Jun 20, 2024

Thanks for your interest in our work!

I think it's quite hard to give definitive answers to your questions, as it is difficult to define what "general tasks" means, and we also don't have an accurate way of characterizing the relation between those general tasks and the instruction-tuning data you use. Nevertheless, I will try to break it down and post my thoughts here:

  1. In our paper, we are always training transformers from scratch. In the pre-train + fine-tune scenario, things could be quite complicated. For example, do you think the knowledge/skills required for solving the "general tasks" are mostly gained from your instruction-tuning data, or from the pretrained model? If you think that most of the required knowledge/skills are already there in the pretrained model, then the focus should be on how to elicit those skills and prevent forgetting, and hence you also don't want to run fine-tuning for very long (and certainly not until grokking), since the model will forget more and more as it fits specifically to your instruction-tuning data. On the other hand, if you think the knowledge/skills mostly come from your instruction-tuning data, then tuning for very long, even until grokking, may be a good idea. Of course, in reality things are always somewhere in between, so the best strategy would again depend on quantifying their relative strengths.

  2. Do you think the general tasks you care about could be somehow formulated as reasoning over parametric knowledge? If yes, then where do you think the necessary knowledge and rules are acquired (again, in view of the tuning data and the pretrained model)? If both are already there in the pretrained model, then this is again the first situation above. If the model can already systematically apply the rules over knowledge and is just missing the knowledge that is in your tuning data (e.g., certain domain-specific knowledge), you can simply train the model to additionally memorize that knowledge; there is no need to train until grokking. Now, if the pretrained model doesn't have the rules/the ability to systematically apply the rules, this is where grokking is likely needed. You may also want to consider augmenting your tuning data with samples that play a role similar to the "inferred facts" in our work, which should accelerate generalization. Also, there might be some inherent limitations in the current transformer's generalization; since I assume you can't change the architecture, you may want to consider augmenting your tuning data with verbalized reasoning steps (e.g., maybe obtained by CoT-prompting some SoTA LLMs); a rough sketch of that kind of augmentation is below.
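To illustrate that last suggestion, here is a rough sketch of augmenting existing (question, answer) pairs with verbalized reasoning steps from a stronger LLM; the client, teacher model, prompt, and file names are illustrative assumptions, not part of our work:

```python
# Sketch: augment tuning data with verbalized reasoning steps from a stronger LLM.
# The teacher model, prompt template, and file names are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def add_reasoning(example: dict) -> dict:
    """Ask a stronger LLM to verbalize step-by-step reasoning for an existing (question, answer) pair."""
    prompt = (
        f"Question: {example['question']}\n"
        f"Answer: {example['answer']}\n"
        "Explain, step by step, how to reach this answer."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder teacher model
        messages=[{"role": "user", "content": prompt}],
    )
    example["reasoning"] = response.choices[0].message.content
    return example

with open("tuning_data.jsonl") as f_in, open("tuning_data_cot.jsonl", "w") as f_out:
    for line in f_in:
        f_out.write(json.dumps(add_reasoning(json.loads(line))) + "\n")
```

The generated reasoning can then be folded into the target responses, so the model is trained to produce the intermediate steps rather than only the final answer.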

Again, it is hard to give a definitive answer, but I hope our work and my thoughts here can open up some directions for you to think about what the better strategy would be in your case!

@fahadh4ilyas
Author

Do you think the general tasks you care about could be somehow formulated as reasoning over parametric knowledge?

I'm sorry, but I don't quite understand the meaning of "reasoning over parametric knowledge". I'm fine-tuning an already pretrained model, "Llama 3 7B", so I cannot tell whether the pretrained model already has the knowledge contained in my dataset or not. I also cannot do pretraining, because the model is quite large and pretraining needs really massive resources.

If the model can already systematically apply the rules over knowledge and is just missing the knowledge that is in your tuning data (e.g., certain domain-specific knowledge), you can simply train the model to additionally memorize that knowledge; there is no need to train until grokking.

Did you mean that if, for example, the base model already understands a sentiment task (categorizing text as positive, neutral, or negative based on the context) but doesn't have knowledge about a certain context, for example a person who is not famous enough to be known by the model, then I should just train the model on sample pairs of text about that person and the sentiment of that text, but not until grokking?
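For instance, the kind of pairs I have in mind look roughly like this (just a sketch; the person, texts, and labels are made up):

```python
# Illustrative (text, sentiment) pairs about a person the base model doesn't know.
# The person, texts, and labels below are made up.
samples = [
    {"text": "Andi's new bakery finally opened and the bread is wonderful.", "sentiment": "positive"},
    {"text": "Andi's bakery was closed again today with no explanation.", "sentiment": "negative"},
    {"text": "Andi's bakery is located on the corner of the main street.", "sentiment": "neutral"},
]
```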
