Interest in BERT version? #57
Comments
Hi Logan, this sounds interesting. Is there a good baseline model you could reference, along with its training dataset? Any specific evaluations?
Glad you think so! ModernBERT just came out, and it seems people are still interested in GLUE, which could be a good benchmark. Matching the original BERT-base (84.7 on GLUE) is probably the right target, since it's similar to this GPT-2 project in terms of size and "age." FineWeb is probably still a good dataset choice. Personally, I'm interested in BERT-like protein language modeling, where the closest equivalent to GPT-2 in size is ESM2-150M, and the closest equivalent to FineWeb is probably OMGprot50. For that, we could just go off the ESM2 perplexity or loss on a newer evaluation set. What do you think?
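To make the eval concrete, here is a rough sketch of the ESM2 loss/perplexity check I have in mind. It assumes the HuggingFace `transformers` API and that `facebook/esm2_t30_150M_UR50D` is the ~150M ESM2 checkpoint; the single sequence below is a placeholder for a real held-out set (e.g. drawn from OMGprot50):

```python
import math
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

torch.manual_seed(0)  # masking is random, so fix the seed for comparable runs
device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed checkpoint name for the ~150M-parameter ESM2 model on the HF hub.
checkpoint = "facebook/esm2_t30_150M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint).to(device).eval()

# ESM/BERT-style random masking of 15% of residues.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

# Placeholder: in practice this would be a held-out evaluation split.
eval_sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR",
]

total_loss = 0.0
with torch.no_grad():
    for seq in eval_sequences:
        enc = tokenizer(seq, return_tensors="pt", truncation=True, max_length=1024)
        batch = collator([{k: v[0] for k, v in enc.items()}])
        batch = {k: v.to(device) for k, v in batch.items()}
        total_loss += model(**batch).loss.item()

avg_loss = total_loss / len(eval_sequences)
print(f"masked-LM loss: {avg_loss:.3f}  perplexity: {math.exp(avg_loss):.2f}")
```

Because the 15% masking is random, the number is noisy for a small eval set; averaging several masking passes per sequence would make it more stable.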
Here are some notes:
Dataset:
Objective:
We should start with a minimal-change baseline before experimenting with enhancements.
Then, once that's validated, we can make those changes.
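To pin down the objective part, here is a minimal sketch of the standard BERT masking recipe (select 15% of tokens as prediction targets; of those, replace 80% with [MASK], 10% with a random token, and leave 10% unchanged). `mask_token_id` and `vocab_size` are placeholders for whatever tokenizer we end up using:

```python
import torch

def mlm_corrupt(input_ids: torch.Tensor, mask_token_id: int, vocab_size: int,
                mlm_prob: float = 0.15):
    """BERT masking recipe: pick `mlm_prob` of positions as prediction targets,
    replace 80% of them with [MASK], 10% with a random token, and leave 10%
    unchanged. Returns (corrupted_ids, labels) with labels = -100 at positions
    that cross-entropy should ignore."""
    rand = lambda: torch.rand(input_ids.shape, device=input_ids.device)

    labels = input_ids.clone()
    targets = rand() < mlm_prob
    labels[~targets] = -100

    corrupted = input_ids.clone()
    masked = targets & (rand() < 0.8)                # 80% of targets -> [MASK]
    corrupted[masked] = mask_token_id
    random_tok = targets & ~masked & (rand() < 0.5)  # 10% of targets -> random token
    corrupted[random_tok] = torch.randint(
        vocab_size, input_ids.shape, device=input_ids.device, dtype=input_ids.dtype
    )[random_tok]
    # the remaining ~10% of targets keep their original token
    return corrupted, labels
```

A real version would also exclude special tokens and padding from the target selection.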
In short, what you said makes sense. Please let me know if you disagree or want clarification on any part of this. If you'd like to work on this together, you can reach out via the email in my profile.
Sounds great, I'll send you an email about the ESM project. On the topic of modded-nanogpt, are you aware of anyone looking into Diff attention for the speed run? It seems to have favorable loss convergence, and I've observed this locally as well. It doesn't seem directly applicable with FlexAttention, but there are FlashAttention implementations, so there's potential.
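For context, the core of Diff attention is two softmax attention maps subtracted before they are applied to V, which is why it doesn't drop straight into fused SDPA/FlexAttention kernels. A minimal single-head sketch (the paper's λ reparameterization, GroupNorm, and the causal mask used in the speed run are all omitted here):

```python
import torch

def diff_attention(q1, k1, q2, k2, v, lam):
    """Single-head differential attention, eager sketch.
    q*, k*: (B, T, d); v: (B, T, dv); lam: learned scalar.
    The two normalized maps are subtracted *before* being applied to v,
    so the usual fused-softmax attention kernels don't apply directly."""
    scale = q1.size(-1) ** -0.5
    a1 = torch.softmax(q1 @ k1.transpose(-2, -1) * scale, dim=-1)  # (B, T, T)
    a2 = torch.softmax(q2 @ k2.transpose(-2, -1) * scale, dim=-1)  # (B, T, T)
    return (a1 - lam * a2) @ v                                     # (B, T, dv)
```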
I saw the email. I'll try to get something working by EOD.
I've tried it, and it underperforms the current record. It might not work well on tiny models, since we already have a small head_dim. Check out #23 if you haven't already.
Hello,
This is really amazing work. Certain fields, like protein language modeling, are quite reliant on BERT-like models. Is there any interest in a BERT implementation / speed run similar to this GPT-2 version? I suppose the only differences are the attention implementation and the data processing? (A rough sketch of the attention-mask difference follows below.)
Best,
Logan
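On the attention difference: in FlexAttention terms, a BERT-style run would roughly just drop the causal condition from the mask_mod and keep the document boundaries. This is only a sketch, assuming PyTorch 2.5+ `flex_attention` on a GPU; the `docs` tensor and mask names are placeholders, and the actual modded-nanogpt mask may differ:

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

seq_len = 1024
# Placeholder: document id for each token position, as produced by the data pipeline.
docs = torch.zeros(seq_len, dtype=torch.int32, device="cuda")

def causal_doc_mask(b, h, q_idx, kv_idx):
    # GPT-2 style (shown for contrast): attend within the same document, causally.
    return (q_idx >= kv_idx) & (docs[q_idx] == docs[kv_idx])

def bidirectional_doc_mask(b, h, q_idx, kv_idx):
    # BERT style: drop the causal condition, keep document boundaries.
    return docs[q_idx] == docs[kv_idx]

block_mask = create_block_mask(bidirectional_doc_mask, B=None, H=None,
                               Q_LEN=seq_len, KV_LEN=seq_len)
q = k = v = torch.randn(1, 12, seq_len, 64, device="cuda", dtype=torch.bfloat16)
out = flex_attention(q, k, v, block_mask=block_mask)
```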