Interest in BERT version? #57

Open
lhallee opened this issue Dec 20, 2024 · 5 comments

lhallee commented Dec 20, 2024

Hello,

This is really amazing work. Certain fields, like protein language modeling, are quite reliant on BERT-like models. Is there any interest in a BERT implementation / speed run similar to this GPT2 version? I suppose the only differences are the attention implementation and the data processing?
Best,
Logan

lapp0 commented Dec 20, 2024

Hi Logan,

This sounds interesting.

Is there a good baseline model you could reference along with its training dataset? Any specific evaluations?

lhallee commented Dec 20, 2024

Glad you think so! ModernBERT just came out, and it seems like people are still interested in GLUE, which may be a good benchmark. Perhaps targeting the original BERT-base GLUE score (84.7) is best, since BERT-base is similar to this GPT2 project in terms of size and "age." FineWeb is probably still a good dataset choice.

Personally, I'm interested in BERT-like protein language modeling, so the closest equivalent to GPT2 in terms of size is ESM2-150M. The closest equivalent to FineWeb is probably OMGprot50. For this, we could just go off of ESM2's perplexity or loss on a newer evaluation set.
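
For concreteness, roughly what I mean (a minimal sketch: perplexity as the exponential of the masked-token cross-entropy; the helper name is just illustrative):

```python
import torch
import torch.nn.functional as F

def masked_perplexity(logits, labels):
    # labels are -100 everywhere except the masked positions, so the loss
    # (and hence the perplexity) only covers the predicted residues.
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return torch.exp(loss)
```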

What do you think?

lapp0 commented Dec 20, 2024

Here are some notes:

Dataset:

  • From the paper, "During training sequences are sampled with even weighting across ∼43 million UniRef50 training clusters from ∼138 million UniRef90 sequences so that over the course of training the model sees ∼65 million unique sequences." (a quick sketch of this sampling scheme follows the list)
  • The ESM repo references the "pkl objects containing sequences" and this file, which designates the IDs of the pkls for the train-test split.
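
Roughly, the even-weighting scheme amounts to the following (the `cluster_to_members` mapping is illustrative, not ESM's actual on-disk format):

```python
import random

def sample_sequence(cluster_to_members):
    # Even weighting across UniRef50 clusters: pick a cluster uniformly at
    # random, then pick one of its UniRef90 member sequences uniformly.
    cluster_id = random.choice(list(cluster_to_members))
    return random.choice(cluster_to_members[cluster_id])
```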

Objective:

  • "The ESM-2 language models are trained with the masked language modeling objective (18), which trains the model to predict the identity of randomly selected amino acids in a protein sequence by observing their context in the rest of the sequence"
  • HF implementation of EsmLMHead
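
A rough sketch of that masking objective, assuming the standard BERT 80/10/10 recipe (I haven't checked ESM-2's exact ratios; the function and its arguments are illustrative):

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    # Choose ~15% of positions as prediction targets, then replace 80% of
    # them with the mask token, 10% with a random token, and leave 10%
    # unchanged; the loss is computed on the target positions only.
    labels = input_ids.clone()
    targets = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~targets] = -100  # ignored by the loss

    masked_input = input_ids.clone()
    replace = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & targets
    masked_input[replace] = mask_token_id

    randomize = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & targets & ~replace
    random_tokens = torch.randint(vocab_size, labels.shape)
    masked_input[randomize] = random_tokens[randomize]
    return masked_input, labels
```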

We should start with a minimal-change baseline before experimenting with enhancements:

  • Write a dataloader for OMGprot50, replacing data/fineweb.py
  • Remove the causal mask from modded-nanogpt (no need to implement ModernBERT's global / local attention for now); a sketch follows the list
  • Ignore EsmLMHead for now (ESM2 uses an MLP head, modded-nanogpt uses a linear layer)
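
Sketch of the mask change with flex attention's block mask (the `docs` tensor and the mask names are assumptions on my part; modded-nanogpt's actual mask setup may differ):

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask

# Toy stand-in for the per-token document ids the dataloader would provide.
docs = torch.arange(1024, device="cuda") // 256  # four "documents" of 256 tokens
seq_len = docs.numel()

def causal_document_mask(b, h, q_idx, kv_idx):
    # GPT-style: attend only to earlier tokens within the same document.
    return (q_idx >= kv_idx) & (docs[q_idx] == docs[kv_idx])

def bidirectional_document_mask(b, h, q_idx, kv_idx):
    # BERT-style: drop the causal constraint, keep the document boundary.
    return docs[q_idx] == docs[kv_idx]

block_mask = create_block_mask(
    bidirectional_document_mask, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len
)
```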

Then, once that's validated, we can make these changes:

  • Dataset: IMO, we should reproduce the exact training run as a baseline since the data is available. We can convert the linked pickles to a Hugging Face Hub dataset.
  • Copy benchmark tooling from the esm repo.

In short, what you said makes sense. Please let me know if you disagree or want clarification about any part of what I said here.

If you'd like to work on this together, you can reach out via email in my profile.

lhallee commented Dec 20, 2024

Sounds great, I'll send you an email about this ESM project.

On the topic of modded-nanogpt, are you aware of anyone looking into Diff attention for the speed run? It seems to show some favorable loss convergence, and I have observed this locally as well. It doesn't seem directly applicable with flex attention, but there are flash attention versions, so there's potential.

lapp0 commented Dec 20, 2024

I saw the email. I'll try to get something working by EOD.

On the topic of modded-nanogpt, are you aware of anyone looking into Diff attention for the speed run?

I've tried it and it underperforms the current record.

It might not work well on tiny models: we already have a small head_dim of 128. Differential Transformers require twice as many heads, with half of the heads being used for noise canceling, resulting in a head_dim of 64. That's just a hypothesis as to why it didn't work.
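
For reference, the core of differential attention, simplified to a single head with a fixed λ (the paper learns and reparameterizes λ and adds a GroupNorm, both omitted here; this is a sketch, not the exact implementation I benchmarked):

```python
import torch
import torch.nn.functional as F

def diff_attention(q, k, v, lam=0.8):
    # q, k: (seq, d), with d split into two halves of d // 2 each; v: (seq, d_v).
    # The second attention map acts as a noise-canceling term, which is why
    # each component only gets half of the head dimension.
    q1, q2 = q.chunk(2, dim=-1)
    k1, k2 = k.chunk(2, dim=-1)
    scale = q1.size(-1) ** -0.5
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) * scale, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) * scale, dim=-1)
    return (a1 - lam * a2) @ v
```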

Check out #23 if you haven't already.
