Interest in BERT version? #57

Open
lhallee opened this issue Dec 20, 2024 · 5 comments

lhallee commented Dec 20, 2024

Hello,

This is really amazing work. Certain fields, like protein language modeling, are quite reliant on BERT-like models. Is there any interest in a BERT implementation / speed run similar to this GPT2 version? I suppose the only differences are the attention implementation and the data processing?
Best,
Logan

lapp0 commented Dec 20, 2024

Hi Logan,

This sounds interesting.

Is there a good baseline model you could reference along with its training dataset? Any specific evaluations?

lhallee commented Dec 20, 2024

Glad you think so! ModernBERT just came out, and it seems like people are still interested in GLUE, which may be a good benchmark. Perhaps targeting the original BERT-base GLUE score (84.7) is best, since BERT-base is similar to this GPT2 project in terms of size and "age." FineWeb is probably still a good dataset choice.

Personally, I'm interested in BERT-like protein language modeling, so the closest equivalent to GPT2 in terms of size is ESM2-150M. The closest equivalent to FineWeb is probably OMGprot50. For this, we could just go off of ESM2's perplexity or loss on a newer evaluation set.
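
For concreteness, roughly what I mean (a minimal sketch: perplexity as the exponential of the masked-token cross-entropy; the helper name is just illustrative):

```python
import torch
import torch.nn.functional as F

def masked_perplexity(logits, labels):
    # labels are -100 everywhere except the masked positions, so the loss
    # (and hence the perplexity) only covers the predicted residues.
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return torch.exp(loss)
```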

What do you think?

lapp0 commented Dec 20, 2024

Here are some notes:

Dataset:

  • From the paper, "During training sequences are sampled with even weighting across ∼43 million UniRef50 training clusters from ∼138 million UniRef90 sequences so that over the course of training the model sees ∼65 million unique sequences." (a quick sketch of this sampling scheme follows the list)
  • The ESM repo references the "pkl objects containing sequences" and this file, which designates the IDs of the pkls for the train-test split.
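
Roughly, the even-weighting scheme amounts to the following (the `cluster_to_members` mapping is illustrative, not ESM's actual on-disk format):

```python
import random

def sample_sequence(cluster_to_members):
    # Even weighting across UniRef50 clusters: pick a cluster uniformly at
    # random, then pick one of its UniRef90 member sequences uniformly.
    cluster_id = random.choice(list(cluster_to_members))
    return random.choice(cluster_to_members[cluster_id])
```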

Objective:

  • "The ESM-2 language models are trained with the masked language modeling objective (18), which trains the model to predict the identity of randomly selected amino acids in a protein sequence by observing their context in the rest of the sequence"
  • HF implementation of EsmLMHead
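
A rough sketch of that masking objective, assuming the standard BERT 80/10/10 recipe (I haven't checked ESM-2's exact ratios; the function and its arguments are illustrative):

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    # Choose ~15% of positions as prediction targets, then replace 80% of
    # them with the mask token, 10% with a random token, and leave 10%
    # unchanged; the loss is computed on the target positions only.
    labels = input_ids.clone()
    targets = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~targets] = -100  # ignored by the loss

    masked_input = input_ids.clone()
    replace = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & targets
    masked_input[replace] = mask_token_id

    randomize = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & targets & ~replace
    random_tokens = torch.randint(vocab_size, labels.shape)
    masked_input[randomize] = random_tokens[randomize]
    return masked_input, labels
```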

We should start with a minimal-change baseline before experimenting with enhancements:

  • Write a dataloader for OMGprot50, replacing data/fineweb.py
  • Remove the causal mask from modded-nanogpt (no need to implement ModernBERT's global / local attention for now); a sketch follows the list
  • Ignore EsmLMHead for now (ESM2 uses an MLP head, modded-nanogpt uses a linear layer)
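
Sketch of the mask change with flex attention's block mask (the `docs` tensor and the mask names are assumptions on my part; modded-nanogpt's actual mask setup may differ):

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask

# Toy stand-in for the per-token document ids the dataloader would provide.
docs = torch.arange(1024, device="cuda") // 256  # four "documents" of 256 tokens
seq_len = docs.numel()

def causal_document_mask(b, h, q_idx, kv_idx):
    # GPT-style: attend only to earlier tokens within the same document.
    return (q_idx >= kv_idx) & (docs[q_idx] == docs[kv_idx])

def bidirectional_document_mask(b, h, q_idx, kv_idx):
    # BERT-style: drop the causal constraint, keep the document boundary.
    return docs[q_idx] == docs[kv_idx]

block_mask = create_block_mask(
    bidirectional_document_mask, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len
)
```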

Then, once that's validated, we can make these changes:

  • Dataset: IMO, we should reproduce the exact training run as a baseline since the data is available. We can convert the linked pickles to a Hugging Face Hub dataset.
  • Copy benchmark tooling from the esm repo.

In short, what you said makes sense. Please let me know if you disagree or want clarification about any part of what I said here.

If you'd like to work on this together, you can reach out via email in my profile.

lhallee commented Dec 20, 2024

Sounds great, I'll send you an email about this ESM project.

On the topic of modded-nanogpt, are you aware of anyone looking into Diff attention for the speed run? It seems to show some favorable loss convergence, and I have observed this locally as well. It doesn't seem directly applicable with flex attention, but there are flash attention versions, so there's potential.

lapp0 commented Dec 20, 2024

I saw the email. I'll try to get something working by EOD.

On the topic of modded-nanogpt, are you aware of anyone looking into Diff attention for the speed run?

I've tried it and it underperforms the current record.

It might not work well on tiny models: we already have a small head_dim of 128. Differential Transformers require twice as many heads, with half of the heads being used for noise canceling, resulting in a head_dim of 64. That's just a hypothesis as to why it didn't work.
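
For reference, the core of differential attention, simplified to a single head with a fixed λ (the paper learns and reparameterizes λ and adds a GroupNorm, both omitted here; this is a sketch, not the exact implementation I benchmarked):

```python
import torch
import torch.nn.functional as F

def diff_attention(q, k, v, lam=0.8):
    # q, k: (seq, d), with d split into two halves of d // 2 each; v: (seq, d_v).
    # The second attention map acts as a noise-canceling term, which is why
    # each component only gets half of the head dimension.
    q1, q2 = q.chunk(2, dim=-1)
    k1, k2 = k.chunk(2, dim=-1)
    scale = q1.size(-1) ** -0.5
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) * scale, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) * scale, dim=-1)
    return (a1 - lam * a2) @ v
```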

Check out #23 if you haven't already.
