code for pubmedgpt pre-training #2

Open
yurakuratov opened this issue Dec 16, 2022 · 9 comments

@yurakuratov

Hi! I could not find the pre-training code mentioned in the blog post:

To train Pubmed GPT easily, quickly, and efficiently, we used the MosaicML Cloud for infrastructure and trained the model using MosaicML’s Composer and Streaming Dataset libraries. All model and training code is built off of PyTorch. See the code here!

https://www.mosaicml.com/blog/introducing-pubmed-gpt

Are you planning to make it public?
It would help to understand how the model was actually trained with MosaicML's Composer.
Another question: how was the model trained with FlashAttention converted to a Hugging Face-compatible GPT2LMHeadModel checkpoint?
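For context, the kind of conversion I would expect looks roughly like the sketch below. It assumes the Composer checkpoint keeps the wrapped model's weights under state["model"] with a "model." key prefix, and the config values are guesses rather than the actual ones.

```python
# Rough sketch only (not the authors' script): export a Composer checkpoint
# to a Hugging Face GPT2LMHeadModel directory. The checkpoint layout
# ("state" -> "model", "model." key prefix) and the config values are assumptions.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

ckpt = torch.load("composer_checkpoint.pt", map_location="cpu")
state_dict = {k.removeprefix("model."): v            # Python 3.9+; drop wrapper prefix
              for k, v in ckpt["state"]["model"].items()}

config = GPT2Config(vocab_size=28896, n_positions=1024,
                    n_embd=2560, n_layer=32, n_head=20)   # assumed 2.7B-scale sizes
model = GPT2LMHeadModel(config)
model.load_state_dict(state_dict, strict=False)  # FlashAttention-specific buffers may not map 1:1
model.save_pretrained("pubmedgpt-hf")
```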

@J38
Contributor

J38 commented Dec 16, 2022

Yes, we will improve the documentation on pre-training. I will discuss with MosaicML what we should post.

@metemadi

Thank you for such incredible work! Are you able to comment on how the new tokenizer was created? That is, were the combined tokens added to the "end" of the GPT tokenizer, or were tokens removed, etc.? How were the new token embeddings initialized? Again, a huge thank you for this amazing service to the open-source ML community!

@J38
Contributor

J38 commented Dec 17, 2022

A brand new tokenizer was trained with 28896 tokens. I'll upload the training script to this repo.

@J38
Contributor

J38 commented Dec 17, 2022

I put this in the tokenize folder. I just ran it on a file with all of the text from the PubMed abstracts.
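Roughly along these lines, i.e. training a byte-level BPE tokenizer on one big text file (a sketch of the idea, not necessarily the exact script; the file name and special tokens are placeholders):

```python
# Sketch: train a new BPE tokenizer from scratch on PubMed abstract text.
# File name, min_frequency, and special tokens are illustrative placeholders.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["pubmed_abstracts.txt"],   # one plain-text file with all abstracts
    vocab_size=28896,                 # vocabulary size mentioned above
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save_model("pubmedgpt_tokenizer")  # writes vocab.json and merges.txt
```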

@J38
Contributor

J38 commented Dec 17, 2022

When you launch pre-training from scratch with the Hugging Face and Composer combination we had, it just randomly initializes the embeddings ...
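Concretely, building the model from a config instead of from_pretrained gives freshly initialized weights, including the token embeddings (a small illustration; everything beyond vocab_size is left at the library defaults):

```python
# Illustration: instantiating from a config (not from_pretrained) means all
# weights, including the token embeddings, start randomly initialized.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(vocab_size=28896)      # new tokenizer's vocabulary size
model = GPT2LMHeadModel(config)            # random init, no pre-trained weights loaded
print(model.transformer.wte.weight.shape)  # torch.Size([28896, 768]) with the default n_embd
```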

@J38
Contributor

J38 commented Dec 17, 2022

@metemadi

metemadi commented Dec 17, 2022

Thank you, thank you! In the blog post you say "PubMedGPT 2.7B was trained on all the PubMed abstracts and full documents from The Pile." So do you start with a pre-trained model (like GPT-Neo 2.7B, which was pre-trained with a different tokenizer and trained on The Pile), then change tokenizers and train again on PubMed, or do you just mix the PubMed data with the Pile data and start the whole thing from scratch? A huge thank you again, this is so cool.

@J38
Contributor

J38 commented Dec 17, 2022

Everything was from scratch. So we trained the tokenizer first. And then pre-trained the model from scratch using the new tokenizer. There is no connection to any other tokenizer or model.
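So the overall shape of the pipeline is: train the tokenizer, build a fresh GPT-2-style model with that vocabulary, and pre-train it with Composer. A rough sketch of that shape (toy data, sizes, and duration; not the actual training script):

```python
# Conceptual sketch only: a GPT-2-style model pre-trained from scratch with
# MosaicML Composer. The dataset, config sizes, and duration are toy placeholders.
import torch
from torch.utils.data import DataLoader, Dataset
from composer import Trainer
from composer.models import HuggingFaceModel
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(vocab_size=28896)              # new tokenizer's vocab size
model = HuggingFaceModel(GPT2LMHeadModel(config))  # randomly initialized, no warm start

class ToyLMDataset(Dataset):
    """Placeholder: random token ids standing in for tokenized PubMed/Pile text."""
    def __len__(self):
        return 8
    def __getitem__(self, i):
        ids = torch.randint(0, 28896, (128,))
        return {"input_ids": ids, "labels": ids}   # causal LM loss computed by the HF model

trainer = Trainer(
    model=model,
    train_dataloader=DataLoader(ToyLMDataset(), batch_size=2),
    max_duration="10ba",                           # toy duration; the real run is far longer
)
trainer.fit()
```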

@shashank140195

Hi @J38.

Any updates on making the pre-training code of BioMedLM public?
