code for pubmedgpt pre-training #2
Yes, we will improve the documentation on pre-training. I will discuss with MosaicML what we should post.
Thank you for such incredible work! Are you able to comment on how the new tokenizer was created? That is, were the combined tokens added to the "end" of the GPT tokenizer, or were tokens removed, etc.? How were the new token embeddings initialized? Again, a huge thank you for this amazing service to the open-source ML community!
A brand new tokenizer was trained with 28896 tokens. I'll upload the training script to this repo.
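For reference, training a byte-level BPE tokenizer from scratch with the Hugging Face `tokenizers` library might look roughly like the sketch below. This is only an illustration under assumptions, not the released training script: the file paths and special tokens are placeholders, and only the vocabulary size comes from the comment above.

```python
# Hypothetical sketch of from-scratch tokenizer training (not the authors' script).
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["pubmed_abstracts.txt", "pubmed_fulltext.txt"],  # placeholder corpus files
    vocab_size=28896,                  # vocabulary size mentioned above
    min_frequency=2,
    special_tokens=["<|endoftext|>"],  # GPT-2-style end-of-text token (assumed)
)
tokenizer.save_model("pubmed_tokenizer")  # writes vocab.json and merges.txt
```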
I put this in the
When you launch pre-training from scratch with the Hugging Face and Composer combination we had, it will just randomly initialize the embeddings ...
I believe this is where embeddings get initialized ... |
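To illustrate the point above: when an HF GPT-2 is built from a config rather than via from_pretrained, all weights, including the token embedding matrix, are randomly initialized by the library's `_init_weights`. A small sketch (sizes other than the vocabulary are HF defaults, used only for illustration):

```python
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(vocab_size=28896)   # new tokenizer's vocabulary size
model = GPT2LMHeadModel(config)         # no pretrained weights are loaded

# transformer.wte holds the token embeddings; they are drawn from
# N(0, config.initializer_range**2) during initialization.
wte = model.transformer.wte.weight
print(wte.shape)         # torch.Size([28896, 768]) with the default n_embd
print(wte.std().item())  # roughly 0.02, the default initializer_range
```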
Thank you, thank you! In the blog post you say "PubMedGPT 2.7B was trained on all the PubMed abstracts and full documents from The Pile." So do you start with a pre-trained model (like GPT-Neo 2.7B, which was pre-trained with a different tokenizer and trained on The Pile), then change tokenizers and train again on PubMed, or do you just mix the PubMed data with the Pile data and start the whole thing from scratch? A huge thank you again - this is so cool!
Everything was from scratch: we trained the tokenizer first, and then pre-trained the model from scratch using the new tokenizer. There is no connection to any other tokenizer or model.
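A minimal sketch of what the "Hugging Face + Composer" from-scratch setup could look like, purely as an illustration of the pieces involved. The toy dataset, model sizes, batch shape, and optimizer settings below are assumptions, not the authors' configuration:

```python
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import GPT2Config, GPT2LMHeadModel
from composer import Trainer
from composer.models import HuggingFaceModel

class ToyCorpus(Dataset):
    """Stand-in for the real PubMed/Pile dataloader (hypothetical)."""
    def __len__(self):
        return 64
    def __getitem__(self, idx):
        ids = torch.randint(0, 28896, (128,))
        return {"input_ids": ids, "labels": ids.clone()}

config = GPT2Config(vocab_size=28896, n_positions=1024,
                    n_embd=768, n_layer=12, n_head=12)  # small sizes for illustration
model = GPT2LMHeadModel(config)           # randomly initialized, from scratch
composer_model = HuggingFaceModel(model)  # wrap the HF model for Composer

trainer = Trainer(
    model=composer_model,
    train_dataloader=DataLoader(ToyCorpus(), batch_size=8),
    max_duration="10ba",  # a few batches, just to show the training loop
    optimizers=torch.optim.AdamW(composer_model.parameters(), lr=1.6e-4),
)
trainer.fit()
```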
Hi @J38. Any updates on making the pre-training code of BioMedLM public? |
Hi! I could not find the pre-training code mentioned in the blog post:
https://www.mosaicml.com/blog/introducing-pubmed-gpt
Are you planning to make it public?
It would help to understand how the model was actually trained with MosaicML's Composer.
Another question: how was the model trained with FlashAttention converted to a Hugging Face-compatible GPT2LMHeadModel checkpoint?
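As a rough sketch of what such a conversion can involve (this is not the authors' conversion script): assuming the Composer checkpoint is a torch-loadable dict with the weights under `state["state"]["model"]`, and that the parameter names only need renaming to line up with `GPT2LMHeadModel`, one could do something like the following. A model trained with FlashAttention modules would likely need a fuller key-by-key mapping than the simple prefix strip shown here, and the 2.7B config sizes are assumed:

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

ckpt = torch.load("composer-checkpoint.pt", map_location="cpu")
raw_state = ckpt["state"]["model"]  # assumed checkpoint layout

# Example remapping: drop the wrapper prefix added by Composer's HuggingFaceModel.
# FlashAttention blocks would need an explicit key-by-key mapping instead.
hf_state = {k.removeprefix("model."): v for k, v in raw_state.items()}

config = GPT2Config(vocab_size=28896, n_positions=1024,
                    n_embd=2560, n_layer=32, n_head=20)  # assumed 2.7B dimensions
model = GPT2LMHeadModel(config)
missing, unexpected = model.load_state_dict(hf_state, strict=False)
print("missing:", missing)
print("unexpected:", unexpected)

model.save_pretrained("biomedlm-hf")  # writes a standard Hugging Face checkpoint folder
```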