code for pubmedgpt pre-training #2

Open
yurakuratov opened this issue Dec 16, 2022 · 9 comments

@yurakuratov

Hi! I could not find the pre-training code mentioned in the blog post:

To train Pubmed GPT easily, quickly, and efficiently, we used the MosaicML Cloud for infrastructure and trained the model using MosaicML’s Composer and Streaming Dataset libraries. All model and training code is built off of PyTorch. See the code here!

https://www.mosaicml.com/blog/introducing-pubmed-gpt

Are you planning to make it public?
It would help to understand how the model was actually trained with MosaicML's Composer.
Another question: how was the model trained with FlashAttention converted to a Hugging Face-compatible GPT2LMHeadModel checkpoint?
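For context, the kind of conversion I would expect looks roughly like the sketch below. It assumes the Composer checkpoint keeps the wrapped model's weights under state["model"] with a "model." key prefix, and the config values are guesses rather than the actual ones.

```python
# Rough sketch only (not the authors' script): export a Composer checkpoint
# to a Hugging Face GPT2LMHeadModel directory. The checkpoint layout
# ("state" -> "model", "model." key prefix) and the config values are assumptions.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

ckpt = torch.load("composer_checkpoint.pt", map_location="cpu")
state_dict = {k.removeprefix("model."): v            # Python 3.9+; drop wrapper prefix
              for k, v in ckpt["state"]["model"].items()}

config = GPT2Config(vocab_size=28896, n_positions=1024,
                    n_embd=2560, n_layer=32, n_head=20)   # assumed 2.7B-scale sizes
model = GPT2LMHeadModel(config)
model.load_state_dict(state_dict, strict=False)  # FlashAttention-specific buffers may not map 1:1
model.save_pretrained("pubmedgpt-hf")
```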

@J38
Contributor

J38 commented Dec 16, 2022

Yes, we will improve the documentation on pre-training. I will discuss with MosaicML what we should post.

@metemadi

Thank you for such incredible work! Are you able to comment on how the new tokenizer was created? That is, were the combined tokens added to the "end" of the GPT tokenizer, or were tokens removed, etc.? How were the new token embeddings initialized? Again, a huge thank you for this amazing service to the open-source ML community!

@J38
Contributor

J38 commented Dec 17, 2022

A brand new tokenizer was trained with 28896 tokens. I'll upload the training script to this repo.

@J38
Contributor

J38 commented Dec 17, 2022

I put this in the tokenize folder. I just ran it on a file with all of the text from the PubMed abstracts.
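Roughly along these lines, i.e. training a byte-level BPE tokenizer on one big text file (a sketch of the idea, not necessarily the exact script; the file name and special tokens are placeholders):

```python
# Sketch: train a new BPE tokenizer from scratch on PubMed abstract text.
# File name, min_frequency, and special tokens are illustrative placeholders.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["pubmed_abstracts.txt"],   # one plain-text file with all abstracts
    vocab_size=28896,                 # vocabulary size mentioned above
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save_model("pubmedgpt_tokenizer")  # writes vocab.json and merges.txt
```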

@J38
Contributor

J38 commented Dec 17, 2022

When you launch pre-training from scratch with the Hugging Face and Composer combination we had, it just randomly initializes the embeddings ...
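Concretely, building the model from a config instead of from_pretrained gives freshly initialized weights, including the token embeddings (a small illustration; everything beyond vocab_size is left at the library defaults):

```python
# Illustration: instantiating from a config (not from_pretrained) means all
# weights, including the token embeddings, start randomly initialized.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(vocab_size=28896)      # new tokenizer's vocabulary size
model = GPT2LMHeadModel(config)            # random init, no pre-trained weights loaded
print(model.transformer.wte.weight.shape)  # torch.Size([28896, 768]) with the default n_embd
```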

@J38
Contributor

J38 commented Dec 17, 2022

@metemadi

metemadi commented Dec 17, 2022

Thank you, thank you! In the blog post you say "PubMedGPT 2.7B was trained on all the PubMed abstracts and full documents from The Pile." So do you start with a pre-trained model (like GPT-Neo 2.7B, which was pre-trained with a different tokenizer and trained on The Pile), then change tokenizers and train again on PubMed, or do you just mix the PubMed data with the Pile data and start the whole thing from scratch? A huge thank you again, this is so cool.

@J38
Contributor

J38 commented Dec 17, 2022

Everything was from scratch. So we trained the tokenizer first. And then pre-trained the model from scratch using the new tokenizer. There is no connection to any other tokenizer or model.
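So the overall shape of the pipeline is: train the tokenizer, build a fresh GPT-2-style model with that vocabulary, and pre-train it with Composer. A rough sketch of that shape (toy data, sizes, and duration; not the actual training script):

```python
# Conceptual sketch only: a GPT-2-style model pre-trained from scratch with
# MosaicML Composer. The dataset, config sizes, and duration are toy placeholders.
import torch
from torch.utils.data import DataLoader, Dataset
from composer import Trainer
from composer.models import HuggingFaceModel
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(vocab_size=28896)              # new tokenizer's vocab size
model = HuggingFaceModel(GPT2LMHeadModel(config))  # randomly initialized, no warm start

class ToyLMDataset(Dataset):
    """Placeholder: random token ids standing in for tokenized PubMed/Pile text."""
    def __len__(self):
        return 8
    def __getitem__(self, i):
        ids = torch.randint(0, 28896, (128,))
        return {"input_ids": ids, "labels": ids}   # causal LM loss computed by the HF model

trainer = Trainer(
    model=model,
    train_dataloader=DataLoader(ToyLMDataset(), batch_size=2),
    max_duration="10ba",                           # toy duration; the real run is far longer
)
trainer.fit()
```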

@shashank140195

Hi @J38.

Any updates on making the pre-training code of BioMedLM public?
