
gpt2-bengali

Currently, there is no GPT-2 model on the Hugging Face Hub that was trained from scratch for Bengali. The goal of this project is to build a strong language-generation model for Bengali using the GPT-2 architecture.

Pretraining Process

First, we create a model repository on the Hugging Face Hub:

huggingface-cli repo create gpt2-bengali

Next we clone the model repository to add the tokenizer and model files.

git clone https://huggingface.co/<your-username>/gpt2-bengali

To ensure that all tensorboard traces will be uploaded correctly, we need to track them. You can run the following command inside your model repo to do so.

cd gpt2-bengali
git lfs track "*tfevents*"

Great, we have set up our model repository. During training, we will automatically push the training logs and model weights to the repo.

Next, let's add a symbolic link to the run_clm_flax.py script from the transformers examples.

export MODEL_DIR="./gpt2-bengali"
ln -s ~/transformers/examples/flax/language-modeling/run_clm_flax.py run_clm_flax.py

Next, we train a Bengali tokenizer following the "Train tokenizer" section of the Flax language-modeling guide, saving it into the model folder.
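The tokenizer step above can be sketched as follows: a byte-level BPE tokenizer with GPT-2's vocabulary size, trained with the `tokenizers` library. In practice you would stream text from the mC4 `bn` split; the tiny inline corpus here is only a stand-in for illustration.

```python
import os

from tokenizers import ByteLevelBPETokenizer

# Stand-in corpus; in the real run this would be the mC4 "bn" split.
corpus = [
    "আমি বাংলায় গান গাই",
    "বাংলা আমার মাতৃভাষা",
    "আমি বাংলায় গান গাই",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=50257,                    # GPT-2's vocabulary size
    min_frequency=2,
    special_tokens=["<|endoftext|>"],    # GPT-2's end-of-text token
)

# Save into the local model repo so run_clm_flax.py can pick it up.
os.makedirs("./gpt2-bengali", exist_ok=True)
tokenizer.save("./gpt2-bengali/tokenizer.json")
```

Byte-level BPE needs no Bengali-specific pre-tokenization, since any Unicode text decomposes into bytes before merges are learned.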

Create configuration

Next, we create the model's configuration file. This is as simple as loading the **gpt2** configuration, disabling dropout for from-scratch pretraining, and storing it in the local model folder:

from transformers import GPT2Config

model_dir = "./gpt2-bengali"  # ${MODEL_DIR}

# Start from the standard gpt2 config, with all dropout disabled.
config = GPT2Config.from_pretrained("gpt2", resid_pdrop=0.0, embd_pdrop=0.0, attn_pdrop=0.0)
config.save_pretrained(model_dir)

Train model

Finally, we run the example script to pretrain the model:

./run_clm_flax.py \
    --output_dir="${MODEL_DIR}" \
    --model_type="gpt2" \
    --config_name="${MODEL_DIR}" \
    --tokenizer_name="${MODEL_DIR}" \
    --dataset_name="mc4" \
    --dataset_config_name="bn" \
    --do_train --do_eval \
    --block_size="512" \
    --per_device_train_batch_size="64" \
    --per_device_eval_batch_size="64" \
    --learning_rate="5e-3" --warmup_steps="1000" \
    --adam_beta1="0.9" --adam_beta2="0.98" --weight_decay="0.01" \
    --overwrite_output_dir \
    --num_train_epochs="20" \
    --logging_steps="500" \
    --save_steps="2500" \
    --eval_steps="2500" \
    --push_to_hub
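The `--block_size` flag controls how the script prepares training examples: tokenized documents are concatenated and cut into fixed-length blocks of 512 tokens, with any remainder dropped. A minimal sketch of that grouping logic (the token ids below are made up for illustration):

```python
# Sketch of the grouping step run_clm_flax.py applies with --block_size:
# concatenate all tokenized examples, then split into fixed-length blocks,
# dropping the trailing remainder that does not fill a full block.
def group_texts(token_id_lists, block_size):
    concatenated = [tid for ids in token_id_lists for tid in ids]
    total = (len(concatenated) // block_size) * block_size  # drop remainder
    return [concatenated[i:i + block_size] for i in range(0, total, block_size)]

# Made-up token ids standing in for tokenized Bengali documents.
examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
print(group_texts(examples, block_size=4))  # → [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Because documents are concatenated before chunking, a block may span document boundaries; the `<|endoftext|>` token marks those boundaries for the model.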
