Currently, there is no GPT2 model on the Hugging Face Hub that was trained from scratch for Bengali. The goal of this project is to create a strong language generation model for Bengali using the GPT2 architecture. To set up the relevant files, we first create a model repository on the Hub:
huggingface-cli repo create gpt2-bengali
Next, we clone the model repository to add the tokenizer and model files:
git clone https://huggingface.co/<your-username>/gpt2-bengali
To ensure that all tensorboard traces will be uploaded correctly, we need to track them with git-lfs. You can run the following command inside your model repo to do so:
cd gpt2-bengali
git lfs track "*tfevents*"
Great, we have set up our model repository. During training, we will automatically push the training logs and model weights to the repo.
Next, let's set the model directory and add a symbolic link to the run_clm_flax.py script from the transformers repository:
export MODEL_DIR="./gpt2-bengali"
ln -s ~/transformers/examples/flax/language-modeling/run_clm_flax.py run_clm_flax.py
Next, we'll follow the same steps described in the Train tokenizer section above to train the tokenizer.
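For completeness, here is a minimal sketch of that step. It assumes we train a byte-level BPE tokenizer (the tokenizer type GPT2 uses) on the Bengali split of mC4, the same dataset used for pretraining below; the batch size, min_frequency, and vocabulary size are illustrative choices rather than fixed requirements.

from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

model_dir = "./gpt2-bengali"  # ${MODEL_DIR}

# Load the Bengali split of mC4 (the same dataset used for pretraining below).
dataset = load_dataset("mc4", "bn", split="train")

# GPT2 uses a byte-level BPE tokenizer.
tokenizer = ByteLevelBPETokenizer()

def batch_iterator(batch_size=1000):
    # Yield batches of raw text for tokenizer training.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Train the tokenizer with GPT2's vocabulary size and save it in the model repo,
# so that --tokenizer_name="${MODEL_DIR}" will pick it up during pretraining.
tokenizer.train_from_iterator(
    batch_iterator(),
    vocab_size=50257,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save(f"{model_dir}/tokenizer.json")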
Next, we create the model's configuration file. This is as simple as loading and storing the config of **gpt2** in the local model folder:
from transformers import GPT2Config

model_dir = "./gpt2-bengali"  # ${MODEL_DIR}

# Reuse the standard gpt2 architecture, but set all dropout probabilities to 0.0,
# and save the config to the model repo so the training script can load it.
config = GPT2Config.from_pretrained("gpt2", resid_pdrop=0.0, embd_pdrop=0.0, attn_pdrop=0.0)
config.save_pretrained(model_dir)
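As an optional sanity check (not part of the original recipe), we can verify that both the config and the tokenizer load correctly from the model directory before launching training, since run_clm_flax.py loads them in the same way:

from transformers import AutoConfig, AutoTokenizer

model_dir = "./gpt2-bengali"  # ${MODEL_DIR}

# run_clm_flax.py loads both objects via the Auto classes, so if this works,
# the training script will find them as well.
config = AutoConfig.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

print(config.model_type)                 # gpt2
print(tokenizer("বাংলা ভাষা").input_ids)  # token ids for a short Bengali phrase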
Next we can run the example script to pretrain the model:
./run_clm_flax.py \
--output_dir="${MODEL_DIR}" \
--model_type="gpt2" \
--config_name="${MODEL_DIR}" \
--tokenizer_name="${MODEL_DIR}" \
--dataset_name="mc4" \
--dataset_config_name="bn" \
--do_train --do_eval \
--block_size="512" \
--per_device_train_batch_size="64" \
--per_device_eval_batch_size="64" \
--learning_rate="5e-3" --warmup_steps="1000" \
--adam_beta1="0.9" --adam_beta2="0.98" --weight_decay="0.01" \
--overwrite_output_dir \
--num_train_epochs="20" \
--logging_steps="500" \
--save_steps="2500" \
--eval_steps="2500" \
--push_to_hub
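Once training has finished (or from a checkpoint pushed to the Hub), the model can be used for text generation. The snippet below is a minimal sketch: the prompt, sampling parameters, and maximum length are placeholders, not tuned values.

from transformers import AutoTokenizer, FlaxGPT2LMHeadModel

model_dir = "./gpt2-bengali"  # or "<your-username>/gpt2-bengali" once pushed to the Hub

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = FlaxGPT2LMHeadModel.from_pretrained(model_dir)

# Encode a short Bengali prompt (placeholder) and sample a continuation.
inputs = tokenizer("বাংলা", return_tensors="np")
outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_length=50,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs.sequences[0].tolist(), skip_special_tokens=True))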