Hi, I'm running the examples provided in the official GitHub repo.
I simply ran the command "python ilql_sentiments_t5.py".
However, I encountered a runtime error:
[RANK 0] Saving intermediate optimizer & model checkpoint into ckpts/checkpoint_1000
Traceback (most recent call last):
File "/home/user/workspace/trlx/examples/ilql_sentiments_t5.py", line 140, in <module>
main()
File "/home/user/workspace/trlx/examples/ilql_sentiments_t5.py", line 130, in main
trlx.train(
File "/home/user/workspace/trlx/trlx/trlx.py", line 142, in train
trainer.learn()
File "/home/user/workspace/trlx/trlx/trainer/accelerate_base_trainer.py", line 598, in learn
self.save(directory)
File "/home/user/workspace/trlx/trlx/trainer/accelerate_base_trainer.py", line 312, in save
self.accelerator.save_state(dst_dir, **kwargs)
File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/accelerate/accelerator.py", line 2708, in save_state
save_location = save_accelerator_state(
File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/accelerate/checkpointing.py", line 99, in save_accelerator_state
save(state, output_model_file, save_on_each_node=save_on_each_node, safe_serialization=safe_serialization)
File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/accelerate/utils/other.py", line 181, in save
save_func(obj, f)
File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/safetensors/torch.py", line 281, in save_file
serialize_file(_flatten(tensors), filename, metadata=metadata)
File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/safetensors/torch.py", line 467, in _flatten
raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'base_model.lm_head.weight', 'base_model.shared.weight', 'base_model.decoder.embed_tokens.weight', 'base_model.encoder.embed_tokens.weight'}].
A potential way to correctly save your model is to use save_model.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
It seems that saving the model checkpoint is the step that fails.
I have no idea how to fix this. (Maybe I should modify the corresponding part of the trlx source code?)
Thanks for your help!
Which trlX version are you using?
trlx==0.7.0
Additional system and package information
python 3.9.18, transformers 4.36.2, ubuntu 18.04
I have the same issue when running ppo_sentiments.py.
I have an imperfect workaround: I just don't save the optimizer and model during training.
config.train.save_best = False
config.train.save_optimizer = False
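For anyone wondering why the save fails in the first place: the error message lists four parameter names that share memory, meaning several names point at one underlying weight. The sketch below is a library-free illustration of that idea, not the actual safetensors check. `FakeTensor`, `find_shared_groups`, and `drop_aliases` are hypothetical names, plain Python objects stand in for tensors, and `id()` stands in for the storage pointer. The four shared names are copied from the error above; the layer-norm entry is only an illustrative unshared parameter.

```python
from collections import defaultdict


class FakeTensor:
    """Hypothetical stand-in for a tensor; object identity models shared storage."""

    def __init__(self, label):
        self.label = label


def find_shared_groups(state_dict):
    """Return the sets of names whose values are the very same object."""
    groups = defaultdict(set)
    for name, value in state_dict.items():
        groups[id(value)].add(name)
    return [names for names in groups.values() if len(names) > 1]


def drop_aliases(state_dict):
    """Keep only the first name (in dict order) for each shared object."""
    seen = set()
    kept = {}
    for name, value in state_dict.items():
        if id(value) in seen:
            continue  # an alias of something already kept
        seen.add(id(value))
        kept[name] = value
    return kept


# One embedding object registered under four names, as in the error message.
embedding = FakeTensor("embedding")
state_dict = {
    "base_model.shared.weight": embedding,
    "base_model.encoder.embed_tokens.weight": embedding,
    "base_model.decoder.embed_tokens.weight": embedding,
    "base_model.lm_head.weight": embedding,
    # Illustrative unshared parameter, not taken from the error message.
    "base_model.encoder.final_layer_norm.weight": FakeTensor("norm"),
}

shared = find_shared_groups(state_dict)
deduplicated = drop_aliases(state_dict)
```

Here `shared` contains a single group with the four tied names, and `deduplicated` keeps one entry for the embedding plus the unshared parameter. With real tensors, the error message itself suggests using `save_model`, which is probably a better route than deduplicating by hand.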