Please check that this issue hasn't been reported before.
I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
When training completes and I try to run inference with the model, it should load without error.
Current behaviour
The saved model is missing parameters and therefore errors out when loading:
[2024-10-06 21:07:57,939] [ERROR] [axolotl.load_model:808] [PID:45370] [RANK:0] Error(s) in loading state_dict for LlamaForCausalLM:
    size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([131344896]) from checkpoint, the shape in current model is torch.Size([128266, 4096]).
    size mismatch for model.norm.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096]).
    size mismatch for lm_head.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([128266, 4096]).
    You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
Traceback (most recent call last):
  File "/root/axolotl/src/axolotl/utils/models.py", line 710, in load_model
    model = AutoModelLoader.from_pretrained(
  File "/root/.venv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/root/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4014, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/root/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4559, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
    size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([131344896]) from checkpoint, the shape in current model is torch.Size([128266, 4096]).
    size mismatch for model.norm.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096]).
    size mismatch for lm_head.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([128266, 4096]).
    You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
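The bad shapes can be confirmed directly from the checkpoint file without going through `from_pretrained`. A minimal sketch, assuming the weights were saved in safetensors format (the output path below is hypothetical):

```python
# Sketch: print the tensor shapes stored in the saved checkpoint, to see
# which parameters were written out flattened or empty. Assumes safetensors;
# the checkpoint path is a placeholder for the actual training output_dir.
from safetensors import safe_open

ckpt = "./outputs/my-run/model.safetensors"  # hypothetical path
with safe_open(ckpt, framework="pt") as f:
    for name in f.keys():
        # get_slice avoids loading the full tensor just to read its shape
        print(name, f.get_slice(name).get_shape())
```

On this checkpoint I'd expect it to show `model.norm.weight` and `lm_head.weight` with shape `[0]` and a 1-D `model.embed_tokens.weight`, matching the error above.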
Steps to reproduce
Train a model with my config and any pre-tokenized dataset, then try to run inference on the result.
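The failing load step is roughly equivalent to a plain transformers call like the one below (the output directory name is hypothetical):

```python
# Minimal reproduction of the failing load, outside of axolotl.
# "./outputs/my-run" stands in for the actual training output_dir.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("./outputs/my-run")
# RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM: size mismatch ...
```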
Config yaml
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
main
Acknowledgements