
Issue with Zero3 Mode and State Dictionary Saving - Related to Issue 1271 #1458

ericzhou571 opened this issue May 24, 2023 · 13 comments

@ericzhou571
Contributor
ericzhou571 commented May 24, 2023

Strongly related to #1271.

Dear Fastchat team,

I hope this message finds you well. I am writing to report an ongoing issue with ZeRO-3 mode and state-dictionary saving in our project. This problem is closely related to the previously closed issue #1271.

In our current implementation, we use the trainer.hf_deepspeed_config_orig.is_zero3() function from the DeepSpeed config object to determine whether the training script is running in ZeRO-3 mode. We also discovered an internal function on the DeepSpeed engine object, _zero3_consolidated_16bit_state_dict(). By calling it, we can gather the full state_dict, which resolves the ZeRO-3 saving issue: we now obtain a .bin file of the expected size and can successfully run apply_lora.py. However, we are not 100% sure that we are really saving the correct LoRA weights.
The full code is shown below.

Before proceeding further, we kindly request your expertise in validating the correctness of our approach, specifically a thorough review of our use of the internal DeepSpeed engine function. As mentioned earlier, our implementation addresses the same problem raised in #1271, which remained unresolved despite the fix code released by the original author.

To maintain the link with the original issue, we have opened this new issue and clearly indicated the connection to #1271. This will allow the community to assess our proposed fix and confirm its effectiveness in resolving the ZeRO-3 state-dictionary saving problem.

We appreciate your attention to this matter and look forward to your guidance and insights. If you require any additional information or code snippets to support the review process, please do not hesitate to let us know.

To facilitate the review, the relevant code snippet is provided below:

    # check if zero3 mode is enabled
    if trainer.hf_deepspeed_config_orig.is_zero3():
        # use a deepspeed engine internal function to gather the state dict
        # state_dict_zero3 contains the full parameters of the base model and lora adapters
        # we do not extract the lora parameters here, since peft's save_pretrained does that
        # https://github.com/huggingface/peft/blob/3714aa2fff158fdfa637b2b65952580801d890b2/src/peft/peft_model.py#L125
        # https://github.com/huggingface/peft/blob/3714aa2fff158fdfa637b2b65952580801d890b2/src/peft/utils/save_and_load.py#L19
        state_dict_zero3 = trainer.model_wrapped._zero3_consolidated_16bit_state_dict()
        if training_args.local_rank == 0:
            state_dict = state_dict_zero3
    else:
        # in other modes we keep the original fastchat code, to keep our change minimal
        state_dict = get_peft_state_maybe_zero_3(
            model.named_parameters(), lora_args.lora_bias
        )

Thank you for your time and support.

Best regards,

@ericzhou571
Contributor Author

We submitted our fix as a pull request: #1457

@wwwadx

wwwadx commented Jun 18, 2023

Hello, I have encountered an error:
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/ubuntu/FastChat/fastchat/train/train_lora.py:201 in │
│ │
│ 198 │
│ 199 │
│ 200 if __name__ == "__main__": │
│ ❱ 201 │ train() │
│ 202 │
│ │
│ /home/ubuntu/FastChat/fastchat/train/train_lora.py:181 in train │
│ │
│ 178 │ trainer.save_state() │
│ 179 │ │
│ 180 │ # check if zero3 mode enabled │
│ ❱ 181 │ if trainer.hf_deepspeed_config_orig.is_zero3(): │
│ 182 │ │ # use deepspeed engine internal function to gather state dict │
│ 183 │ │ # state_dict_zero3 contains whole parameters of base and lora adapters │
│ 184 │ │ # we will not extract lora parameters since peft save_pretrained will do that │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'Trainer' object has no attribute 'hf_deepspeed_config_orig'

Do you know how to fix this?

@zhangmozhe

> Hello, I have encountered an error: [...] AttributeError: 'Trainer' object has no attribute 'hf_deepspeed_config_orig'
> Do you know how to fix this?

The same issue.

@lucasjinreal

Just upgrade to the latest transformers; then it saves correctly for me.

@ericzhou571
Contributor Author

ericzhou571 commented Jun 20, 2023

> Hello, I have encountered an error: [...] AttributeError: 'Trainer' object has no attribute 'hf_deepspeed_config_orig'
> Do you know how to fix this?

Do you launch with deepspeed or with torchrun?

@ericzhou571
Contributor Author

ericzhou571 commented Jun 20, 2023

> Hello, I have encountered an error: [...] AttributeError: 'Trainer' object has no attribute 'hf_deepspeed_config_orig'
> Do you know how to fix this?

Do you use torchrun?

@ericzhou571
Contributor Author

> Just upgrade to latest transformers, I can save correctly

You are using deepspeed, right?

@ericzhou571
Contributor Author

ericzhou571 commented Jun 20, 2023

    # check if zero3 mode is enabled
    if trainer.args.deepspeed and trainer.hf_deepspeed_config_orig.is_zero3():
        # use a deepspeed engine internal function to gather the state dict
        # state_dict_zero3 contains the full parameters of the base model and lora adapters
        # we do not extract the lora parameters here, since peft's save_pretrained does that
        # https://github.com/huggingface/peft/blob/3714aa2fff158fdfa637b2b65952580801d890b2/src/peft/peft_model.py#L125
        # https://github.com/huggingface/peft/blob/3714aa2fff158fdfa637b2b65952580801d890b2/src/peft/utils/save_and_load.py#L19
        state_dict_zero3 = trainer.model_wrapped._zero3_consolidated_16bit_state_dict()
        if training_args.local_rank == 0:
            state_dict = state_dict_zero3
    else:
        # in other modes we keep the original fastchat code, to keep our change minimal
        state_dict = get_peft_state_maybe_zero_3(
            model.named_parameters(), lora_args.lora_bias
        )

This version should work. We first check trainer.args.deepspeed to see whether DeepSpeed is enabled; only when DeepSpeed is enabled does the trainer have the hf_deepspeed_config_orig attribute.
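That short-circuit can be isolated in a small predicate. A minimal sketch, assuming only that `hf_deepspeed_config_orig` is present whenever `args.deepspeed` is set (`is_zero3_run` is a hypothetical name, not a FastChat or transformers API):

```python
def is_zero3_run(trainer):
    # trainer.hf_deepspeed_config_orig only exists when the Trainer was
    # constructed with a DeepSpeed config, so check args.deepspeed first;
    # returning early avoids ever touching the missing attribute.
    if not getattr(trainer.args, "deepspeed", None):
        return False
    return trainer.hf_deepspeed_config_orig.is_zero3()
```

This is exactly why launching with plain torchrun (no `--deepspeed` config) raised the AttributeError: the original condition evaluated `trainer.hf_deepspeed_config_orig` unconditionally.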

@yixuantt

Hello! I am experiencing the same issue as well, which is an AttributeError: 'Trainer' object does not have the attribute 'hf_deepspeed_config_orig'. Any advice on which launcher to use or how to resolve this issue would be greatly appreciated. Thank you! (torchrun or deepspeed?)

@ericzhou571
Contributor Author

> Hello! I am experiencing the same issue as well, which is an AttributeError: 'Trainer' object does not have the attribute 'hf_deepspeed_config_orig'. [...] (torchrun or deepspeed?)

Could you provide your deepspeed, torch, and transformers versions?

@ericzhou571
Contributor Author

Could you also tell me whether you are using the current FastChat main branch, or whether you still face the problem after applying the new code I provided above (#1458 (comment))?

@yixuantt

deepspeed 0.9.5
transformers 4.30.2
torch 11.8
I just use fastchat current main branch.

@ericzhou571
Contributor Author

ericzhou571 commented Jul 19, 2023

> deepspeed 0.9.5, transformers 4.30.2, torch 11.8. I just use fastchat current main branch.

Try replacing the if condition with "if trainer.args.deepspeed and trainer.hf_deepspeed_config_orig.is_zero3():", as in #1458 (comment).

If it is still unclear, have a look at the hotfix pull request I just submitted: #2003
