
Issue with Zero3 Mode and State Dictionary Saving - Related to Issue 1271 #1458

ericzhou571 opened this issue May 24, 2023 · 13 comments

@ericzhou571
Contributor
ericzhou571 commented May 24, 2023

Strongly related to #1271.

Dear Fastchat team,

I hope this message finds you well. I am writing to report an ongoing issue with ZeRO-3 mode and state-dictionary saving in our project. This problem is closely related to the previously closed issue #1271.

In our current implementation, we use the trainer.hf_deepspeed_config_orig.is_zero3() function from the DeepSpeed config object to determine whether the training script is running in ZeRO-3 mode. We also discovered an internal function on the DeepSpeed engine object, _zero3_consolidated_16bit_state_dict(). By calling it, we can gather the full state_dict, which resolves the ZeRO-3 saving issue: we now obtain a .bin file of the expected size and can successfully run apply_lora.py. However, we are not 100% sure that we are really saving the correct LoRA weights.
The full code is shown below.

Before proceeding further, we kindly request your expertise in validating the correctness of our approach, specifically a thorough review of our use of the internal DeepSpeed engine function. As mentioned earlier, our implementation addresses the same problem raised in #1271, which remained unresolved despite the fix code released by the original author.

To maintain the link with the original issue, we have opened this new issue and clearly indicated the connection to #1271. This will allow the community to assess our proposed fix and confirm its effectiveness in resolving the ZeRO-3 state-dictionary saving problem.

We appreciate your attention to this matter and look forward to your guidance and insights. If you require any additional information or code snippets to support the review process, please do not hesitate to let us know.

To facilitate the review, the relevant code snippet is provided below:

    # check if zero3 mode is enabled
    if trainer.hf_deepspeed_config_orig.is_zero3():
        # use a deepspeed engine internal function to gather the state dict
        # state_dict_zero3 contains the full parameters of the base model and lora adapters
        # we do not extract the lora parameters here, since peft's save_pretrained does that
        # https://github.com/huggingface/peft/blob/3714aa2fff158fdfa637b2b65952580801d890b2/src/peft/peft_model.py#L125
        # https://github.com/huggingface/peft/blob/3714aa2fff158fdfa637b2b65952580801d890b2/src/peft/utils/save_and_load.py#L19
        state_dict_zero3 = trainer.model_wrapped._zero3_consolidated_16bit_state_dict()
        if training_args.local_rank == 0:
            state_dict = state_dict_zero3
    else:
        # in other modes we keep the original fastchat code, to keep our change minimal
        state_dict = get_peft_state_maybe_zero_3(
            model.named_parameters(), lora_args.lora_bias
        )

Thank you for your time and support.

Best regards,

@ericzhou571
Contributor Author

We submitted our fix as a pull request: #1457

@wwwadx

wwwadx commented Jun 18, 2023

Hello, I have encountered an error:
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/ubuntu/FastChat/fastchat/train/train_lora.py:201 in │
│ │
│ 198 │
│ 199 │
│ 200 if __name__ == "__main__": │
│ ❱ 201 │ train() │
│ 202 │
│ │
│ /home/ubuntu/FastChat/fastchat/train/train_lora.py:181 in train │
│ │
│ 178 │ trainer.save_state() │
│ 179 │ │
│ 180 │ # check if zero3 mode enabled │
│ ❱ 181 │ if trainer.hf_deepspeed_config_orig.is_zero3(): │
│ 182 │ │ # use deepspeed engine internal function to gather state dict │
│ 183 │ │ # state_dict_zero3 contains whole parameters of base and lora adapters │
│ 184 │ │ # we will not extract lora parameters since peft save_pretrained will do that │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'Trainer' object has no attribute 'hf_deepspeed_config_orig'

Do you know how to fix this?

@zhangmozhe

> Hello, I have encountered an error: [...] AttributeError: 'Trainer' object has no attribute 'hf_deepspeed_config_orig'
> Do you know how to fix this?

The same issue.

@lucasjinreal

Just upgrade to the latest transformers; then it saves correctly for me.

@ericzhou571
Contributor Author

ericzhou571 commented Jun 20, 2023

> Hello, I have encountered an error: [...] AttributeError: 'Trainer' object has no attribute 'hf_deepspeed_config_orig'
> Do you know how to fix this?

Do you launch with deepspeed or with torchrun?

@ericzhou571
Contributor Author

ericzhou571 commented Jun 20, 2023

> Hello, I have encountered an error: [...] AttributeError: 'Trainer' object has no attribute 'hf_deepspeed_config_orig'
> Do you know how to fix this?

Do you use torchrun?

@ericzhou571
Contributor Author

> Just upgrade to latest transformers, I can save correctly

You are using deepspeed, right?

@ericzhou571
Contributor Author

ericzhou571 commented Jun 20, 2023

    # check if zero3 mode is enabled
    if trainer.args.deepspeed and trainer.hf_deepspeed_config_orig.is_zero3():
        # use a deepspeed engine internal function to gather the state dict
        # state_dict_zero3 contains the full parameters of the base model and lora adapters
        # we do not extract the lora parameters here, since peft's save_pretrained does that
        # https://github.com/huggingface/peft/blob/3714aa2fff158fdfa637b2b65952580801d890b2/src/peft/peft_model.py#L125
        # https://github.com/huggingface/peft/blob/3714aa2fff158fdfa637b2b65952580801d890b2/src/peft/utils/save_and_load.py#L19
        state_dict_zero3 = trainer.model_wrapped._zero3_consolidated_16bit_state_dict()
        if training_args.local_rank == 0:
            state_dict = state_dict_zero3
    else:
        # in other modes we keep the original fastchat code, to keep our change minimal
        state_dict = get_peft_state_maybe_zero_3(
            model.named_parameters(), lora_args.lora_bias
        )

This version should work. We first check trainer.args.deepspeed to see whether DeepSpeed is enabled; only when DeepSpeed is enabled does the trainer have the hf_deepspeed_config_orig attribute.
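That short-circuit can be isolated in a small predicate. A minimal sketch, assuming only that `hf_deepspeed_config_orig` is present whenever `args.deepspeed` is set (`is_zero3_run` is a hypothetical name, not a FastChat or transformers API):

```python
def is_zero3_run(trainer):
    # trainer.hf_deepspeed_config_orig only exists when the Trainer was
    # constructed with a DeepSpeed config, so check args.deepspeed first;
    # returning early avoids ever touching the missing attribute.
    if not getattr(trainer.args, "deepspeed", None):
        return False
    return trainer.hf_deepspeed_config_orig.is_zero3()
```

This is exactly why launching with plain torchrun (no `--deepspeed` config) raised the AttributeError: the original condition evaluated `trainer.hf_deepspeed_config_orig` unconditionally.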

@yixuantt

Hello! I am experiencing the same issue as well, which is an AttributeError: 'Trainer' object does not have the attribute 'hf_deepspeed_config_orig'. Any advice on which launcher to use or how to resolve this issue would be greatly appreciated. Thank you! (torchrun or deepspeed?)

@ericzhou571
Contributor Author

> Hello! I am experiencing the same issue as well, which is an AttributeError: 'Trainer' object does not have the attribute 'hf_deepspeed_config_orig'. [...] (torchrun or deepspeed?)

Could you provide your deepspeed, torch, and transformers versions?

@ericzhou571
Contributor Author

Could you also tell me whether you are using the current FastChat main branch, or whether you still face the problem after applying the new code I provided above (#1458 (comment))?

@yixuantt

deepspeed 0.9.5
transformers 4.30.2
torch 11.8
I just use fastchat current main branch.

@ericzhou571
Contributor Author

ericzhou571 commented Jul 19, 2023

> deepspeed 0.9.5, transformers 4.30.2, torch 11.8. I just use fastchat current main branch.

Try replacing the if condition with "if trainer.args.deepspeed and trainer.hf_deepspeed_config_orig.is_zero3():", as in #1458 (comment).

If it is still unclear, have a look at the hotfix pull request I just submitted: #2003
