Issue with Zero3 Mode and State Dictionary Saving - Related to Issue 1271 #1458
Comments
We submitted our fix as a pull request: #1457
Hello, I have encountered an error. Do you know how to fix this?
The same issue.
Just upgrade to the latest transformers; I can now save correctly.
Do you use deepspeed, or do you use torchrun?
Do you use torchrun?
You use deepspeed, right?
```python
# check if zero3 mode is enabled
if trainer.args.deepspeed and trainer.hf_deepspeed_config_orig.is_zero3():
    # use the deepspeed engine's internal function to gather the state dict;
    # state_dict_zero3 contains the full parameters of the base model and the LoRA adapters.
    # We do not extract the LoRA parameters here, since peft's save_pretrained will do that:
    # https://github.com/huggingface/peft/blob/3714aa2fff158fdfa637b2b65952580801d890b2/src/peft/peft_model.py#L125
    # https://github.com/huggingface/peft/blob/3714aa2fff158fdfa637b2b65952580801d890b2/src/peft/utils/save_and_load.py#L19
    state_dict_zero3 = trainer.model_wrapped._zero3_consolidated_16bit_state_dict()
    if training_args.local_rank == 0:
        state_dict = state_dict_zero3
else:
    # in other modes we use the original code from the FastChat team, to keep our change minimal
    state_dict = get_peft_state_maybe_zero_3(
        model.named_parameters(), lora_args.lora_bias
    )
```

This one should work. We first use trainer.args.deepspeed to judge whether DeepSpeed is enabled; only when DeepSpeed is enabled will the trainer have the hf_deepspeed_config_orig attribute.
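For completeness, the gathered state_dict is then handed to PEFT's save_pretrained, which extracts only the adapter weights before writing them out. A minimal sketch of that final step, assuming the same variables as in the snippet above (it mirrors the tail of FastChat's train_lora.py, but is not a verbatim copy):

```python
# Sketch: pass the consolidated state_dict to peft's save_pretrained so that
# only the LoRA adapter weights are written to the output directory.
# Assumes `model` is a peft PeftModel and `state_dict` was built as above.
if training_args.local_rank == 0:
    model.save_pretrained(training_args.output_dir, state_dict=state_dict)
```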
Hello! I am experiencing the same issue as well: AttributeError: 'Trainer' object has no attribute 'hf_deepspeed_config_orig'. Any advice on which launcher to use (torchrun or deepspeed?) or how to resolve this issue would be greatly appreciated. Thank you!
Could you provide your deepspeed, torch, and transformers versions?
Could you also tell me whether you are using the current FastChat main branch, or whether you still face the problem after using the new code I provided above: #1458 (comment)?
deepspeed 0.9.5
Try replacing the if condition with "if trainer.args.deepspeed and trainer.hf_deepspeed_config_orig.is_zero3():". If this is still unclear, you can have a look at the hotfix pull request I just submitted: #2003
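For anyone hitting the AttributeError mentioned above, a more defensive variant of that check (a sketch, not necessarily the exact code in #2003) can look the attribute up with getattr, so the fallback path is taken whenever DeepSpeed is not configured:

```python
# Sketch of a guarded ZeRO-3 check; attribute and variable names follow the
# snippets earlier in this thread. getattr avoids the AttributeError when the
# trainer was not launched with a DeepSpeed config.
ds_config = getattr(trainer, "hf_deepspeed_config_orig", None)
if trainer.args.deepspeed and ds_config is not None and ds_config.is_zero3():
    state_dict_zero3 = trainer.model_wrapped._zero3_consolidated_16bit_state_dict()
    if training_args.local_rank == 0:
        state_dict = state_dict_zero3
else:
    state_dict = get_peft_state_maybe_zero_3(
        model.named_parameters(), lora_args.lora_bias
    )
```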
Strongly related to #1271.
Dear Fastchat team,
I hope this message finds you well. I am writing to report an ongoing issue related to the zero3 mode and state dictionary saving in our project. This problem is closely related to the previously closed issue with the identifier #1271.
In our current implementation, we use the trainer.hf_deepspeed_config_orig.is_zero3() function from the DeepSpeed config object to determine whether our training script is operating in ZeRO-3 mode. Additionally, we have discovered an internal function on the DeepSpeed engine object called _zero3_consolidated_16bit_state_dict(). By leveraging this function, we are able to gather the state_dict successfully, which has resolved the issue with ZeRO-3 saving. Consequently, we now obtain a .bin file of the expected size and can successfully run the apply_lora.py script. However, we are not 100% sure that we really save the correct LoRA weights. The whole code snippet is provided at the end of this issue.
Before proceeding further, we kindly request your expertise and assistance in validating the correctness of our approach. Specifically, we would appreciate a thorough review of our use of the internal function from the DeepSpeed engine. As mentioned earlier, our implementation addresses the same problem that was raised in #1271, which remained unresolved despite a fix released by the original author.
To facilitate the review process and maintain the link with the original issue, we have opened this new issue and clearly indicated the connection to #1271. This will allow the community to assess our proposed fix and confirm its effectiveness in resolving the ZeRO-3 mode and state-dictionary saving problem.
We appreciate your attention to this matter and look forward to your guidance and insights. If you require any additional information or code snippets to support the review process, please do not hesitate to let us know.
The relevant code snippet is provided below for reference:
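The snippet below is reconstructed from the version shared in the comments above; the original issue text likely used the plain is_zero3() condition, without the trainer.args.deepspeed guard that the follow-up comments (and hotfix #2003) recommend adding.

```python
# check if zero3 mode is enabled
# (follow-up comments recommend also guarding on `trainer.args.deepspeed`)
if trainer.hf_deepspeed_config_orig.is_zero3():
    # use the deepspeed engine's internal function to gather the state dict;
    # state_dict_zero3 contains the full parameters of the base model and the LoRA adapters.
    # We do not extract the LoRA parameters here, since peft's save_pretrained will do that:
    # https://github.com/huggingface/peft/blob/3714aa2fff158fdfa637b2b65952580801d890b2/src/peft/peft_model.py#L125
    # https://github.com/huggingface/peft/blob/3714aa2fff158fdfa637b2b65952580801d890b2/src/peft/utils/save_and_load.py#L19
    state_dict_zero3 = trainer.model_wrapped._zero3_consolidated_16bit_state_dict()
    if training_args.local_rank == 0:
        state_dict = state_dict_zero3
else:
    # in other modes, keep the original FastChat code so the change stays minimal
    state_dict = get_peft_state_maybe_zero_3(
        model.named_parameters(), lora_args.lora_bias
    )
```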
Thank you for your time and support.
Best regards,