-
Could you please share the full error message and what changes you made to the notebook that resulted in this error? I did a quick test with:

```python
training_args = Seq2SeqTrainingArguments(
    ...
    eval_strategy="steps",
    save_strategy="steps",
    save_steps=5,
    load_best_model_at_end=True,
)

from transformers import EarlyStoppingCallback

trainer = Seq2SeqTrainer(
    ...
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
```

and I didn't get any error.
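For intuition, the patience rule that `EarlyStoppingCallback(early_stopping_patience=2)` applies can be sketched in plain Python: training stops once the tracked metric has failed to improve for `patience` consecutive evaluations. This is an illustrative sketch only; the real callback also honors `early_stopping_threshold` and the trainer's `greater_is_better` setting.

```python
def steps_until_early_stop(eval_losses, patience=2):
    """Return the 1-based index of the evaluation at which training
    would stop, or None if the run completes without stopping.
    Assumes a loss-like metric where lower is better."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses, start=1):
        if loss < best:
            # Improvement: record the new best and reset the counter.
            best = loss
            bad_evals = 0
        else:
            # No improvement at this evaluation.
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None

# Example: loss improves twice, then stalls for two evaluations in a row.
print(steps_until_early_stop([0.9, 0.7, 0.8, 0.75], patience=2))  # -> 4
```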
-
My training args:

```
trainer = Seq2SeqTrainer(
```

This is what I get:

```
  7%|▋ | 725/10000 [11:28:17<52:44:42, 20.47s/it]
Process finished with exit code 0
```
-
I am using this code to fine-tune Whisper on a custom dataset. I was trying to set early_stopping_patience, but I am getting an error when I set load_best_model_at_end=True. What can I do to use early stopping to avoid overfitting?
I am doing this because I tried not setting max_steps at all, but got very bad accuracy. So I started setting max_steps as high as 6000, and the results are amazing, but I am worried about overfitting.
Is there any other workaround for avoiding overfitting in this scenario?
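A common cause of errors with load_best_model_at_end=True is a mismatch between the eval and save strategies, or a missing metric_for_best_model. Below is a minimal config sketch, not a definitive fix: the output_dir is a hypothetical path, the step counts are placeholders, and "wer" as the metric assumes your compute_metrics function returns a "wer" key.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-finetuned",  # hypothetical path
    max_steps=6000,
    eval_strategy="steps",        # must match save_strategy when
    eval_steps=500,               # load_best_model_at_end=True
    save_strategy="steps",
    save_steps=500,               # should be a multiple of eval_steps
    load_best_model_at_end=True,
    metric_for_best_model="wer",  # assumes compute_metrics returns "wer"
    greater_is_better=False,      # lower WER is better
)
```

With a config like this, EarlyStoppingCallback(early_stopping_patience=...) can then be passed in the Seq2SeqTrainer `callbacks` list as in the first comment above.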