Variable batch size and LR scheduler #7020
Closed
bm-synth wants to merge 3,319 commits into deepspeedai:master from bm-synth:variable_batch_size_and_lr
Conversation
We use a simple model + deepspeed zero 3 + torch.compile and count graph break numbers to demonstrate the current status of combining deepspeed + torch.compile. --------- Co-authored-by: Masahiro Tanaka <[email protected]>
…eepspeedai#6583) HF accelerate implemented fixes here: huggingface/accelerate#3131 This means we can revert the changes from deepspeedai#6574
…ch workflow triggers (deepspeedai#6584) Changes from deepspeedai#6472 caused the no-torch workflow that is an example of how we build the DeepSpeed release package to fail (so we caught this before a release, see more in deepspeedai#6402). These changes also copy the style used to include torch in other accelerator op_builder implementations, such as npu [here](https://github.com/microsoft/DeepSpeed/blob/master/op_builder/npu/fused_adam.py#L8) and hpu [here](https://github.com/microsoft/DeepSpeed/blob/828ddfbbda2482412fffc89f5fcd3b0d0eba9a62/op_builder/hpu/fused_adam.py#L15). This also updates the no-torch workflow to run on all changes to the op_builder directory. The test runs quickly and shouldn't add any additional testing burden there. Resolves: deepspeedai#6576
Fixes deepspeedai#6585 Use shell=True for subprocess.check_output() in case of ROCm commands. Do not use shlex.split() since command string has wildcard expansion. Signed-off-by: Jagadish Krishnamoorthy <[email protected]>
Work in progress to ensure we meet SSF best practices: https://www.bestpractices.dev/en/projects/9530
SD workflow needed updates when we moved to pydantic 2 support that was never added before. Passing nv-sd workflow [here](https://github.com/microsoft/DeepSpeed/actions/runs/11239699283)
Llama3.2-11b and Llama3.2-90b include a vision model and a text model; these two models have different num_kv_heads, so we need to set num_kv_heads dynamically. Co-authored-by: Logan Adams <[email protected]>
Disable `steps_per_print` by default.
This PR addresses deepspeedai#5818. Instead of contiguous numbers based on the device count, this PR uses device indices in `--include`. --------- Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Logan Adams <[email protected]>
cc @tohtana Co-authored-by: Logan Adams <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Masahiro Tanaka <[email protected]>
This PR mainly handles all places where InferenceBuilder is used to access any op or a specific implementation for an op. Instead an op is defined, and its proper implementation is picked inside and the usage will be transparent to the user. What was done in the PR: 1) Added missing ops (added a py file with fallback mechanism) 2) Added missing fallback implementations for existing ops 3) removed all usages for builder.load and replaced them with ops instead. 4) added workspace op and inferenceContext which contains all workspace related functions and inferenceContext is the python fallback of inferenceContext in CUDA 5) a small change to softmax_context signature to fit the fallback signature. --------- Co-authored-by: Joe Mayer <[email protected]> Co-authored-by: Lev Kurilenko <[email protected]> Co-authored-by: Logan Adams <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]>
…eedai#6611) HF accelerate fixes implemented in huggingface/accelerate#3145 mean that we no longer need to pin the Accelerate version! nv-lightning tests now run on Ubuntu 20.04+, so we support >node 16, so we can remove the explicit permissions for that in the env config.
Modified _replace_module in auto_tp.py: The modification keeps the layers 'shared_expert_gate' and 'gate' in qwen2-moe as the original type torch.nn.Linear and does not change them into LinearLayer. In this way, their weights will not be split across multiple HPU/GPU cards, so qwen2-moe can run on multiple HPU/GPU cards. Since the weights of 'gate' are not split across multiple HPU/GPU cards, all-gather operations are not needed, which may improve performance. --------- Co-authored-by: Logan Adams <[email protected]>
**Auto-generated PR to update version.txt after a DeepSpeed release** Released version - 0.15.2 Author - @jomayeri Co-authored-by: jomayeri <[email protected]>
Parameters prefetched by ZeRO3 are sometimes not used. This occurs when the actual sub-module execution differs from previous tracing. As a result, the state of the allgather handle for such a parameter remains `INFLIGHT`, causing functions like `empty_partition_cache` to detect it and throw an error. This PR resolves the issue by ensuring that communication finishes and the parameters are freed. As this issue was mentioned in deepspeedai#6011, this includes the change of the branch. We need to merge deepspeedai#6011 first. --------- Co-authored-by: Logan Adams <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]>
Restoring the functionality of the cpu locked tensor in the AIO library. Make async_io operator available for CPU accelerator, i.e., CPU only environment. --------- Co-authored-by: Olatunji Ruwase <[email protected]>
…eepspeedai#6541) Setting global variables during training will create graph breaks when using torch.compile (reading global variables doesn't). This commit attempts to reduce the setting of global variables in the checkpointing flows. There are 2 main uses of setting global variables: 1. Share data between functions 2. Establish that this is the first call to the code. For most of the cases, the data in the global variables can be computed on demand or set once in an initial state in a configure function. For the "check that this is the first run" use case, the code was moved to the configure function. --------- Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Masahiro Tanaka <[email protected]> Co-authored-by: Logan Adams <[email protected]>
This PR adds an API `deepspeed.runtime.zero.offload_states get_state_devices`, which gets devices of offload states as suggested in this [comment](deepspeedai#6011 (comment)). We could lift this up to `deepspeed.utils` but would need to resolve a circular import: User code -> `deepspeed.utils` -> `deepspeed.utils.offload_states` -> `deepspeed.runtime.zero` -> `deepspeed.runtime.zero.partition_parameters` -> `deepspeed.utils` This will require a significant refactoring as long as we have `OffloadStateTypeEnum` in `deepspeed.runtime.zero`. --------- Co-authored-by: Logan Adams <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]>
Tests with `reuse_dist_env = True` often cause memory leaks. This PR ignores `reuse_dist_env` and forcibly sets it to `False`. This change might slow down the tests, but I think it is better than manually restarting runners and relaunching tests. Memory usages (See deepspeedai#6578): - `reuse_dist_env == True`: https://github.com/microsoft/DeepSpeed/actions/runs/11302940871/job/31439471512 - `reuse_dist_env == False`: https://github.com/microsoft/DeepSpeed/actions/runs/11303250613/job/31440137894
This PR extends deepspeedai#6570 by showing a breakdown of graph breaks. So we can see how graph breaks are distributed among different reasons. An example of graph break output can be seen from the following workflow run https://github.com/microsoft/DeepSpeed/actions/runs/11199157962
) This patch fixes issue deepspeedai#4460. When `btl_tcp_if_include` option is provided through `--launcher_args`, we use the provided option instead of the hardcoded `--mca btl_tcp_if_include eth0`. Otherwise we use `--mca btl_tcp_if_include eth0` as the default for compatibility. Fixes deepspeedai#4460 --------- Co-authored-by: Logan Adams <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]>
Some (not all) of the LR schedulers in runtime were missing the initialization of the optimizer group lr. --------- Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Logan Adams <[email protected]>
## Feature This commit implements the following features: - [x] support saving checkpoint as safetensors (more commonly used format) - [x] support sharding checkpoints (which is important for very large models) Most of the codes are borrowed from https://github.com/huggingface/transformers/blob/v4.45.1/src/transformers/modeling_utils.py#L2490 ## Usage For `pytorch_model.bin` export ``` python zero_to_fp32.py . output_dir/ ``` For `model.safetensors` export ``` python zero_to_fp32.py . output_dir/ --safe_serialization ``` --------- Co-authored-by: Masahiro Tanaka <[email protected]> Co-authored-by: Logan Adams <[email protected]>
…eepspeedai#6496) Adding an option to disable logger calls while compiling, to avoid graph breaks. Here I used an environment variable to determine whether to activate this option, but it can also be determined using the JSON config file or any other way you see fit. --------- Co-authored-by: snahir <[email protected]> Co-authored-by: Masahiro Tanaka <[email protected]>
The error in the following log suggests that the cache file for HF model list can be broken: https://github.com/microsoft/DeepSpeed/actions/runs/11343665365/job/31546708118?pr=6614 The actual cause of the above error is unclear, but `_hf_model_list` potentially breaks the cache file when it is concurrently called from multiple processes. This PR locks the cache file to ensure `_hf_model_list` safely reads and writes the file.
Description This PR includes Tecorigin SDAA accelerator support. With this PR, DeepSpeed supports SDAA as backend for training tasks. --------- Signed-off-by: siqi <[email protected]> Co-authored-by: siqi <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Logan Adams <[email protected]>
More information on libuv in pytorch: https://pytorch.org/tutorials/intermediate/TCPStore_libuv_backend.html Issue tracking the prevalence of the error on Windows (unresolved at the time of this PR): pytorch/pytorch#139990 LibUV github: https://github.com/libuv/libuv Windows error: ``` File "C:\hostedtoolcache\windows\Python\3.12.7\x64\Lib\site-packages\torch\distributed\rendezvous.py", line 189, in _create_c10d_store return TCPStore( ^^^^^^^^^ RuntimeError: use_libuv was requested but PyTorch was build without libuv support ``` use_libuv isn't well supported on Windows in pytorch <2.4, so we need to guard around this case. --------- Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: Logan Adams <[email protected]>
@fukun07 and I discovered a bug when using the `offload_states` and `reload_states` APIs of the Zero3 optimizer. When using grouped parameters (for example, in weight decay or grouped lr scenarios), the order of the parameters mapping in `reload_states` ([here](https://github.com/deepspeedai/DeepSpeed/blob/14b3cce4aaedac69120d386953e2b4cae8c2cf2c/deepspeed/runtime/zero/stage3.py#L2953)) does not correspond with the initialization of `self.lp_param_buffer` ([here](https://github.com/deepspeedai/DeepSpeed/blob/14b3cce4aaedac69120d386953e2b4cae8c2cf2c/deepspeed/runtime/zero/stage3.py#L731)), which leads to misaligned parameter loading. This issue was overlooked by the corresponding unit tests ([here](https://github.com/deepspeedai/DeepSpeed/blob/master/tests/unit/runtime/zero/test_offload_states.py)), so we fixed the bug in our PR and added the corresponding unit tests. --------- Signed-off-by: Wei Wu <[email protected]> Co-authored-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Logan Adams <[email protected]>
Following changes in PyTorch trace rules, my previous PR to avoid graph breaks caused by the logger is no longer relevant. So instead I've added this functionality to torch dynamo - pytorch/pytorch@16ea0dd This commit allows the user to configure torch to ignore logger methods and avoid associated graph breaks. To enable ignoring logger methods - os.environ["DISABLE_LOGS_WHILE_COMPILING"] = "1" To ignore logger methods except for a specific method / methods (for example, info and isEnabledFor) - os.environ["DISABLE_LOGS_WHILE_COMPILING"] = "1" and os.environ["LOGGER_METHODS_TO_EXCLUDE_FROM_DISABLE"] = "info, isEnabledFor" Signed-off-by: ShellyNR <[email protected]> Co-authored-by: snahir <[email protected]>
The partition tensor doesn't need to move to the current device when meta load is used. Signed-off-by: Lai, Yejing <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]>
…t` (deepspeedai#7069) With future changes coming to pip/python/etc, we need to modify to no longer call `python setup.py ...` and replace it instead: https://packaging.python.org/en/latest/guides/modernize-setup-py-project/#should-setup-py-be-deleted  This means we need to install the build package which is added here as well. Additionally, we pass the `--sdist` flag to only build the sdist rather than the wheel as well here. --------- Signed-off-by: Logan Adams <[email protected]>
…eepspeedai#7076) This reverts commit 8577bd2. Fixes: deepspeedai#7072
Add deepseekv3 autotp. Signed-off-by: Lai, Yejing <[email protected]>
Fixes: deepspeedai#7082 --------- Signed-off-by: Logan Adams <[email protected]>
Latest transformers causes failures when cpu-torch-latest test, so we pin it for now to unblock other PRs. --------- Signed-off-by: Logan Adams <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
…/runner (deepspeedai#7086) Signed-off-by: Logan Adams <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]>
These jobs haven't been run in a long time and were originally used when compatibility with torch <2 was more important. Signed-off-by: Logan Adams <[email protected]>
@loadams done. Fixed formatting workflow. Ran:
pip show pre-commit clang-format
pre-commit run --all-files
All pre-commit hooks passed.
…W604 backticks are deprecated, use 'repr()'
…nstead of 'get_accelerator()'
…nstead of 'get_accelerator()' Signed-off-by: Bruno Magalhaes <[email protected]>
…Speed into variable_batch_size_and_lr Signed-off-by: Bruno Magalhaes <[email protected]>
477733f to 4e20c47
4e20c47 to e08d369
bm-synth added a commit to bm-synth/DeepSpeed that referenced this pull request on Mar 3, 2025
Signed-off-by: Bruno Magalhaes <[email protected]>
Background and rationale
In many use cases, particularly LLMs, one is faced with inputs (sentences) of variable lengths. A common practice is to pack batches by token count (not a fixed batch size), i.e. by putting together sentences whose given metric (e.g. sequence length) adds up to a user-provided value. As an example, in Attention Is All You Need, section 5.1, sentence pairs are batched together by approximate sequence length, with each training batch containing approximately 25000 source tokens and 25000 target tokens.
Dynamic batch sizes have been requested in DeepSpeed issue 1051, DeepSpeed issue 3455, PyTorch Lightning issue 16914, and Hugging Face issue 2647, and are already available in many libraries, e.g. NVIDIA Triton and Meta FairSeq (implementation here).
The immediate use case for this is when one needs to maximize GPU utilization. Moreover, this is particularly relevant for curriculum learning, where a `BxTxE` (Batch x Time x Embedding) shaped input should ideally have high `B` and low `T` at the early curriculum steps (many short sentences packed together as a batch), and low `B` and high `T` at the late steps (few long sentences in the batch). A dynamic size `T` is already supported by DeepSpeed, e.g. as described in the documentation for pipeline parallelism's `reset_activation_shape()`. However, a dynamic `B` is not supported. A dynamic `B` would require an adequate increase/decrease of the learning rate. This technique has been applied previously, and the two most common LR scaling rules are linear scaling (scale the LR proportionally to the batch size) and square-root scaling (scale the LR with the square root of the batch size). In practice, the user picks the total token count per batch as the metric that drives batching, instead of batching by sentence count. At runtime, the variable batch size is computed and the LR is adjusted accordingly, based on the reference LR and batch size provided in the config.
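To make the scaling rules concrete, below is a minimal sketch of per-batch LR rescaling. It is illustrative only: the helper name and signature are assumptions, not this PR's actual `scale_lr` implementation.

```python
# Illustrative only: rescale a reference LR for the actual (variable) batch size.
# ref_lr and ref_batch_size correspond to the values in the DeepSpeed config;
# batch_size is the number of samples packed into the current batch.
def scale_lr_sketch(ref_lr: float, ref_batch_size: int, batch_size: int,
                    rule: str = "linear") -> float:
    if rule == "linear":   # LR proportional to the batch size
        return ref_lr * batch_size / ref_batch_size
    if rule == "sqrt":     # LR proportional to the square root of the batch size
        return ref_lr * (batch_size / ref_batch_size) ** 0.5
    raise ValueError(f"unknown LR scaling rule: {rule}")
```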
Illustration of dynamic batch size, sequence length and LR
Imagine we picked a limit of 30 tokens per batch, and have set a reference `lr=1e-3` for a `train_batch_size=2` (in the DeepSpeed config). The batching algorithm for curriculum may pack the data into batches of short sentences (left) at the early stages, and batches of long sentences (right) at later stages. In the illustration, we collected samples until we filled up each batch with at most 30 tokens. The batch sizes (number of samples) then became 10 and 4 in the left and right examples, respectively. Using the linear scaling rule, the LRs for those batches become 5e-3 and 2e-3.
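Plugging the numbers from this illustration into the sketch above (reference `lr=1e-3` at `train_batch_size=2`):

```python
scale_lr_sketch(1e-3, 2, 10)  # -> 5e-3 for the batch of 10 short sentences
scale_lr_sketch(1e-3, 2, 4)   # -> 2e-3 for the batch of 4 long sentences
```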
Pipeline parallelism
Pipeline parallelism requires the same batch size and the same sequence length across all micro-batches in a batch, as the activation sizes must be fixed between gradient accumulation steps. Between batches these may change, as long as `engine.reset_activation_shape()` is called so that the new shapes are communicated on the first gradient accumulation step of the batch. Enforcing the same `BxTxE` across all micro-batches of a batch may lead to smaller micro-batches. As an example, consider a 2-node, 2-gradient-accumulation-step (i.e. 4 micro-batches) batching of the same dataset, prepared for regular DDP (left) and for the pipeline-parallelism use case (right): the pipeline use case (right) has the same `BxTxE` shape across all 4 micro-batches in the same batch, and in order to respect that, it packs fewer samples per batch than the standard use case (left-hand side).
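As a rough sketch of where that shape reset sits in a training loop (illustrative only: `engine` is assumed to be a DeepSpeed pipeline engine, and the loop structure is not this PR's code):

```python
# Illustrative only: re-communicate activation shapes whenever the batch's
# BxTxE shape differs from the previous batch's shape.
prev_shape = None
for data_iter, shape in batches:       # shape = (B, T) chosen by the batching algorithm
    if prev_shape is not None and shape != prev_shape:
        engine.reset_activation_shape()
    engine.train_batch(data_iter=data_iter)
    prev_shape = shape
```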
Attention Head
For an input of size `BxTxE`, the attention has a shape of `TxT` for a mask of fixed size across samples of the same length, or `BxTxT` for a different mask per sample (when samples have different sizes, as in the dataset above). This 3D attention matrix can be illustrated for DDP micro-batch 1 (picture above, top-left, 4 sentences). Note the memory savings: the attention head has a size of `BxTxT`, i.e. a linear memory dependency on the batch size `B` and a quadratic memory dependency on the largest sequence length `T` in the (micro-)batch. Thus, supporting a dynamic size `T` allows for an increase of `B`.
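A minimal sketch (assumed helper, not this PR's code) of building such a `BxTxT` mask for variable-length samples padded to the micro-batch's longest length `T`:

```python
import torch

def build_attention_mask(seqlens: list[int]) -> torch.Tensor:
    """Return a BxTxT boolean mask where entry (b, i, j) is True only if both
    token i and token j are real (non-padding) tokens of sample b."""
    T = max(seqlens)                         # pad to the longest sample in the micro-batch
    valid = torch.zeros(len(seqlens), T, dtype=torch.bool)
    for b, length in enumerate(seqlens):
        valid[b, :length] = True
    # broadcast (B, T, 1) & (B, 1, T) -> (B, T, T)
    return valid.unsqueeze(2) & valid.unsqueeze(1)

mask = build_attention_mask([5, 2, 3, 7])    # shape (4, 7, 7)
```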
PR overview
This PR implements dynamic batching and LR scaling. The dataloader and LR scheduler needed for this can be retrieved by calling `get_dataloader_and_lr_scheduler_for_variable_batch_size`. A small explanation of that function follows (a usage sketch is given after this list):
- `scale_lr` scales the learning rate as a function of the effective batch size, following the scaling rules above;
- `batch_by_seqlen` packs samples into micro-batches by token count. When set to `True`, its parameters `required_microbatches_of_same_sizes` and `required_microbatches_of_same_lengths` force, respectively, the `B` dimension and the `T` dimension to be the same across all gradient accumulation steps of all dataloaders in a batch; the latter works by calling the user-provided `sample_padding_fn(sentence, len)` that pads a given sentence to the argument length;
- `batch_by_seqlen` returns `microbatch_sample_ids` (the list of sample ids per micro-batch), `batch_sizes` (the effective batch sizes), and `batch_max_seqlens` (the longest sequence across all micro-batches in a batch);
- `dataloader_for_variable_batch_size` relies on `microbatch_sample_ids` and will iterate/collate/pad samples for every batch, returning a dataloader that iterates the final (variable-size) batches;
- `lr_scheduler_for_variable_batch_size` relies on `batch_sizes` to compute the learning rate for each effective batch, taking into account the batch size and LR in the config file and scaling the LR based on the size of each effective batch, following the scaling rule mentioned above (linear, square root, etc.);
- the returned `lr_scheduler` will accept either an `Optimizer`, whose learning rates (in param groups) it scales at every batch, or an `LRScheduler`, in which case it first gets the learning rate from the scheduler and then scales it accordingly.
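A rough usage sketch follows. The function name comes from this PR, but the argument names and return values shown here are assumptions and may not match the actual signature:

```python
# Illustrative wiring only; dataset, model and optimizer are assumed to exist.
dataloader, lr_scheduler = get_dataloader_and_lr_scheduler_for_variable_batch_size(
    dataset=dataset,              # tokenized sentences of variable length
    max_tokens_per_batch=30,      # token budget that drives the variable batch size
    base_batch_size=2,            # train_batch_size from the DeepSpeed config
    base_lr=1e-3,                 # reference LR from the DeepSpeed config
    lr_scaling_method="linear",   # or square root
    optimizer=optimizer,          # or an existing LRScheduler
)

for batch in dataloader:          # batches have variable size B and seqlen T
    loss = model(batch)
    loss.backward()
    optimizer.step()
    lr_scheduler.step()           # rescales the LR for the next effective batch
    optimizer.zero_grad()
```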
Example
An example of the use case, with and without pipelining, is provided in `deepspeed/runtime/data_pipeline/data_sampling/variable_batch_size_and_lr_example.py`. The example shows an attention head with a variable-sized `BxTxT` attention per batch, followed by a fixed-size feed-forward network; these are the main blocks of a Large Language Model. The feed-forward (or linear) layer that follows the attention head requires a constant input size, equivalent to the largest sentence in the whole dataset, so the output of the attention must be padded (see `feedforward: needs to convert BxTxE to BxMxE by padding extra tokens` in the code).
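A minimal sketch of that padding step (assumed helper; `M` stands for the longest sentence length in the whole dataset, and this is not the example's actual code):

```python
import torch
import torch.nn.functional as F

def pad_to_max_seqlen(x: torch.Tensor, max_seqlen: int) -> torch.Tensor:
    """Pad a BxTxE attention output with zero tokens along T, so the fixed-size
    feed-forward layer always receives a BxMxE input."""
    B, T, E = x.shape
    return F.pad(x, (0, 0, 0, max_seqlen - T))  # (last dim: 0, 0), (T dim: 0, M - T)

out = pad_to_max_seqlen(torch.randn(4, 7, 16), max_seqlen=12)  # shape (4, 12, 16)
```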
The output of `variable_batch_size_and_lr_example` is the following:
Config
The example file also includes the relevant DeepSpeed config, annotated with comments:
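The annotated config block is not reproduced here. As a sketch, the reference fields the mechanism builds on (standard DeepSpeed keys; the dynamic-batching-specific options added by this PR are documented in the example file) look roughly like this, expressed as a Python dict:

```python
# Reference values used in the illustration above; standard DeepSpeed keys only.
# The dynamic-batching options introduced by this PR are not reproduced here.
ds_config = {
    "train_batch_size": 2,            # reference batch size
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 1e-3},       # reference LR, rescaled per effective batch
    },
}
```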
Future work
A follow-up PR will enable dynamic batching when calling `deepspeed.initialize`, i.e. instead of the user retrieving the dataloader and LR scheduler explicitly and passing them in, we'd ideally have `initialize` call `get_dataloader_and_lr_scheduler_for_variable_batch_size_deepspeed` internally. A rough sketch of both styles is given below.
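A rough before/after sketch (the helper names and `deepspeed.initialize` are from this PR; the exact arguments shown are assumptions):

```python
# Today: the user builds the variable-batch dataloader and LR scheduler
# explicitly and passes the scheduler to deepspeed.initialize.
dataloader, lr_scheduler = get_dataloader_and_lr_scheduler_for_variable_batch_size(
    dataset=dataset)  # other (assumed) arguments omitted
engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model, config=ds_config, lr_scheduler=lr_scheduler)

# Follow-up PR (ideal): initialize() reads the dynamic-batching config and calls
# get_dataloader_and_lr_scheduler_for_variable_batch_size_deepspeed internally.
engine, optimizer, dataloader, lr_scheduler = deepspeed.initialize(
    model=model, config=ds_config, training_data=dataset)
```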