
Variable batch size and LR scheduler #7020

Closed
wants to merge 3,319 commits

Conversation

@bm-synth (Contributor) commented Feb 8, 2025

Background and rationale

In many use cases, particularly LLMs, one is faced with inputs (sentences) of variable lengths. A common practice is to pack batches by token count rather than by a fixed batch size, i.e. by putting together sentences whose given metric (e.g. sequence length) adds up to a user-provided value. As an example, in Attention is all you need, section 5.1:

Sentence pairs were batched together by approximate sequence length. Each training
batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000
target tokens.

Dynamic batch sizes have been requested in DeepSpeed issue 1051, DeepSpeed issue 3455, PyTorch Lightning issue 16914 and Hugging Face issue 2647, and are already available in many libraries, e.g. NVIDIA Triton and Meta FairSeq (implementation here).

The immediate use case for this is maximizing GPU utilization. Moreover, it is particularly relevant for curriculum learning, where a BxTxE (Batch x Time x Embedding) -shaped input should ideally have high B and low T at the early curriculum steps (many short sentences packed together as a batch), and low B and high T at the late steps (few long sentences in the batch). A dynamic size T is already supported by DeepSpeed, e.g. in the documentation for pipeline parallelism's reset_activation_shape():

For curriculum learning that changes the seqlen of each sample, we need to call this whenever the seqlen is going to change.

However, a dynamic B is not supported. A dynamic B requires a corresponding increase/decrease of the learning rate. This technique has been applied previously, and the two most common LR scaling algorithms are:

  1. Linear Scaling Rule: "When the minibatch size is multiplied by k, multiply the learning rate by k", as in Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, Goyal et al.
  2. Square Root Scaling Rule: "when multiplying the batch size by k, multiply the learning rate by √k, to keep the variance in the gradient expectation constant", as in One weird trick for parallelizing convolutional neural networks, A. Krizhevsky.

In practice, the user picks the total token count per batch as the metric that drives batching, instead of batching by sentence count. During runtime, the variable batch size is computed and the LR is adjusted accordingly, based on the reference LR and batch size provided in the config.
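The two scaling rules can be sketched as follows. This is a minimal illustration of the rules as stated above, not the PR's actual scale_lr implementation; the reference values mirror the lr and train_batch_size fields of the deepspeed config:

```python
import math

def scale_lr(ref_lr, ref_batch_size, batch_size, method="linear"):
    """Scale a reference LR to a (variable) effective batch size.

    Minimal sketch of the two rules above; method=None disables scaling.
    """
    k = batch_size / ref_batch_size
    if method == "linear":   # Goyal et al.: multiply LR by k
        return ref_lr * k
    if method == "sqrt":     # Krizhevsky: multiply LR by sqrt(k)
        return ref_lr * math.sqrt(k)
    return ref_lr            # no scaling

# with a reference lr=1e-3 at train_batch_size=2, a 10-sample batch
# gets lr of roughly 5e-3 under the linear rule (up to float rounding)
scale_lr(1e-3, 2, 10, "linear")
```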

Illustration of dynamic batch size, sequence length and LR

Imagine we picked a limit of 30 tokens per batch, and set a reference lr=1e-3 for train_batch_size=2 (in the deepspeed config). The batching algorithm for curriculum may pack the data into batches of short sentences (left) at the early stages, and batches of long sentences (right) at later stages, e.g.:

[figure: dynamic_batch_size_and_lr]

Above, we collected samples until we filled up the batch with at most 30 tokens. The batch sizes (number of samples) then became 10 and 4 in the left and right examples, respectively. Using the linear scaling rule, the LRs for those batches become 5e-3 and 2e-3.

Pipeline parallelism

Pipeline parallelism requires the same batch size and same sequence length across all micro-batches in a batch, as the activation sizes must be fixed between gradient accumulation steps. Between batches these may change, as long as engine.reset_activation_shape() is called so that the new shapes are communicated on the first gradient accumulation step of the batch. Enforcing a similar BxTxE shape between batches may lead to smaller micro-batches. As an example, below is an illustration of a 2-node, 2-gradient-accumulation-step (i.e. 4 micro-batches) batching of the same dataset, when preparing data for regular DDP (left) and for the pipeline parallelism use case (right):

[figure: dynamic_batch_size_and_lr_microbatching]

We can see that the pipeline use case (right) has the same BxTxE shape across all 4 micro-batches in the same batch, and in order to respect that, it packs fewer samples per batch compared to the standard use case (left-hand side).
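The same-shape requirement can be illustrated with a toy padding step. This is a hedged sketch using plain Python lists and a hypothetical pad_microbatches helper; in the PR, padding is delegated to the user-provided sample_padding_fn:

```python
def pad_microbatches(microbatches, pad_token=0):
    """Pad every sentence in every micro-batch to the longest length found,
    so all micro-batches of a batch share the same T dimension."""
    max_len = max(len(sentence) for mb in microbatches for sentence in mb)
    return [[sentence + [pad_token] * (max_len - len(sentence)) for sentence in mb]
            for mb in microbatches]

# two micro-batches with sentences of lengths 2, 1 and 3 -> all padded to T=3
pad_microbatches([[[1, 2], [3]], [[4, 5, 6]]])
```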

Attention Head

For an input of size BxTxE, the attention mask has a shape of TxT when the mask is fixed across samples of the same size, or BxTxT when each sample has its own mask (when samples have different sizes, as in the dataset above). This 3D attention matrix is illustrated for DDP micro-batch 1 (top-left of the picture above, 4 sentences) as:

[figure: dynamic_batch_size_and_lr_attn_matrix]

Note the memory savings: the attention head has a size of BxTxT, i.e. a linear memory dependency on the batch size B and quadratic memory dependency on the largest sequence length T in the (micro-) batch. Thus, supporting a dynamic size T allows for an increase of B.
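To make that trade-off concrete, here is a small sketch comparing mask element counts; the sequence lengths below are illustrative, not taken from the figure:

```python
def attn_mask_elems(B, T):
    """Element count of a per-sample BxTxT attention mask:
    linear in the batch size B, quadratic in the max sequence length T."""
    return B * T * T

# many short sentences vs. few long ones (illustrative numbers):
attn_mask_elems(10, 3)  # 90 elements
attn_mask_elems(4, 8)   # 256 elements, despite the smaller batch
```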

PR overview

This PR implements dynamic batching and LR scaling. The necessary dataloader and LR scheduler can be retrieved by calling get_dataloader_and_lr_scheduler_for_variable_batch_size. A short explanation of that function follows:

  • The logic behind the LR scaling algorithms is in scale_lr;
  • The partitioning of samples into batches is done by batch_by_seqlen;
  • For pipeline parallelism, all micro-batches in a pipeline pass must have the same activation shapes. This is enabled by setting the following parameters to True:
    • required_microbatches_of_same_sizes, which forces the B dimension to be the same across all gradient accumulation steps of all dataloaders in a batch;
    • required_microbatches_of_same_lengths, which forces the T dimension to be the same across all gradient accumulation steps. It works by calling the user-provided sample_padding_fn(sentence, len) that pads a given sentence to the argument length;
    • batch_by_seqlen returns microbatch_sample_ids (the list of sample ids per micro-batch), batch_sizes (the effective batch sizes) and batch_max_seqlens (the longest sequence across all micro-batches in a batch);
  • dataloader_for_variable_batch_size relies on microbatch_sample_ids and will iterate/collate/pad samples for every batch, returning a dataloader that iterates the final (variable-size) batches;
  • lr_scheduler_for_variable_batch_size relies on batch_sizes to compute the learning rate for each effective batch, taking into account the batch size and LR in the config file, and scaling the LR based on the size of each effective batch and the scaling rule mentioned above (Linear, Square Root, etc.).
    • Note that the returned lr_scheduler will accept either:
      1. a user-provided Optimizer, in which case it scales the learning rates (in param groups) at every batch, or
      2. a user-defined LRScheduler, in which case it first gets the learning rate from the scheduler and then scales it accordingly.
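As a self-contained illustration of the packing logic, the sketch below greedily groups sample indices by token count. It is a simplified stand-in for batch_by_seqlen, whose real implementation also handles sentence ordering, padding and the pipeline constraints above; this helper's name and signature are illustrative:

```python
def pack_by_token_count(seqlens, max_tokens, min_batch_size=1, max_batch_size=None):
    """Greedily group sample indices into micro-batches whose total token
    count stays within max_tokens (simplified sketch, not the PR code)."""
    batches, current, current_tokens = [], [], 0
    for idx, seqlen in enumerate(seqlens):
        too_many_tokens = current_tokens + seqlen > max_tokens
        too_many_samples = max_batch_size is not None and len(current) == max_batch_size
        if current and (too_many_tokens or too_many_samples):
            if len(current) >= min_batch_size:  # drop under-filled packs
                batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += seqlen
    if len(current) >= min_batch_size:
        batches.append(current)
    return batches

# four 3-token sentences with a 6-token budget -> two micro-batches of 2 samples
pack_by_token_count([3, 3, 3, 3], max_tokens=6)  # [[0, 1], [2, 3]]
```

Note that the resulting micro-batches have variable sizes, which is exactly what drives the LR scaling described earlier.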

Example

An example of the use case with and without pipelining is provided in deepspeed/runtime/data_pipeline/data_sampling/variable_batch_size_and_lr_example.py. The example shows an attention head with a variable-sized BxTxT attention mask per batch, followed by a fixed-size feed-forward network. These are the main blocks in a Large Language Model. The feed-forward (or linear) layer that follows the attention head requires a constant input size, equivalent to the largest sentence in the whole dataset, so the output of the attention must be padded (see feedforward: needs to convert BxTxE to BxMxE by padding extra tokens in the code).

The output of the variable_batch_size_and_lr_example is the following:

Config

The example file also annotates the relevant deepspeed config fields with comments:

config = {
  "train_batch_size": 16,
  # `train_micro_batch_size_per_gpu` tells how many sequence packs of `max_tokens` each will be collated together.
  # I.e. the number of tokens per micro-batch (i.e. per gpu iteration) is `train_micro_batch_size_per_gpu`*`max_tokens`.
  "train_micro_batch_size_per_gpu": 2,
  "data_efficiency": {
    "enabled": True,
    # seed to be applied to all data efficiency modules, including dynamic batching
    "seed": 42,
    "data_sampling": {
      "num_workers": 0, # dataloader num_workers argument
      "pin_memory": False,  # dataloader pin_memory argument
      "dynamic_batching": {
        # enables or disables dynamic batching
        "enabled": True,
        # how many tokens we need to fill a pack of sequences (that will be collated together as a sample)
        "max_tokens": 100,
        # Directory to read the length of every sequence from, or to write it to.
        # Sequence lengths will be loaded from: {metrics_path}/seqlen/seqlen_sample_to_metric.bin and *.idx
        # If the files don't exist, they'll be computed and saved on the first run, and loaded on subsequent runs.
        "metrics_path": "./curriculum_output/",
        # As the batch size increases/decreases, which method should be used to scale the LR accordingly?
        # Options: linear, sqrt (square root), or None to disable
        "lr_scaling_method": "linear",
        # how to pick sentences to be packed into samples:
        # - dataloader: by same order as they come in with the dataloader
        # - seqlen: by sequence length (shortest to longest)
        # - random: random order using the seed in config['data_efficiency']['seed']
        "sentence_picking_order": "dataloader",  # "random" / "seqlen" / "dataloader"
        # minimum number of sequences in a pack of `max_tokens`. If a sentence pack is smaller, it's discarded.
        "min_batch_size": 1,
        # maximum number of sequences in a pack of `max_tokens`. If a sentence pack is larger, it's discarded.
        "max_batch_size": 10,
        # enable the output of microbatching information about sentence packing
        "verbose": True,
      },
    },
  },
}
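Given those two fields, the per-micro-batch token budget described in the comments above can be computed directly (a trivial sketch; the helper name is hypothetical):

```python
def tokens_per_micro_batch(config):
    """Token budget per gpu iteration, per the config comments above."""
    dynamic = config["data_efficiency"]["data_sampling"]["dynamic_batching"]
    return config["train_micro_batch_size_per_gpu"] * dynamic["max_tokens"]

cfg = {
    "train_micro_batch_size_per_gpu": 2,
    "data_efficiency": {"data_sampling": {"dynamic_batching": {"max_tokens": 100}}},
}
tokens_per_micro_batch(cfg)  # 200 tokens per micro-batch (per gpu iteration)
```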

Future work

A follow-up PR will enable dynamic batching when calling deepspeed.initialize. I.e. instead of this:

engine, _, _, _ = deepspeed.initialize(config=config, model=model)
dataloader, lr_scheduler, _ = get_dataloader_and_lr_scheduler_for_variable_batch_size_deepspeed(...)
engine.lr_scheduler = lr_scheduler

we'd ideally have this:

engine, _, dataloader, lr_scheduler = deepspeed.initialize(config=config, model=model)

where initialize will call internally get_dataloader_and_lr_scheduler_for_variable_batch_size_deepspeed.

YizhouZ and others added 30 commits February 8, 2025 23:03
We use a simple model + deepspeed zero 3 + torch.compile and count graph break numbers to demonstrate the current status of combining deepspeed + torch.compile.

---------

Co-authored-by: Masahiro Tanaka <[email protected]>
…ch workflow triggers (deepspeedai#6584)

Changes from deepspeedai#6472 caused the no-torch workflow that is an example of
how we build the DeepSpeed release package to fail (so we caught this
before a release, see more in deepspeedai#6402). These changes also copy the style
used to include torch in other accelerator op_builder implementations,
such as npu
[here](https://github.com/microsoft/DeepSpeed/blob/master/op_builder/npu/fused_adam.py#L8)
and hpu
[here](https://github.com/microsoft/DeepSpeed/blob/828ddfbbda2482412fffc89f5fcd3b0d0eba9a62/op_builder/hpu/fused_adam.py#L15).

This also updates the no-torch workflow to run on all changes to the
op_builder directory. The test runs quickly and shouldn't add any
additional testing burden there.

Resolves: deepspeedai#6576
Fixes deepspeedai#6585
Use shell=True for subprocess.check_output() in case of ROCm commands.
Do not use shlex.split() since command string has wildcard expansion.

Signed-off-by: Jagadish Krishnamoorthy <[email protected]>
SD workflow needed updates when we moved to pydantic 2 support that was
never added before.

Passing nv-sd workflow
[here](https://github.com/microsoft/DeepSpeed/actions/runs/11239699283)
Llama3.2-11b and llama3.2-90b including vision model and text model,
these two models have different num_kv_heads, so we need to set
num_kv_heads dynamically.

Co-authored-by: Logan Adams <[email protected]>
Disable `steps_per_print` by default.
This PR addresses deepspeedai#5818.
Instead of contiguous numbers based on the device count, this PR uses
device indices in `--include`.

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
cc @tohtana

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
This PR mainly handles all places where InferenceBuilder is used to access any op or a specific implementation of an op. Instead, an op is defined and its proper implementation is picked internally, so its usage is transparent to the user. What was done in the PR:
1) Added missing ops (added a py file with a fallback mechanism)
2) Added missing fallback implementations for existing ops
3) Removed all usages of builder.load and replaced them with ops instead
4) Added a workspace op and inferenceContext, which contains all workspace-related functions; inferenceContext is the python fallback of inferenceContext in CUDA
5) A small change to the softmax_context signature to fit the fallback signature.

---------

Co-authored-by: Joe Mayer <[email protected]>
Co-authored-by: Lev Kurilenko <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
…eedai#6611)

HF accelerate fixes implemented in
huggingface/accelerate#3145 mean that we no
longer need to pin the Accelerate version!

nv-lightning tests now run on Ubuntu 20.04+, so we support node >16 and can remove the explicit permissions for that in the env config.
Modified _replace_module in auto_tp.py: the modification keeps the layers 'shared_expert_gate' and 'gate' in qwen2-moe as the original type torch.nn.Linear and does not change them into LinearLayer. This way, their weights will not be split across multiple HPU/GPU cards, and qwen2-moe can run on multiple HPU/GPU cards. Since the weights of 'gate' are not split across multiple HPU/GPU cards, all-gather operations are not needed, which may improve performance.

---------

Co-authored-by: Logan Adams <[email protected]>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.15.2
Author           - @jomayeri

Co-authored-by: jomayeri <[email protected]>
Parameters prefetched by ZeRO3 are sometimes not used. This occurs when
the actual sub-module execution differs from previous tracing. As a
result, the state of the allgather handle for such a parameter remains
`INFLIGHT`, causing functions like `empty_partition_cache` to detect it
and throw an error.
This PR resolves the issue by ensuring that communication finishes and
the parameters are freed.

As this issue was mentioned in deepspeedai#6011, this includes the change of the
branch. We need to merge deepspeedai#6011 first.

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Restoring the functionality of the cpu locked tensor in the AIO library.
Make async_io operator available for CPU accelerator, i.e., CPU only
environment.

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
…eepspeedai#6541)

Setting global variables during training creates graph breaks when using torch.compile (reading global variables doesn't). This commit attempts to reduce the setting of global variables in the checkpointing flows.
There are two main uses of setting global variables:
1. Sharing data between functions
2. Establishing that this is the first call to the code

In most cases, the data in the global variables can be computed on demand or set once in an initial state in a configure function.
For the "check that this is the first run" use case, the code was moved to the configure function.

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
This PR adds an API `deepspeed.runtime.zero.offload_states
get_state_devices`, which gets devices of offload states as suggested in
this
[comment](deepspeedai#6011 (comment)).

We could lift this up to `deepspeed.utils` but would need to resolve a
circular import: User code -> `deepspeed.utils` ->
`deepspeed.utils.offload_states` -> `deepspeed.runtime.zero` ->
`deepspeed.runtime.zero.partition_parameters` -> `deepspeed.utils`

This will require a significant refactoring as long as we have
`OffloadStateTypeEnum` in `deepspeed.runtime.zero`.

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Tests with `reuse_dist_env = True` often cause memory leaks. This PR ignores `reuse_dist_env` and forcibly sets it to `False`. This change might slow down the tests, but it is better than having to manually restart runners and relaunch tests.

Memory usages (See deepspeedai#6578):
- `reuse_dist_env == True`:
https://github.com/microsoft/DeepSpeed/actions/runs/11302940871/job/31439471512
- `reuse_dist_env == False`:
https://github.com/microsoft/DeepSpeed/actions/runs/11303250613/job/31440137894
This PR extends deepspeedai#6570 by
showing a breakdown of graph breaks. So we can see how graph breaks are
distributed among different reasons. An example of graph break output
can be seen from the following workflow run
https://github.com/microsoft/DeepSpeed/actions/runs/11199157962

This patch fixes issue deepspeedai#4460.
When `btl_tcp_if_include` option is provided through `--launcher_args`,
we use the provided option instead of the hardcoded `--mca
btl_tcp_if_include eth0`. Otherwise we use `--mca btl_tcp_if_include
eth0` as the default for compatibility.

Fixes deepspeedai#4460

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Some (not all) of the LR schedulers in runtime were missing the
initialization of the optimizer group lr.

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
## Feature

This commit implements the following features:

- [x] support saving checkpoint as safetensors (more commonly used
format)
- [x] support sharding checkpoints (which is important for very large
models)

Most of the codes are borrowed from
https://github.com/huggingface/transformers/blob/v4.45.1/src/transformers/modeling_utils.py#L2490

## Usage

For `pytorch_model.bin` export
```
python zero_to_fp32.py . output_dir/
```

For  `model.safetensors` export
```
python zero_to_fp32.py . output_dir/ --safe_serialization
```

---------

Co-authored-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
…eepspeedai#6496)

Adding an option to disable logger calls while compiling to avoid graph breaks. Here I used an environment variable to determine whether to activate this option, but it can also be determined using the json config file or any other way you see fit.

---------

Co-authored-by: snahir <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
The error in the following log suggests that the cache file for HF model
list can be broken:

https://github.com/microsoft/DeepSpeed/actions/runs/11343665365/job/31546708118?pr=6614

The actual cause of the above error is unclear, but `_hf_model_list`
potentially breaks the cache file when it is concurrently called from
multiple processes. This PR locks the cache file to ensure
`_hf_model_list` safely reads and writes the file.
siqi654321 and others added 11 commits February 20, 2025 20:46
Description
This PR includes Tecorigin SDAA accelerator support.
With this PR, DeepSpeed supports SDAA as a backend for training tasks.

---------

Signed-off-by: siqi <[email protected]>
Co-authored-by: siqi <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
More information on libuv in pytorch:
https://pytorch.org/tutorials/intermediate/TCPStore_libuv_backend.html
Issue tracking the prevalence of the error on Windows (unresolved at the
time of this PR): pytorch/pytorch#139990
LibUV github: https://github.com/libuv/libuv

Windows error:
```
  File "C:\hostedtoolcache\windows\Python\3.12.7\x64\Lib\site-packages\torch\distributed\rendezvous.py", line 189, in _create_c10d_store
    return TCPStore(
           ^^^^^^^^^
RuntimeError: use_libuv was requested but PyTorch was build without libuv support
```

use_libuv isn't well supported on Windows in pytorch <2.4, so we need to
guard around this case.

---------

Signed-off-by: Logan Adams <[email protected]>
@fukun07 and I discovered a bug when using the `offload_states` and
`reload_states` APIs of the Zero3 optimizer. When using grouped
parameters (for example, in weight decay or grouped lr scenarios), the
order of the parameters mapping in `reload_states`
([here](https://github.com/deepspeedai/DeepSpeed/blob/14b3cce4aaedac69120d386953e2b4cae8c2cf2c/deepspeed/runtime/zero/stage3.py#L2953))
does not correspond with the initialization of `self.lp_param_buffer`
([here](https://github.com/deepspeedai/DeepSpeed/blob/14b3cce4aaedac69120d386953e2b4cae8c2cf2c/deepspeed/runtime/zero/stage3.py#L731)),
which leads to misaligned parameter loading. This issue was overlooked
by the corresponding unit tests
([here](https://github.com/deepspeedai/DeepSpeed/blob/master/tests/unit/runtime/zero/test_offload_states.py)),
so we fixed the bug in our PR and added the corresponding unit tests.

---------

Signed-off-by: Wei Wu <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Following changes in PyTorch trace rules, my previous PR to avoid graph breaks caused by the logger is no longer relevant. So instead I've added this functionality to torch dynamo:
pytorch/pytorch@16ea0dd
This commit allows the user to configure torch to ignore logger methods and avoid the associated graph breaks.

To ignore logger methods:
```
os.environ["DISABLE_LOGS_WHILE_COMPILING"] = "1"
```
To ignore logger methods except for specific methods (for example, info and isEnabledFor):
```
os.environ["DISABLE_LOGS_WHILE_COMPILING"] = "1"
os.environ["LOGGER_METHODS_TO_EXCLUDE_FROM_DISABLE"] = "info, isEnabledFor"
```

Signed-off-by: ShellyNR <[email protected]>
Co-authored-by: snahir <[email protected]>
The partition tensor doesn't need to move to the current device when
meta load is used.

Signed-off-by: Lai, Yejing <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
…t` (deepspeedai#7069)

With future changes coming to pip/python/etc., we need to stop calling `python setup.py ...` and replace it:
https://packaging.python.org/en/latest/guides/modernize-setup-py-project/#should-setup-py-be-deleted


![image](https://github.com/user-attachments/assets/ea39ef7b-3cbe-4916-86f0-bc46a5fce96d)

This means we need to install the build package which is added here as
well.

Additionally, we pass the `--sdist` flag to only build the sdist rather
than the wheel as well here.

---------

Signed-off-by: Logan Adams <[email protected]>
Add deepseekv3 autotp.

Signed-off-by: Lai, Yejing <[email protected]>
@loadams (Collaborator) commented Feb 26, 2025

Thanks @bm-synth - can you take a look at the formatting and DCO errors?

Hi @bm-synth - following up on the formatting fixes?

loadams and others added 7 commits February 26, 2025 15:56
Latest transformers causes failures when cpu-torch-latest test, so we
pin it for now to unblock other PRs.

---------

Signed-off-by: Logan Adams <[email protected]>
These jobs haven't been run in a long time and were originally used when
compatibility with torch <2 was more important.

Signed-off-by: Logan Adams <[email protected]>
@bm-synth bm-synth requested a review from hwchen2017 as a code owner March 2, 2025 19:24
@bm-synth (Contributor, Author) commented Mar 2, 2025

@loadams done. Fixed formatting workflow:

  • issues related to formatting fixed in ed1f78938bdca90ab11f8abb0a24c42c8745c6c5;
  • flake8 error W604 (backticks are deprecated, use 'repr()') fixed in f07c3c42caa6be9b53df92c9e6ac49f7eb7a9f5a;
  • check-torchcuda pre-commit hook error related to calling torch.cuda instead of get_accelerator() fixed in b4a08b6c44bb7db8fd8d257564b4ca394232bbf6.

Ran:

```
pip show pre-commit clang-format
pre-commit run --all-files
```

All pre-commit hooks passed.

@bm-synth bm-synth requested a review from jomayeri as a code owner March 2, 2025 19:57
bm-synth added 3 commits March 2, 2025 20:02
…nstead of 'get_accelerator()'

Signed-off-by: Bruno Magalhaes <[email protected]>
…Speed into variable_batch_size_and_lr

Signed-off-by: Bruno Magalhaes <[email protected]>
@bm-synth bm-synth force-pushed the variable_batch_size_and_lr branch from 477733f to 4e20c47 Compare March 2, 2025 20:07
@bm-synth bm-synth closed this Mar 3, 2025
@bm-synth bm-synth force-pushed the variable_batch_size_and_lr branch from 4e20c47 to e08d369 Compare March 3, 2025 02:11
bm-synth added a commit to bm-synth/DeepSpeed that referenced this pull request Mar 3, 2025