[WIP] Initial implementation of AMP support #707

Closed

Conversation

@BenjaminBossan (Collaborator) commented Oct 3, 2020

Intro

This feature is about adding support for automatic mixed precision (AMP, solves #611). The current state should already be working but is untested so far; it is still missing docs and tests.

Testing AMP

Unfortunately, I cannot test whether this works or not. Looking at Colab, they mention that one gets one of

> Nvidia K80s, T4s, P4s and P100s

but it seems AMP requires Turing, Volta, or Ampere. Could someone else run the test? I updated the examples/benchmarks/mnist.py script with a new --amp_enabled argument. When running it once with and once without AMP, the speedup for skorch should be similar to the speedup for pure PyTorch (though I'm not sure we can expect a big speedup with the simple architecture being used).

Progress

See here

  • Add a global option on NeuralNet to enable amp (amp_enabled)
  • use autocast inside infer and get_loss method
  • add the scaler to the parameters that need to be persisted
  • use scaler.scale for backward calls (outside of autocast context)
  • use scaler.step(optimizer) and scaler.update() (a minimal sketch of how these pieces fit together follows this list)
  • use scaler.state_dict / scaler.load_state_dict wherever you save/restore checkpoints.
  • possibility to change the used GradScaler and to pass arguments using dunder notation: net.set_params(grad_scaler__growth_factor=4).
  • keep the code backwards compatible with older PyTorch versions, warn users when they enable AMP but it has no effect
  • update the docs
  • add tests
  • possibly adjust GradientNormClipping to work with scaled gradients
  • possibly add a benchmark (similar to or based on this script)
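For reference, here is a minimal sketch of the standard PyTorch AMP pattern (public torch.cuda.amp API only) that the steps above map onto; the model, optimizer, criterion, and loader below are placeholders for illustration, not part of this PR:

import torch
from torch.cuda.amp import GradScaler, autocast

# placeholders standing in for the real module/data used in skorch
model = torch.nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
scaler = GradScaler()
loader = [(torch.randn(32, 10, device='cuda'),
           torch.randint(0, 2, (32,), device='cuda'))] * 10

for X, y in loader:
    optimizer.zero_grad()
    # forward passes run under autocast (infer / get_loss in skorch terms)
    with autocast():
        y_pred = model(X)
        loss = criterion(y_pred, y)
    # backward on the scaled loss, then step and update via the scaler
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()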

Design

My implementation so far uses the most straightforward way of implementing AMP support. However, I want to keep this a draft until we finalize the design. Parts of the code that I don't really like:

  1. There is now a net.grad_scaler_ attribute, analogous to net.module_ etc. But it is None if not amp_enabled, unlike the other such attributes, which are never None. This requires some checks down the line (if val is not None).

  2. Similarly, f_grad_scaler on Checkpoint et al. is also None, unlike the other parameters.

  3. Inside train_step, there is now:

        if not self.amp_enabled:
            self.optimizer_.step(step_fn)
        else:
            step_fn()  # closure not (yet) supported with AMP
            self.grad_scaler_.step(self.optimizer_)
            self.grad_scaler_.update()

        return step_accumulator.get_step()

This is ugly and also requires everyone who overrides train_step to be aware of it. If there is custom code out there that already overrides train_step, it will not work with amp_enabled unless adjusted (in fact, it will fail silently).

  4. Similar argument for get_loss:
        with self.autocast():
            loss = self.criterion_(y_pred, y_true)
        return self.grad_scaler_.scale(loss) if self.amp_enabled else loss

Both get_loss and train_step should ideally be easy to override; this change makes that harder.

  5. When someone overrides train_step_single and no longer uses infer and get_loss, AMP will not be applied correctly.

I believe that none of these issues are showstoppers; AMP support is important enough that we should accept some increased complexity. But maybe we can come up with a superior design that doesn't sacrifice as much. E.g., we could think about using a facade pattern to hide some of the if ... else ugliness above and to avoid having net.grad_scaler_ be None, but that would also make the code more opaque. Anyway, I'm open to suggestions.

@BenjaminBossan BenjaminBossan linked an issue Oct 3, 2020 that may be closed by this pull request
@BenjaminBossan (Collaborator, Author) commented:

Just to give a flavor of what a facade optimizer could look like:

import torch


class AmpOptimizerFacade(torch.optim.Optimizer):
    # Note: deliberately does not call super().__init__(); attribute
    # access falls through to the wrapped optimizer via __getattr__.
    def __init__(self, optimizer, grad_scaler, amp_enabled=True):
        self.optimizer = optimizer
        self.grad_scaler = grad_scaler
        self.amp_enabled = amp_enabled

    def step(self, closure=None):
        if not self.amp_enabled:
            # just use the optimizer as normal
            return self.optimizer.step(closure)

        loss = None
        if closure is not None:
            with torch.enable_grad():
                # The closure is responsible for scaling the loss, no
                # scaling here; the reason is that loss.backward() is
                # called within the closure, at which point the
                # scaling already needs to be applied, so applying it
                # here would be too late.
                loss = closure()

        self.grad_scaler.step(self.optimizer)
        self.grad_scaler.update()
        return loss

    def __getattr__(self, attr):
        # delegate everything else to the wrapped optimizer
        return getattr(self.optimizer, attr)

    def __repr__(self):
        # something useful
        return 'AmpOptimizerFacade({!r})'.format(self.optimizer)

The idea would be to set net.optimizer_ to this object in case AMP is enabled (otherwise leave it as is). Then we can get rid of issue 3 mentioned above. Unfortunately, we cannot perform the loss scaling inside this wrapper (issue 4): loss.backward() is called inside the closure, and the scaling needs to happen before that call, so scaling inside the wrapper would be too late.
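As a rough illustration of that wiring (initialize_optimizer is just a plausible place to do it; amp_enabled and grad_scaler_ are the attributes added in this PR, everything else here is an assumption, not part of the PR):

from skorch import NeuralNet

class AmpNet(NeuralNet):
    def initialize_optimizer(self, *args, **kwargs):
        super().initialize_optimizer(*args, **kwargs)
        if self.amp_enabled:
            # wrap the freshly created optimizer so that train_step can
            # keep calling self.optimizer_.step(step_fn) unchanged
            self.optimizer_ = AmpOptimizerFacade(
                self.optimizer_, self.grad_scaler_, amp_enabled=True)
        return self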

@thomasjpfan (Member) commented:

> There is now a net.grad_scaler_ attribute, analogous to net.module_ etc. But it is None if not amp_enabled, unlike the other such attributes, which are never None. This requires some checks down the line (if val is not None).

We can make the assumption that amp_enabled==False => net.grad_scaler_ is None and use self.amp_enabled everywhere.

Here are some ideas:

  1. We can extend the callback API to have a "before_backward" and "after_backward" callback, something like fastai's MixedPrecision Callback. But this will still require subclasses to trigger the callbacks in the right places. (A rough sketch follows this list.)

  2. We keep the ideas in this PR and document places that need to be adjusted to make amp work for subclasses.

  3. If we want to extend the facade idea, we can also put criterion_ and module_ behind facades that do the correct thing when AMP is activated. This would make things extra opaque, but it may allow subclasses to not worry about AMP and get it "for free".
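To make idea 1 a bit more concrete, here is a rough sketch; the hook names on_backward_begin/on_backward_end and the way the scaled loss would be handed back are pure assumptions, nothing like this exists in skorch today:

from skorch.callbacks import Callback

class AMPCallback(Callback):
    # hypothetical hooks -- the training loop would have to be changed
    # to call them at the right places and to use the returned loss
    def on_backward_begin(self, net, loss, **kwargs):
        # scale the loss so that backward() runs on the scaled value
        return net.grad_scaler_.scale(loss)

    def on_backward_end(self, net, **kwargs):
        # step through the scaler instead of the optimizer directly
        net.grad_scaler_.step(net.optimizer_)
        net.grad_scaler_.update()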

@BenjaminBossan (Collaborator, Author) commented:

Thanks for taking a look, @thomasjpfan.

> We can make the assumption that amp_enabled==False => net.grad_scaler_ is None and use self.amp_enabled everywhere.

That's what I meant by extra checks. It's not the end of the world, but still annoying. Still, it's probably better than setting a dummy object, which might confuse people when they save_params and suddenly there is a grad scaler file despite not using AMP.

>   1. We can extend the callback API to have a "before_backward" and "after_backward" callback... But this will still require subclasses to trigger the callbacks in the right places.

Yes, e.g. when users override train_step_single, they need to remember to call those. Also, I wonder if it's good to "break up" the training loop further and further by invoking more and more methods -- it could make the overall flow hard to understand.

>   2. We keep the ideas in this PR and document places that need to be adjusted to make AMP work for subclasses.

This seems to be the most conservative approach. We'd probably also need an upgrade guide for people who already overrode affected methods, to remind them that certain methods now need to perform extra duty. We could put this into CHANGES.md, but it's hard for me to tell whether that actually gets read.

>   3. If we want to extend the facade idea, we can also facade the criterion_ and module_ to do the correct thing when AMP is activated.

Could you elaborate on this suggestion? What would you change about those?

> something like fastai's MixedPrecision Callback

MixedPrecision seems to predate the PyTorch native AMP support, so instead I looked at NativeMixedPrecision:

https://github.com/fastai/fastai/blob/72590db2e66af6dd0eaa8c8874a80f11a4b8cbc2/fastai/callback/fp16.py#L154-L167

(btw. trying to understand fastai code was as delightful as ever, with those strange decorators, patches everywhere, and lots of import * ^^ )

    def before_batch(self): self.autocast.__enter__()
    def after_loss(self): self.autocast.__exit__()

So in fastai, autocast is entered on_batch_begin and exited after get_loss (skorch nomenclature). This is thus a "broader" approach, i.e. a lot can happen within the autocast context, whereas I chose to apply it only precisely where it's needed (when module_ and criterion_ are called). Which is better? The sketch below shows the two scopes side by side.
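A small sketch to make the difference concrete; module, criterion, and the tensors below are placeholders for illustration only:

import torch
from torch.cuda.amp import autocast

module = torch.nn.Linear(8, 2).cuda()
criterion = torch.nn.CrossEntropyLoss()
X = torch.randn(4, 8, device='cuda')
y = torch.randint(0, 2, (4,), device='cuda')

# narrow scope (this PR): autocast wraps exactly the calls that need it
with autocast():
    y_pred = module(X)
with autocast():
    loss = criterion(y_pred, y)

# broad scope (fastai-style): one context from batch begin until after
# the loss; everything in between also runs under autocast
with autocast():
    y_pred = module(X)
    loss = criterion(y_pred, y)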

Also, I wonder whether fastai's after_loss is invoked after a prediction is made. I tried to understand get_preds but unfortunately failed. If it's not called, that seems to be wrong: autocast is entered when the prediction starts and never exited.

Moreover, I wonder whether a callback-based implementation could lead to bugs introduced by running callbacks in the wrong order. I'd feel more comfortable having more control over that by leaving the logic inside NeuralNet, tbh.

        self.learn._step,self.learn._backward = self._step,self._backward

This looks dangerous to me. self.learn seems to be the trainer object (like NeuralNet) and just overriding methods ad hoc could lead to surprising results.

Finally, I don't see that fastai adjusts their gradient clipping to AMP.

Okay, I'm ranting a bit, but my conclusion for now is that we shouldn't lean too much on the way fastai implements AMP.

Since unscaling is not an idempotent operation, PyTorch raises a RuntimeError if an already unscaled optimizer is unscaled again. Because we could have multiple callbacks or other components that might want to unscale, we have to protect against this possibility. Therefore, we check whether the optimizer has already been unscaled before unscaling.

Unfortunately, we have to use a private attribute in PyTorch for this, so this is prone to break in the future.
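For illustration, such a check might look roughly like this; _per_optimizer_states and OptState are private details of torch.cuda.amp.GradScaler (as of PyTorch 1.6/1.7), which is exactly why this is fragile:

from torch.cuda.amp.grad_scaler import OptState

def optimizer_is_unscaled(grad_scaler, optimizer):
    # relies on GradScaler's private bookkeeping; may break with future
    # PyTorch releases
    state = grad_scaler._per_optimizer_states[id(optimizer)]
    return state["stage"] is OptState.UNSCALED

def unscale_once(grad_scaler, optimizer):
    # unscale_ raises a RuntimeError if called twice for the same step,
    # so only unscale if it hasn't happened yet
    if not optimizer_is_unscaled(grad_scaler, optimizer):
        grad_scaler.unscale_(optimizer)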
@BenjaminBossan BenjaminBossan marked this pull request as ready for review November 21, 2020 15:55
@ottonemo (Member) left a review:

I have spent some time evaluating different options and this one still seems the most appropriate (with the proposed modification). There is no transparent way of implementing this using callbacks, and hiding the optimizer/criterion behind a facade makes the whole process too opaque for my taste.

Inline comment by @ottonemo (Member) on the following lines of the diff:

    else:
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

Since PyTorch XLA requires wrapping the optimizer step as well (and this is a feature we might want to support in the future once TPUs become more accessible for smaller companies), I suggest that we introduce something akin to self.optimizer_step(optimizer), which handles things like AMP scaling and XLA optimizations.
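For illustration, such a method on the net might look roughly like this (optimizer_step is only the suggested name; amp_enabled and grad_scaler_ are the attributes added in this PR, and the body simply factors out the branch currently living in train_step):

def optimizer_step(self, step_fn):
    # single place for anything that wraps the optimizer step
    # (AMP today, possibly XLA's optimizer wrapping later on)
    if not self.amp_enabled:
        return self.optimizer_.step(step_fn)
    step_fn()  # closures are not (yet) supported with AMP
    self.grad_scaler_.step(self.optimizer_)
    self.grad_scaler_.update()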

@BenjaminBossan (Collaborator, Author) replied:

I'd like to give this a try in the context of this PR to see if it makes things more ergonomic.

@BenjaminBossan (Collaborator, Author) commented:

Closed in favor of #826

Linked issue: Native automatic mixed precision for Skorch (#611)