
[WIP] max-autotune #2393

Open — krammnic wants to merge 1 commit into base: main from krammnic:max-autotune
Conversation

krammnic (Contributor)

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses.

Changelog

What are the changes made in this PR?

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these, just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • I did not change any public API
  • I have added an example to docs or docstrings

Not for review right now.

pytorch-bot (bot) commented Feb 13, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2393

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Feb 13, 2025. (This label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed.)
@krammnic (Contributor Author)

Not for review right now

@krammnic (Contributor Author)

Qwen2.5 3B full, max-autotune: false, compile: true:

[two screenshots attached]

@krammnic (Contributor Author)

Qwen2.5 3B LoRA, max-autotune: false, compile: true:

[two screenshots attached]

@krammnic (Contributor Author)

Had to add torch.compiler.cudagraph_mark_step_begin() as it failed with a weird error without it.
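For context, a minimal sketch (not the actual torchtune recipe loop; the tiny model and optimizer below are placeholders) of where torch.compiler.cudagraph_mark_step_begin() goes relative to each compiled forward call when mode="max-autotune" enables CUDA graphs:

```python
import torch
import torch.nn as nn

# Illustrative only: a tiny model standing in for the recipe's model.
model = nn.Linear(16, 16).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
compiled = torch.compile(model, mode="max-autotune")

for _ in range(5):
    # Tell CUDA graph trees that a new iteration is starting, so outputs
    # from the previous step may be safely overwritten by this run.
    torch.compiler.cudagraph_mark_step_begin()
    x = torch.randn(8, 16, device="cuda")
    loss = compiled(x).sum()
    loss.backward()
    opt.step()
    opt.zero_grad()
```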

@krammnic (Contributor Author)

krammnic commented Feb 18, 2025

Compilation took ~16 minutes with max-autotune: True, the loss became NaN, and I assume there is no real speedup.

@krammnic (Contributor Author)

krammnic commented Feb 18, 2025

Ah, and all runs failed with:

RuntimeError: These live storage data ptrs are in the cudagraph pool but not accounted for as an output of cudagraph trees:

Data Pointer: 140125209566720, history:

@krammnic (Contributor Author)

krammnic commented Feb 18, 2025

Repro:

First, manually fork torchtune: https://github.com/pytorch/torchtune

Then:
git clone https://github.com/<YOUR_GITHUB_USER>/torchtune.git
cd torchtune
git remote add krammnic https://github.com/krammnic/torchtune.git
git remote add upstream https://github.com/pytorch/torchtune.git
git fetch krammnic
git checkout -b max-autotune krammnic/max-autotune

conda create --name max-autotune python=3.11
conda activate max-autotune

pip3 install --pre --upgrade torch torchvision torchao --index-url https://download.pytorch.org/whl/nightly/cu126
pip3 install -e .
tune download meta-llama/Llama-3.2-1B-Instruct --output-dir /tmp/Llama-3.2-1B-Instruct --ignore-patterns "original/consolidated.00.pth"

tune cp llama3_2/1B_lora_single_device .
CUDA_VISIBLE_DEVICES=0 tune run lora_finetune_single_device --config 1B_lora_single_device.yaml max_autotune=True compile=True
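
For reference, a rough sketch of what the max_autotune config flag is assumed to translate to inside the recipe; the helper name and flag handling below are illustrative, not the actual change in this PR:

```python
import torch

# Assumed wiring of the new flag (illustrative, not the real recipe change):
# max_autotune=True switches the torch.compile mode from "default" to
# "max-autotune", which enables Inductor autotuning and CUDA graphs.
def compile_model(model: torch.nn.Module, max_autotune: bool = False) -> torch.nn.Module:
    mode = "max-autotune" if max_autotune else "default"
    return torch.compile(model, mode=mode)
```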

@krammnic (Contributor Author)

Findings:

  1. max-autotune works only for compiling flex attention (see the sketch after this list).
  2. max-autotune for model compiling without torch.compiler.cudagraph_mark_step_begin():
RuntimeError: Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent run. Stack trace: [Could not find stack trace]. To prevent overwriting, clone the tensor outside of torch.compile() or call torch.compiler.cudagraph_mark_step_begin() before each model invocation.
  3. max-autotune for model compiling with torch.compiler.cudagraph_mark_step_begin(): loss is NaN.
[W&B loss chart, 2/18/2025, attached]
  4. max-autotune for loss + flex compiling (no model) warns:
packages/torch/_inductor/cudagraph_trees.py:2345: UserWarning: Unable to hit fast path of CUDAGraphs because of pending, uninvoked backwards. Consider running with torch.no_grad() or using torch.compiler.cudagraph_mark_step_begin() before each model invocation
Then after 3 steps it fails with:
RuntimeError: These live storage data ptrs are in the cudagraph pool but not accounted for as an output of cudagraph trees:
Data Pointer: 140442444234752, history:
  5. For loss + model + flex compiling: same as 4.
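
A minimal sketch of the one configuration that worked (finding 1): compiling flex attention itself with max-autotune while leaving the rest of the model alone. Shapes and dtypes here are arbitrary, not taken from the Qwen2.5 runs.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# Illustrative only: compile just the flex attention kernel with max-autotune.
flex_attention_compiled = torch.compile(flex_attention, mode="max-autotune")

q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.bfloat16)
out = flex_attention_compiled(q, k, v)  # same call signature as flex_attention
```

If the CUDA graph failures above keep showing up for full-model compilation, mode="max-autotune-no-cudagraphs" keeps the Inductor autotuning but skips CUDA graphs, which may be worth comparing.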
