
Stop forcing approximation of math.erf with --iree-codegen-gpu-native-math-precision #20074

Open

bjacob wants to merge 1 commit into main
Conversation

@bjacob (Contributor) commented Feb 24, 2025

The implementation of the --iree-codegen-gpu-native-math-precision flag had a longstanding bug: both branches (for the option being true or false) performed the polynomial approximation of the math.erf function.

Recent refactorings preserved that behavior bug-for-bug.

Unfortunately, that meant that users passing --iree-codegen-gpu-native-math-precision were not getting the faster native math.erf, even on ROCm, where we enabled the native call by default.

This PR fixes that: math.erf now lowers to the native device-library call when --iree-codegen-gpu-native-math-precision is passed.
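
For illustration, a minimal sketch of the control-flow bug, with hypothetical names (this is not IREE's actual pass code):

```python
# Hypothetical sketch of the bug, not IREE's actual pass code: both branches
# of the flag check produced the polynomial approximation, so the flag had no
# effect on how math.erf was lowered.
def lower_erf(native_math_precision: bool) -> str:
    if native_math_precision:
        # Intended: emit a call to the device library (e.g. __ocml_erf_f16).
        # The bug: this branch returned the polynomial expansion too.
        return "polynomial-approximation"
    return "polynomial-approximation"

# Both flag settings yield the same lowering -- that is the bug.
assert lower_erf(True) == lower_erf(False)
```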

Signed-off-by: Benoit Jacob <[email protected]>
@bjacob bjacob marked this pull request as ready for review February 24, 2025 16:57
@bjacob bjacob requested a review from hanhanW as a code owner February 24, 2025 16:57
@bjacob bjacob requested review from kuhar and Groverkss February 24, 2025 16:57
@kuhar (Member) left a comment


I benchmarked this and confirm it helps with perf

@bjacob (Contributor, Author) commented Feb 24, 2025

Summary of debugging of the CI failures so far:

The difference in IR without/with this PR is exactly as expected: it causes us to use the __ocml_erf_f16 function instead of the polynomial approximation. https://gist.github.com/bjacob/247e46bc587f2c5e089fe67d8897fb49

The implementation of __ocml_erf_f16 is also embedded in the above IR diff. It uses a different approximation, but one that is just as accurate: both are accurate to within 1e-7, so that alone shouldn't explain a numerical difference as large as the one we are seeing here.
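
For reference, a classic polynomial approximation in this family (an example only; not necessarily the exact polynomial either path uses) is Abramowitz & Stegun 7.1.26, whose absolute error is below $1.5 \times 10^{-7}$ when evaluated in exact arithmetic:

$$\operatorname{erf}(x) \approx 1 - \left(a_1 t + a_2 t^2 + a_3 t^3 + a_4 t^4 + a_5 t^5\right) e^{-x^2}, \qquad t = \frac{1}{1 + p x}, \quad x \ge 0,$$

with $p = 0.3275911$, $a_1 = 0.254829592$, $a_2 = -0.284496736$, $a_3 = 1.421413741$, $a_4 = -1.453152027$, $a_5 = 1.061405429$.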

There is however one major difference:

  • Without this PR, we perform the approximation in f16.
  • With this PR, the __ocml_erf_f16 function we call upcasts to f32, performs the approximation in f32, and casts the result back down to f16. This is all visible in the above IR diff.

So I think that the code with this PR is actually more accurate.
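
A runnable sketch of that claim, using the Abramowitz & Stegun polynomial above as a stand-in (an assumption; it is not necessarily the polynomial either implementation uses). Evaluating the polynomial with f16 intermediates loses noticeably more accuracy than computing at higher precision and rounding only the final result to f16:

```python
# Sketch comparing the two evaluation strategies discussed above, using the
# Abramowitz & Stegun 7.1.26 polynomial as a stand-in (an assumption; this is
# not necessarily the polynomial IREE or OCML actually uses).
import numpy as np
from math import erf

P = np.float16(0.3275911)
A = [np.float16(a) for a in
     (0.254829592, -0.284496736, 1.421413741, -1.453152027, 1.061405429)]

def erf_all_f16(x: np.float16) -> np.float16:
    """Polynomial evaluated with every intermediate rounded to f16."""
    t = np.float16(1) / (np.float16(1) + P * x)
    poly = np.float16(0)
    for a in reversed(A):  # Horner evaluation of a1*t + a2*t^2 + ... + a5*t^5
        poly = poly * t + a
    return np.float16(1) - poly * t * np.exp(-(x * x))

def erf_via_f32(x: np.float16) -> np.float16:
    """Upcast, evaluate erf at higher precision, round only the final result
    back to f16 (modeling what __ocml_erf_f16 is described as doing above)."""
    return np.float16(erf(np.float32(x)))

xs = np.linspace(0, 3, 1000, dtype=np.float16)  # polynomial is for x >= 0
err_f16 = max(abs(float(erf_all_f16(x)) - erf(float(x))) for x in xs)
err_f32 = max(abs(float(erf_via_f32(x)) - erf(float(x))) for x in xs)
print(f"max abs error, all-f16 intermediates:  {err_f16:.1e}")
print(f"max abs error, f32 compute, f16 round: {err_f32:.1e}")
```

The round-at-the-end path's error is bounded by half an f16 ULP of the result (about 2.4e-4 for values near 1), while the all-f16 path accumulates rounding error across every intermediate operation.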

It is a little surprising that the end-to-end numerical difference is so large, but we need an assessment against the actual accuracy metric, not these direct output-activation comparisons.

@nithinsubbiah (Collaborator) commented

I verified the numerics with this patch by comparing the image we generate from SDXL, and it looks good. We need to update the SDXL golden values in CI for this change.
