[BUG] AMD family 25 model 7 wrong regwidth #662

Open
brinkcoder opened this issue Feb 4, 2025 · 5 comments

@brinkcoder

We use AMD EPYC 9254 and 9454 in our cluster (both are Family 25, Model 15, Zen4). In power.c, for all ZEN4_EPYC processors the code sets

power_info.statusRegWidth = 64;

However, at least these two models only have a register width of 32. As a result, we observe a counter wrap every few hours, and the wrap value calculated with 64 bits yields incorrect power results.
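To illustrate why the assumed register width matters, here is a minimal sketch of a wrap-aware difference between two counter readings. The helper `counter_delta` is hypothetical and is not taken from LIKWID's power.c; it only assumes that the counter wraps at most once between the two readings.

```c
#include <stdint.h>

/* Hypothetical helper, NOT LIKWID code: difference between two readings of a
 * counter that is `width` bits wide, assuming at most one wrap-around
 * happened between `start` and `stop`. */
static uint64_t counter_delta(uint64_t start, uint64_t stop, unsigned width)
{
    /* All-ones mask covering the low `width` bits; avoid shifting by 64. */
    uint64_t mask = (width >= 64) ? UINT64_MAX
                                  : ((UINT64_C(1) << width) - 1);

    /* Unsigned subtraction is modular, so masking the result to the counter
     * width yields the correct distance even when stop < start. */
    return (stop - start) & mask;
}
```

With readings `0xFFFFFF00` and `0x10`, a width of 32 gives a delta of 272, while a width of 64 gives a value close to 2^64. This is the kind of wrong result one would expect when the assumed width is larger than the width at which the values actually wrap.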

I’m not sure how to best distinguish these processors in the code. A patch to correctly set the status register width to 32 for these models would be appreciated.

@brinkcoder brinkcoder added the bug label Feb 4, 2025
@TomTheBear
Member

Can you provide links to the official docs for these chips?

Please attach the output of /proc/cpuinfo of one CPU core per system so we can see how to differentiate the different types of Zen4.

@brinkcoder
Author

Ok, this is super weird. The official document says it is a 64 bit register (page 253):
55901_B2_pub_1.pdf
The cpuinfos:
cpuinfo_9254.txt
cpuinfo_9454.txt

But I added a function to perfmon.c so that I can see the raw values in likwidMetric.go:
diff_perfmon.c.txt
diff_likwidMetric.go.txt

You can see a wrap on thread 6 at 32 bit:
raw_counters.txt

[root@cpu001 cc-metric-collector]# grep "thread 6" raw_counters.txt
Raw counter value for event PWR0 on thread 6: 4289021270.000000
Raw counter value for event PWR0 on thread 6: 1696099.000000

I have no idea how this can happen.

@TomTheBear
Member

It might be a documentation issue. Unfortunately, there is no other setting visible in the Linux kernel sources.

@brinkcoder
Author

I found the issue. statusRegWidth is correctly set to 64 for our ZEN4_EPYC.

The underlying issue is in likwid.h, lines 1694-1698, in the typedef for PowerData: how could a 64-bit counter be stored in a uint32_t?
That's why I observed a wrap-around at 32 bits every few hours. I changed it to uint64_t, and the problem is solved.

I'm not sure if it should generally be set to uint64_t or if you'd prefer using an ifdef (if that's even possible, as I'm not very familiar with C).
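The truncation can be reproduced in isolation. The sketch below uses hypothetical helpers `store_narrow` and `store_wide` as stand-ins for a 32-bit and a 64-bit storage field; they are not the actual PowerData definition from likwid.h.

```c
#include <stdint.h>

/* Hypothetical stand-ins, NOT the actual LIKWID PowerData typedef. */

/* Storing a 64-bit raw reading in a 32-bit field: the conversion to an
 * unsigned 32-bit type is defined as reduction modulo 2^32, so only the
 * low 32 bits survive. */
static uint32_t store_narrow(uint64_t raw)
{
    return (uint32_t)raw;
}

/* Storing the same reading in a 64-bit field keeps every bit. */
static uint64_t store_wide(uint64_t raw)
{
    return raw;
}
```

For example, `store_narrow(UINT64_C(0x100000010))` yields `0x10`, so a counter that has just passed 2^32 looks as if it had wrapped, even though the hardware register is 64 bits wide. A `uint64_t` field holds both 32-bit and 64-bit register values without loss, so an ifdef would probably not be needed for correctness.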

@TomTheBear
Member

Good find. It shouldn't be a big deal to update the PowerData to 64 bit.
