GLC: Mixed up ports in UOPS_DISPATCHED.PORT_X event #149
@vdaneti Did you receive documentation feedback on GLC ports 10 and 11? https://cdrdv2.intel.com/v1/dl/getContent/671488 - Figure 2-2 and Table 2-3
Any updates on this one?
@edwarddavidbaker please reference the updated arch doc here
Re-assigning to myself as a reminder to link
@boomanaiden154 Thanks for opening a ticket and linking the LLVM issue. We are determining the best method to implement documentation updates. I apologize for the delays.
All good on the timing. Everything is stable on our end, if a bit inconsistent. Given the plan is to update the documentation, it seems like the resolution was that perfmon was correct and the diagrams in the optimization manual need to have ports 10 and 11 swapped?
Correct. Ports 10 and 11 need to be swapped in documentation for Golden Cove.
Based on intel/perfmon#149, the documentation is incorrect and the pfm counter names are actually correct. This patch adjusts the SapphireRapids scheduling model to match the performance counter naming, which is the correct naming that will soon be reflected in the optimization manual. This fixes part of llvm#117360.
Based on intel/perfmon#149, the documentation is incorrect and the pfm counter names are actually correct. This patch adjusts the Alder Lake scheduling model to match the performance counter naming, which is the correct naming that will soon be reflected in the optimization manual. This fixes part of llvm#117360.
Another mistake is in the Intel Golden Cove instruction throughput/latency document: https://www.intel.com/content/www/us/en/content-details/723498/intel-processors-and-processor-cores-based-on-golden-cove-microarchitecture-instruction-throughput-and-latency.html
I believe there is a mistake in the documentation of the in-core events of SPR: UOPS_DISPATCHED.PORT_2_3_10 and UOPS_DISPATCHED.PORT_5_11 are mixed up. The first one should count the uops dispatched on ports 2, 3, and 11, while the latter should count the uops dispatched on ports 5 and 10.
Based on the Intel Architectures Optimization Reference, we can see on page 62/63 that port 10 (p10) adds a simple integer ALU while port 11 (p11) is used for loading data and address generation.
According to the documentation, there is apparently an event that counts the load uops on p2 and p3 together with the uops of a port used for integer arithmetic (p10), rather than with the load uops on p11. This made me doubt the documented port numbers.
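As a rough illustration of why this grouping looks odd, the sketch below tags each port with the role described in this issue and lists which roles each of the two events would combine. The role labels, the dictionary names, and the helpers swap_ports and roles_per_event are illustrative assumptions, not part of perfmon or likwid, and store ports are left out.

```python
# Port roles as described in this issue (store ports omitted).
# "manual" follows the numbering in the optimization manual's port diagram.
ROLE_BY_PORT_MANUAL = {
    0: "alu", 1: "alu", 5: "alu", 6: "alu", 10: "alu",
    2: "load", 3: "load", 11: "load",
}

# Ports encoded in the two event names discussed here.
EVENT_PORTS = {
    "UOPS_DISPATCHED.PORT_2_3_10": (2, 3, 10),
    "UOPS_DISPATCHED.PORT_5_11": (5, 11),
}


def swap_ports(role_by_port, a, b):
    """Return a copy of the role table with the roles of ports a and b exchanged."""
    swapped = dict(role_by_port)
    swapped[a], swapped[b] = role_by_port[b], role_by_port[a]
    return swapped


def roles_per_event(role_by_port):
    """For each event, collect the set of roles of the ports it counts."""
    return {
        event: {role_by_port[port] for port in ports}
        for event, ports in EVENT_PORTS.items()
    }


# Hypothetical alternative: ports 10 and 11 exchanged relative to the manual.
ROLE_BY_PORT_SWAPPED = swap_ports(ROLE_BY_PORT_MANUAL, 10, 11)

for label, table in (("manual", ROLE_BY_PORT_MANUAL), ("swapped", ROLE_BY_PORT_SWAPPED)):
    for event, roles in roles_per_event(table).items():
        print(label, event, sorted(roles))
```

With the manual's numbering each event mixes load and ALU ports; with ports 10 and 11 exchanged, each event covers a single kind of port.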
So I added hardware performance counters (using likwid) to a simple benchmark code measuring an ADD on 32-bit general purpose registers, such as
    add r9d, r10d
where I am sure it should run on all ALU ports, i.e., p0, p1, p5, p6, and p10.
I counted the dispatched uops and, as a metric, print out the ratio to the overall dispatched uops as a percentage, so 100 for port 0 would mean that all dispatched uops were dispatched on p0.
Instructions per loop: (32 add + 1 inc + 1 cmp + 1 jl) = 35 instructions (apparently there is no macro fusion happening because of the jl)
We do 1,000,000 iterations --> 35,000,000,000 uops
We can see that p0, p1, and p6 each receive 20% of the dispatched uops, while p5/11 shows 40%.
Since p11 is used for loads and we are not loading any data in the benchmark, this either means
a) p10 is not used at all - even though it has an ALU - and that the instruction is scheduled twice as many times on p5, or
b) p10 and p11 should actually be swapped and each of the five ALU ports gets 20% of the dispatched uops, which I think is the case and makes more sense.
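The comparison behind a) and b) can be sketched as a small check: given a distribution of ALU uops over ports, compute the share each counter group would report and compare it with the shares measured above. The group labels, the scenario weights, and the helper names are assumptions made for this illustration; only the 20%/20%/20%/40% figures come from the measurement.

```python
from fractions import Fraction

# Port groups as encoded in the counter names (shorthand labels, not full event names).
COUNTER_GROUPS = {
    "p0": {0},
    "p1": {1},
    "p2/3/10": {2, 3, 10},
    "p5/11": {5, 11},
    "p6": {6},
}

# Shares reported in this issue, as fractions of all dispatched uops.
OBSERVED = {
    "p0": Fraction(1, 5),
    "p1": Fraction(1, 5),
    "p6": Fraction(1, 5),
    "p5/11": Fraction(2, 5),
}


def counter_shares(weights):
    """Turn per-port dispatch weights (keyed by the port number the
    counters use) into the share each counter group would report."""
    total = sum(weights.values())
    return {
        name: Fraction(sum(w for port, w in weights.items() if port in ports), total)
        for name, ports in COUNTER_GROUPS.items()
    }


def matches_observation(weights):
    """True if the predicted shares equal the measured ones on every reported counter."""
    shares = counter_shares(weights)
    return all(shares[name] == value for name, value in OBSERVED.items())


# Hypothetical scenarios; the weights are relative numbers of ALU uops per port.
SCENARIOS = {
    "documented numbering, five ALUs used evenly": {0: 1, 1: 1, 5: 1, 6: 1, 10: 1},
    "a) p10 idle, p5 takes a double share": {0: 1, 1: 1, 5: 2, 6: 1},
    "b) fifth ALU is counted as port 11": {0: 1, 1: 1, 5: 1, 6: 1, 11: 1},
}

for label, weights in SCENARIOS.items():
    print(label, "->", matches_observation(weights))
```

Under these assumptions, an even spread over p0, p1, p5, p6, and p10 with the documented numbering would put 20% on each of p2/3/10 and p5/11, which does not match the measurement. Both a) and b) reproduce the measured shares, so the counters alone cannot separate them; b) is simply the more plausible reading.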
Could you please confirm this and, if verified, change the documentation accordingly?
Thanks and best,
Jan