GLC: Mixed up ports in UOPS_DISPATCHED.PORT_X event #149

Open
JanLJL opened this issue Mar 6, 2024 · 9 comments

JanLJL commented Mar 6, 2024

I believe there is a mistake in the documentation of the in-core events of SPR: UOPS_DISPATCHED.PORT_2_3_10 and UOPS_DISPATCHED.PORT_5_11 appear to be mixed up. The first should count the uops dispatched on ports 2, 3, and 11, while the second should count the uops dispatched on ports 5 and 10.

Based on the Intel Architectures Optimization Reference, we can see on page 62/63 that port 10 (p10) adds a simple integer ALU while port 11 (p11) is used for loading data and address generation.
The documentation lists an event that counts the load uops on p2 and p3 together with the uops on p10, a port used for integer arithmetic, rather than with the load uops on p11, which made me doubt it.

So I added hardware performance counters (using likwid) to a simple benchmark that measures an ADD on 32-bit general purpose registers, such as add r9d, r10d, which I am sure can run on all ALU ports, i.e., p0, p1, p5, p6, and p10.

I counted the dispatched uops and, as a metric, print each port's share of all dispatched uops as a percentage, so a value of 100 for port 0 would mean that all dispatched uops were dispatched on p0.

Instructions per loop: (32 add + 1 inc + 1 cmp + 1 jl) = 35 instructions (apparently there is no macro fusion happening because of the jl)
We do 1,000,000 iterations --> 35,000,000 uops

+----------------------------------+---------+------------+
|               Event              | Counter | HWThread 0 |
+----------------------------------+---------+------------+
|         INSTR_RETIRED_ANY        |  FIXC0  |   35006880 |
|       CPU_CLK_UNHALTED_CORE      |  FIXC1  |    7049300 |
|       CPU_CLK_UNHALTED_REF       |  FIXC2  |    7048320 |
|    UOPS_DISPATCHED_PORT_PORT_0   |   PMC0  |    6675973 |
|    UOPS_DISPATCHED_PORT_PORT_1   |   PMC1  |    6719374 |
| UOPS_DISPATCHED_PORT_PORT_2_3_10 |   PMC2  |       3076 |
|   UOPS_DISPATCHED_PORT_PORT_4_9  |   PMC3  |       1405 |
|  UOPS_DISPATCHED_PORT_PORT_5_11  |   PMC4  |   13607280 |
|    UOPS_DISPATCHED_PORT_PORT_6   |   PMC5  |    7005328 |
|   UOPS_DISPATCHED_PORT_PORT_7_8  |   PMC6  |       1345 |
+----------------------------------+---------+------------+

+------------------------+------------+
|         Metric         | HWThread 0 |
+------------------------+------------+
|   Runtime (RDTSC) [s]  |     0.0035 |
|  Runtime unhalted [s]  |     0.0035 |
|       Clock [MHz]      |  2000.2744 |
|           CPI          |     0.2014 |
|          Port0         |        100 |
|    Port 0 occupation   |    19.6273 |
|    Port 1 occupation   |    19.7549 |
| Port 2/3/10 occupation |     0.0090 |
|   Port 4/9 occupation  |     0.0041 |
|  Port 5/11 occupation  |    40.0052 |
|    Port 6 occupation   |    20.5956 |
|   Port 7/8 occupation  |     0.0040 |
+------------------------+------------+
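
The derived metrics can be cross-checked against the raw counts. The following is a minimal sketch, assuming each "occupation" value is that counter's count divided by the sum of all seven port counters (an assumption that reproduces the table values) and that CPI is core cycles divided by retired instructions. The variable names are illustrative and not part of likwid.

```python
# Raw counts copied from the likwid event table above.
events = {
    "INSTR_RETIRED_ANY": 35006880,
    "CPU_CLK_UNHALTED_CORE": 7049300,
}
port_uops = {
    "Port 0": 6675973,
    "Port 1": 6719374,
    "Port 2/3/10": 3076,
    "Port 4/9": 1405,
    "Port 5/11": 13607280,
    "Port 6": 7005328,
    "Port 7/8": 1345,
}

# Expected instruction count: 32 add + 1 inc + 1 cmp + 1 jl per iteration.
instructions_per_iteration = 32 + 1 + 1 + 1
iterations = 1_000_000
expected_instructions = instructions_per_iteration * iterations

# Cycles per instruction, as in the CPI row.
cpi = events["CPU_CLK_UNHALTED_CORE"] / events["INSTR_RETIRED_ANY"]

# Assumed definition: each counter's share of all dispatched uops, in percent.
total_uops = sum(port_uops.values())
occupation = {name: 100.0 * count / total_uops for name, count in port_uops.items()}

print(f"expected instructions: {expected_instructions}")
print(f"summed port uops:      {total_uops}")
print(f"CPI: {cpi:.4f}")
for name, share in occupation.items():
    print(f"{name} occupation: {share:.4f}")
```

The expected instruction count is 35,000,000, close to the measured INSTR_RETIRED_ANY, while the seven port counters sum to 34,013,781.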

We can see that p0, p1, and p6 each receive about 20% of the dispatched uops, while p5/11 shows 40%.
Since p11 is used for loads and the benchmark does not load any data, this means either
a) p10 is not used at all, even though it has an ALU, and the instruction is scheduled twice as often on p5, or
b) p10 and p11 are actually swapped and each of the five ALU ports gets 20% of the dispatched uops, which I think is the case and makes more sense.
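
The reasoning behind option b) can be made concrete with a small model. This is a sketch under one stated assumption: the ADD uops are spread evenly over five ALU ports. It compares the counter shares expected when the fifth ALU port is numbered p10 (as in the manual cited above) with those expected when it is numbered p11 (as the event names imply). It does not model option a), and the function and variable names are hypothetical.

```python
from fractions import Fraction

# Counter groups exactly as named by the events in the table above.
COUNTER_PORTS = {
    "PORT_0": {0},
    "PORT_1": {1},
    "PORT_2_3_10": {2, 3, 10},
    "PORT_4_9": {4, 9},
    "PORT_5_11": {5, 11},
    "PORT_6": {6},
    "PORT_7_8": {7, 8},
}


def expected_shares(alu_ports):
    """Share of uops per counter if uops are spread evenly over alu_ports."""
    per_port = Fraction(1, len(alu_ports))
    return {
        name: per_port * len(ports & alu_ports)
        for name, ports in COUNTER_PORTS.items()
    }


# Fifth ALU port numbered as in the manual (p10).
manual_numbering = expected_shares({0, 1, 5, 6, 10})
# Fifth ALU port numbered as the event names imply (p11).
event_numbering = expected_shares({0, 1, 5, 6, 11})

for name in COUNTER_PORTS:
    print(
        f"{name:12s} manual={float(manual_numbering[name]):.0%} "
        f"events={float(event_numbering[name]):.0%}"
    )
```

Under the even-spread assumption, only the second numbering gives 40% on PORT_5_11 and roughly nothing on PORT_2_3_10, which is what the measurement shows.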

Could you please confirm this and, if verified, change the documentation accordingly?

Thanks and best,
Jan

@edwarddavidbaker
Contributor

@JanLJL Thank you for filing a very detailed issue!
@vdaneti Please review the above notes and compare to SPR checkout data.

@edwarddavidbaker
Contributor

@vdaneti Did you receive documentation feedback on GLC ports 10 and 11?

https://cdrdv2.intel.com/v1/dl/getContent/671488 - Figure 2-2 and Table 2-3

edwarddavidbaker changed the title from "Mixed up ports in UOPS_DISPATCHED.PORT_X event" to "GLC: Mixed up ports in UOPS_DISPATCHED.PORT_X event" on May 6, 2024

JanLJL commented Nov 6, 2024

Any updates on this one?

vdaneti commented Nov 6, 2024

@edwarddavidbaker please reference the updated arch doc here

@edwarddavidbaker
Contributor

@boomanaiden154 Thanks for opening a ticket and linking the LLVM issue. We are determining the best method to implement documentation updates. I apologize for the delays.

@boomanaiden154

> We are determining the best method to implement documentation updates. I apologize for the delays.

All good on the timing. Everything is stable on our end, if a bit inconsistent. Given the plan is to update the documentation, it seems like the resolution was that perfmon was correct and the diagrams in the optimization manual need to have ports 10 and 11 swapped?

@edwarddavidbaker
Contributor

> > We are determining the best method to implement documentation updates. I apologize for the delays.
>
> All good on the timing. Everything is stable on our end, if a bit inconsistent. Given the plan is to update the documentation, it seems like the resolution was that perfmon was correct and the diagrams in the optimization manual need to have ports 10 and 11 swapped?

Correct. Ports 10 and 11 need to be swapped in documentation for Golden Cove.
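
As an illustration of what the swap means for the port roles described in the opening post, here is a minimal sketch. The role table is abridged to the ports the post names, and the role strings and helper names are illustrative assumptions, not taken from the manual.

```python
# Port roles as described in the opening post (abridged, illustrative only).
manual_roles = {
    2: "load",
    3: "load",
    5: "integer ALU",
    10: "integer ALU",  # per the manual as read in the opening post
    11: "load",         # per the manual as read in the opening post
}


def swap_ports(roles, a, b):
    """Return a copy of roles with the entries for ports a and b exchanged."""
    swapped = dict(roles)
    swapped[a], swapped[b] = roles[b], roles[a]
    return swapped


def group_by_role(roles):
    """Map each role to the sorted list of ports that have it."""
    groups = {}
    for port in sorted(roles):
        groups.setdefault(roles[port], []).append(port)
    return groups


corrected_roles = swap_ports(manual_roles, 10, 11)
print(group_by_role(corrected_roles))
```

After the swap, the load ports group as 2, 3, 10 and the remaining ALU ports in this abridged table as 5, 11, which lines up with the event names UOPS_DISPATCHED.PORT_2_3_10 and UOPS_DISPATCHED.PORT_5_11.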

boomanaiden154 added a commit to boomanaiden154/llvm-project that referenced this issue Nov 24, 2024
Based on intel/perfmon#149, the documentation is incorrect and the pfm counter
names are actually correct. This patch adjusts the SapphireRapids scheduling
model to match the performance counter naming/ correct naming that will soon be
reflected in the optimization manual.

This fixes part of llvm#117360.

boomanaiden154 added a commit to boomanaiden154/llvm-project that referenced this issue Nov 24, 2024
Based on intel/perfmon#149, the documentation is
incorrect and the pfm counter names are actually correct. This patch adjusts
the Alder Lake scheduling model to match the performance counter naming/
correct naming that will soon be reflected in the optimization manual.

This fixes part of llvm#117360.

boomanaiden154 added a commit to llvm/llvm-project that referenced this issue Nov 25, 2024
Based on intel/perfmon#149, the documentation is incorrect and the pfm
counter names are actually correct. This patch adjusts the
SapphireRapids scheduling model to match the performance counter naming/
correct naming that will soon be reflected in the optimization manual.

This fixes part of #117360.

boomanaiden154 added a commit to llvm/llvm-project that referenced this issue Nov 25, 2024
Based on intel/perfmon#149, the documentation
is incorrect and the pfm counter names are actually correct. This patch
adjusts the Alder Lake scheduling model to match the performance counter
naming/ correct naming that will soon be reflected in the optimization
manual.

This fixes part of #117360.