Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Defer _run substr operation in StateMachine::ProcessString #10471

Merged
merged 7 commits into from
Jun 29, 2021

Conversation

skyline75489
Copy link
Collaborator

@skyline75489 skyline75489 commented Jun 21, 2021

Do not eagerly create run by substring to reduce the unnecessary overhead.

Switching from .substr to offset/length tracking saves us 7% CPU time.

@skyline75489
Copy link
Collaborator Author

This saves ~7% of CPU time

Before:

image

After:

image

@skyline75489 skyline75489 added the Area-Performance Performance-related issue label Jun 21, 2021
while (current < string.size())
{
// The run will be everything from the start INCLUDING the current one
// in case we process the current character and it turns into a passthrough
// fallback that picks up this _run inside `FlushToTerminal` above.
_run = string.substr(start, current - start + 1);
Copy link
Collaborator Author

@skyline75489 skyline75489 Jun 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repeatedly substring turns out to be quite expensive and takes up about 10%-12% of the total CPU time.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly forgot whether the 10% here is in OpenConsole or WT, or both...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like you should leave the comment/detail somewhere in here that we shouldn't get the future-bright-idea to "improve safety" here by re-wrapping this into a convenience class. It needs to stay as a raw sort of offset thing because of how high performance this needs to be.

@skyline75489 skyline75489 changed the title Defer _run substr operation in ProcessString Defer _run substr operation in StateMachine::ProcessString Jun 21, 2021
@@ -1838,12 +1839,17 @@ void StateMachine::ProcessString(const std::wstring_view string)
size_t start = 0;
size_t current = start;

_currentString = string;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I need to reset these in _EnterGround ? Feels like I should.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this any more. Because ResetState() is also called inside ProcessString.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you were saying, but I agree this appears as a valid exception.

Copy link
Member

@lhecker lhecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can make _run a local variable inside StateMachine::ProcessString.
This removes the _run member entirely and removes the redundant state from this class. Can't forget to call _ActuateCurrentRun, if there's nothing to forget, right?

@skyline75489
Copy link
Collaborator Author

Thanks @lhecker for the wonderful idea!

Note that StateMachine::ProcessString is used heavily in both OpenConsole & WT. I'd expect noticeable performance
improvement in the official build, which is PGO enhanced. My local build isn't with PGO, so the improvement is limited.

@@ -1885,14 +1891,23 @@ void StateMachine::ProcessString(const std::wstring_view string)
// When we leave the loop, current has been advanced to the length of the string itself
// (or one past the array index to the final char) so this `substr` operation doesn't +1
// to include the final character (unlike the one inside the top of the loop above.)
_run = start < string.size() ? string.substr(start) : std::wstring_view{};
if (start < string.size())
Copy link
Collaborator Author

@skyline75489 skyline75489 Jun 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand this piece of comment above, be it before this PR or after. Still the logic here remains intact.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's saying that on line 1846, a .substr is called with the 2nd parameter having a +1 on the end as a magic value that helps select the correct distance.

So one might quickly think that this similar .substr call on line 1888 should also have a +1 magic value and there's an error. But then it goes on to clarify that the nature of the looping made the magic +1 unnecessary.

Or at least that's my understanding of the comment. It looks like this code may have been revised a few times since it was originally written.

@skyline75489
Copy link
Collaborator Author

skyline75489 commented Jun 26, 2021

Strange. I can't repro the feature tests failure locally. Maybe it's transient? Also the x86 tests are OK.

@skyline75489
Copy link
Collaborator Author

OK the test failures seems to be transient. Merged main and now the CI is green.

Copy link
Member

@lhecker lhecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one comment.

Copy link
Member

@DHowett DHowett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A string_view is just a pointer and a size; can you go into some detail on why exactly this is so much more performant? Is it just the repeated bounds checks in substr that the language standard requires? There must be a better way to deal with this than to move back to counts and offsets...

@lhecker
Copy link
Member

lhecker commented Jun 27, 2021

@DHowett This is a CPU trace of OpenConsole when running termbench:
Screenshot 2021-06-28 003327

I've benchmarked this patch on my experimental, fast glyph-atlas renderer and it improves VT performance from 490 kcg/s up to 550 kcg/s, or a whopping +12% just like that!
(To anyone reading this: Yes, 550 kcg/s isn't comparatively fast, but the cause for this is simply the mere lack of proper double buffering.)

@DHowett
Copy link
Member

DHowett commented Jun 27, 2021

Right, but substr is a simple operation. What's inside it that's so expensive?

@skyline75489
Copy link
Collaborator Author

That single line is combination of: substr, constructing a string_view, and copying a string_view into _run. You want me to find a winner among these three? 😅

Talked with Leonard. He thinks it’s substr with all the internal checks for boundaries and everything.

@lhecker
Copy link
Member

lhecker commented Jun 27, 2021

@DHowett Benchmarking suggests that the removal of .at() improves performance by just 10 kcg/s (+2%) and the removal of .substr() by 50 kcg/s (+10%).
I believe this is caused by inoptimal assembly generated due to .substr(), causing conditional moves and jumps to be emitted. Additionally I believe that exception-throwing code paths cause the CPU pipeline to be less effective.

@lhecker
Copy link
Member

lhecker commented Jun 27, 2021

The difference in assembly is basically (shortened)....

Before:

; exception handling for .substr()
cmp     rsi, rbx
jb      SHORT $LN27@ProcessStr 
; calling ProcessCharacter (part 1)
mov     rcx, QWORD PTR [r14]
; bad substr assembly
mov     r8, rdi
sub     r8, rbx
mov     rdx, rsi
inc     r8
sub     rdx, rbx
cmp     rdx, r8
lea     rax, QWORD PTR [rcx+rbx*2]
cmovb   r8, rdx
mov     QWORD PTR $T1[rsp], rax
mov     QWORD PTR $T1[rsp+8], r8
movups  xmm0, XMMWORD PTR $T1[rsp]                                                        ; < slow (~5 cycles latency)
movups  XMMWORD PTR std::basic_string_view<wchar_t,std::char_traits<wchar_t> > _run, xmm0 ; < slow (another ~5 cycles)
; exception handling for .at()
cmp     rsi, rdi
jbe     SHORT $LN27@ProcessStr
; calling ProcessCharacter (part 2)
movzx   ecx, WORD PTR [rcx+rdi*2]
call    ProcessCharacter(wchar_t)

After:

; nice counting stuff
mov     QWORD PTR unsigned __int64 _runOffset, rbx
mov     rax, rdi
sub     rax, rbx
inc     rax
mov     QWORD PTR unsigned __int64 _runSize, rax
; calling ProcessCharacter
movzx   ecx, WORD PTR [rsi]
call    ProcessCharacter(wchar_t)

@skyline75489
Copy link
Collaborator Author

Wow after seeing the assembly code posted by @lhecker , I'm not sure if there is "a better way to deal with this than to move back to counts and offsets". Not try to brag here, but the "counting stuff" looks so much better!

@DHowett
Copy link
Member

DHowett commented Jun 28, 2021

Thanks for the explanation 😄

@DHowett
Copy link
Member

DHowett commented Jun 28, 2021

@msftbot make sure @miniksa signs off

@ghost ghost added the AutoMerge Marked for automatic merge by the bot when requirements are met label Jun 28, 2021
@ghost
Copy link

ghost commented Jun 28, 2021

Hello @DHowett!

Because you've given me some instructions on how to help merge this pull request, I'll be modifying my merge approach. Here's how I understand your requirements for merging this pull request:

  • I'll only merge this pull request if it's approved by @miniksa

If this doesn't seem right to you, you can tell me to cancel these instructions and use the auto-merge policy that has been configured for this repository. Try telling me "forget everything I just told you".

@ghost ghost added the Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something label Jun 28, 2021
Copy link
Member

@miniksa miniksa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it makes things faster, I'm all over it. I just want you to try to make sure that someone doesn't lose this again in the future when they forget this particular lesson and go on a memory safety campaign.

@ghost ghost removed Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something AutoMerge Marked for automatic merge by the bot when requirements are met labels Jun 29, 2021
@skyline75489
Copy link
Collaborator Author

skyline75489 commented Jun 29, 2021

With this PR, together with the previous ones that has been merged into main, I'm seeing 80% performance boost in OpenConsole (88kcg) comparing to cmd.exe (19043.1052) (50kcg), and this is only my local build!

@lhecker lhecker added the AutoMerge Marked for automatic merge by the bot when requirements are met label Jun 29, 2021
@ghost
Copy link

ghost commented Jun 29, 2021

Hello @lhecker!

Because this pull request has the AutoMerge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started! Learn more here.

@DHowett DHowett dismissed miniksa’s stale review June 29, 2021 22:47

Chester addressed Michael's concerns.

@DHowett DHowett merged commit 6a37818 into microsoft:main Jun 29, 2021
DHowett pushed a commit that referenced this pull request Jul 7, 2021
Do not eagerly create run by substring to reduce the unnecessary overhead.

Switching from `.substr` to offset/length tracking saves us 7% CPU time.

(cherry picked from commit 6a37818)
@ghost
Copy link

ghost commented Jul 14, 2021

🎉Windows Terminal v1.9.1942.0 has been released which incorporates this pull request.:tada:

Handy links:

@ghost
Copy link

ghost commented Jul 14, 2021

🎉Windows Terminal Preview v1.10.1933.0 has been released which incorporates this pull request.:tada:

Handy links:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Performance Performance-related issue AutoMerge Marked for automatic merge by the bot when requirements are met
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants