-
Notifications
You must be signed in to change notification settings - Fork 8.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Defer _run substr operation in StateMachine::ProcessString #10471
Defer _run substr operation in StateMachine::ProcessString #10471
Conversation
while (current < string.size()) | ||
{ | ||
// The run will be everything from the start INCLUDING the current one | ||
// in case we process the current character and it turns into a passthrough | ||
// fallback that picks up this _run inside `FlushToTerminal` above. | ||
_run = string.substr(start, current - start + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Repeatedly substr
ing turns out to be quite expensive and takes up about 10%-12% of the total CPU time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I honestly forgot whether the 10% here is in OpenConsole or WT, or both...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like you should leave the comment/detail somewhere in here that we shouldn't get the future-bright-idea to "improve safety" here by re-wrapping this into a convenience class. It needs to stay as a raw sort of offset thing because of how high performance this needs to be.
@@ -1838,12 +1839,17 @@ void StateMachine::ProcessString(const std::wstring_view string) | |||
size_t start = 0; | |||
size_t current = start; | |||
|
|||
_currentString = string; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do I need to reset these in _EnterGround
? Feels like I should.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about this any more. Because ResetState()
is also called inside ProcessString
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you were saying, but I agree this appears as a valid exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can make _run
a local variable inside StateMachine::ProcessString
.
This removes the _run
member entirely and removes the redundant state from this class. Can't forget to call _ActuateCurrentRun
, if there's nothing to forget, right?
Thanks @lhecker for the wonderful idea! Note that |
@@ -1885,14 +1891,23 @@ void StateMachine::ProcessString(const std::wstring_view string) | |||
// When we leave the loop, current has been advanced to the length of the string itself | |||
// (or one past the array index to the final char) so this `substr` operation doesn't +1 | |||
// to include the final character (unlike the one inside the top of the loop above.) | |||
_run = start < string.size() ? string.substr(start) : std::wstring_view{}; | |||
if (start < string.size()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand this piece of comment above, be it before this PR or after. Still the logic here remains intact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's saying that on line 1846, a .substr
is called with the 2nd parameter having a +1
on the end as a magic value that helps select the correct distance.
So one might quickly think that this similar .substr
call on line 1888 should also have a +1
magic value and there's an error. But then it goes on to clarify that the nature of the looping made the magic +1
unnecessary.
Or at least that's my understanding of the comment. It looks like this code may have been revised a few times since it was originally written.
Strange. I can't repro the feature tests failure locally. Maybe it's transient? Also the x86 tests are OK. |
OK the test failures seems to be transient. Merged main and now the CI is green. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A string_view is just a pointer and a size; can you go into some detail on why exactly this is so much more performant? Is it just the repeated bounds checks in substr
that the language standard requires? There must be a better way to deal with this than to move back to counts and offsets...
@DHowett This is a CPU trace of OpenConsole when running termbench: I've benchmarked this patch on my experimental, fast glyph-atlas renderer and it improves VT performance from 490 kcg/s up to 550 kcg/s, or a whopping +12% just like that! |
Right, but substr is a simple operation. What's inside it that's so expensive? |
That single line is combination of: substr, constructing a string_view, and copying a string_view into _run. You want me to find a winner among these three? 😅 Talked with Leonard. He thinks it’s substr with all the internal checks for boundaries and everything. |
@DHowett Benchmarking suggests that the removal of |
The difference in assembly is basically (shortened).... Before: ; exception handling for .substr()
cmp rsi, rbx
jb SHORT $LN27@ProcessStr
; calling ProcessCharacter (part 1)
mov rcx, QWORD PTR [r14]
; bad substr assembly
mov r8, rdi
sub r8, rbx
mov rdx, rsi
inc r8
sub rdx, rbx
cmp rdx, r8
lea rax, QWORD PTR [rcx+rbx*2]
cmovb r8, rdx
mov QWORD PTR $T1[rsp], rax
mov QWORD PTR $T1[rsp+8], r8
movups xmm0, XMMWORD PTR $T1[rsp] ; < slow (~5 cycles latency)
movups XMMWORD PTR std::basic_string_view<wchar_t,std::char_traits<wchar_t> > _run, xmm0 ; < slow (another ~5 cycles)
; exception handling for .at()
cmp rsi, rdi
jbe SHORT $LN27@ProcessStr
; calling ProcessCharacter (part 2)
movzx ecx, WORD PTR [rcx+rdi*2]
call ProcessCharacter(wchar_t) After: ; nice counting stuff
mov QWORD PTR unsigned __int64 _runOffset, rbx
mov rax, rdi
sub rax, rbx
inc rax
mov QWORD PTR unsigned __int64 _runSize, rax
; calling ProcessCharacter
movzx ecx, WORD PTR [rsi]
call ProcessCharacter(wchar_t) |
Wow after seeing the assembly code posted by @lhecker , I'm not sure if there is "a better way to deal with this than to move back to counts and offsets". Not try to brag here, but the "counting stuff" looks so much better! |
Thanks for the explanation 😄 |
@msftbot make sure @miniksa signs off |
Hello @DHowett! Because you've given me some instructions on how to help merge this pull request, I'll be modifying my merge approach. Here's how I understand your requirements for merging this pull request:
If this doesn't seem right to you, you can tell me to cancel these instructions and use the auto-merge policy that has been configured for this repository. Try telling me "forget everything I just told you". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it makes things faster, I'm all over it. I just want you to try to make sure that someone doesn't lose this again in the future when they forget this particular lesson and go on a memory safety campaign.
With this PR, together with the previous ones that has been merged into main, I'm seeing 80% performance boost in |
Hello @lhecker! Because this pull request has the p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (
|
Do not eagerly create run by substring to reduce the unnecessary overhead. Switching from `.substr` to offset/length tracking saves us 7% CPU time. (cherry picked from commit 6a37818)
🎉 Handy links: |
🎉 Handy links: |
Do not eagerly create run by substring to reduce the unnecessary overhead.
Switching from
.substr
to offset/length tracking saves us 7% CPU time.