Fix memory leak on closed watch channel #1263
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here.
What to do if you already signed the CLA: see the sections for individual signers and corporate signers.
ℹ️ Googlers: Go here for more info.
@ericmillin: 0 warnings.
In response to this:
Description
This fixes a serious bug in the wait logic. If the watcher's channel closes, the loop tightens as it continuously retries until the command times out. This has three negative effects:
- The code continuously retries waiting even though the results channel is never recreated, meaning the wait will never succeed.
- The CPU is pegged.
- `time.After` creates a new timer on each invocation. The timers aren't garbage collected until they expire, resulting in a crippling memory leak.

It's not clear to me why the results channel is closed, but it happens regularly in our testing. I believe some of the issues were introduced by the changes in November. The old logic appears to have recreated watchers on retries, which presumably included a new results channel. (A small standalone snippet below illustrates why a closed channel makes the loop spin.)
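The Go semantics behind the first two effects can be shown with a tiny self-contained snippet; this is not code from this PR, just an illustration of why a closed, never-recreated result channel turns the wait into a busy loop:

```go
package main

import "fmt"

func main() {
	ch := make(chan struct{})
	close(ch)

	// A receive from a closed channel never blocks: it returns the zero value
	// and ok == false immediately. A select that keeps including the watcher's
	// closed result channel therefore fires on every iteration and spins.
	for i := 0; i < 3; i++ {
		_, ok := <-ch
		fmt.Printf("receive %d returned immediately, ok=%v\n", i, ok)
	}
}
```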
Changes
Ultimately, the code needs a refactor, but the changes here fix all the bugs other than the permanently closed results channel. I also removed some cruft in `waitForEvent::Wait`.
./lint
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Welcome @ericmillin! It looks like this is your first PR to knative/client 🎉
Hi @ericmillin. Thanks for your PR. I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Thanks for the contribution! Could you please take a look?
@@ -236,15 +239,12 @@ func (w *waitForEvent) Wait(watcher watch.Interface, name string, options Option
timeout := options.timeoutWithDefault()
start := time.Now()
// channel used to transport the error
errChan := make(chan error)
removed as cruft
return fmt.Errorf("timeout: %s '%s' not ready after %d seconds", w.kind, name, int(timeout/time.Second)), time.Since(start)
}

if retry {
	// restart loop
	// sleep to prevent CPU pegging and restart the loop
	time.Sleep(pollInterval)
This isn't necessary if there's no longer any chance the CPU can be pegged. However, unless you're 100% sure that can't happen, it should remain.
This is not really clear to me. As I understand it, it's meant as a safeguard in case the retry loop goes wild again (like `waitForReadyCondition` returning immediately with `retry == true`, which has happened up to now when the closed watcher channel wasn't detected).
Looking at the code I can't see how the CPU could be pegged, but I have no proof. Adding a sleep here would make it harder to detect such a situation, but it could, of course, be 'good enough'. So I'm a bit biased here :)
However, I would not couple this to `pollInterval`, which is used as the poll interval for a special kind of watcher (i.e. when the API does not support intrinsic watching, as is the case for Cloud Run). Instead, I would probably use a shorter safeguard, maybe 200 or 500 ms (and a different constant or even a literal value here).
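Purely as an illustration of that suggestion (the constant name, value, and helper are made up here, not the client's actual code), a safeguard decoupled from `pollInterval` could look roughly like this:

```go
package wait

import "time"

// retrySafeguardInterval is a hypothetical constant, deliberately separate
// from pollInterval so the safeguard can be tuned without affecting the
// polling watcher used when the API has no intrinsic watch support.
const retrySafeguardInterval = 200 * time.Millisecond

// runWithRetry sketches the shape of the suggestion: on retry, back off
// briefly with the dedicated constant instead of reusing pollInterval.
func runWithRetry(step func() (retry bool, err error)) error {
	for {
		retry, err := step()
		if err != nil || !retry {
			return err
		}
		time.Sleep(retrySafeguardInterval)
	}
}
```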
@@ -111,20 +111,20 @@ func NoopMessageCallback() MessageCallback {
func (w *waitForReadyConfig) Wait(watcher watch.Interface, name string, options Options, msgCallback MessageCallback) (error, time.Duration) {

timeout := options.timeoutWithDefault()
floatingTimeout := timeout
timeoutTimer := time.NewTimer(timeout)
Reusing the same timer prevents the timer leak caused by `time.After`. It is also cleaner because the `floatingTimeout` math is no longer needed.
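A hedged sketch of that pattern (not the PR's exact code; `waitWithSingleTimer` and `isReady` are placeholders invented for the illustration): one timer is created up front and covers the whole wait, so there is no per-iteration `time.After` allocation and no remaining-timeout arithmetic on retries.

```go
package wait

import (
	"errors"
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

// waitWithSingleTimer waits for an event that satisfies isReady, using a
// single timer for the overall timeout instead of calling time.After inside
// the loop.
func waitWithSingleTimer(watcher watch.Interface, timeout time.Duration,
	isReady func(e watch.Event) bool) error {

	timeoutTimer := time.NewTimer(timeout)
	defer timeoutTimer.Stop()

	for {
		select {
		case event, ok := <-watcher.ResultChan():
			if !ok {
				// Bail out instead of spinning on the closed channel.
				return errors.New("watch channel closed")
			}
			if isReady(event) {
				return nil
			}
		case <-timeoutTimer.C:
			return errors.New("timeout waiting for readiness")
		}
	}
}
```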
Looks cleaner. I guess this is difficult to test?
/ok-to-test
Yes, I tried with the mocks, but I couldn't get it working. You need to close the channel after the watcher starts. I didn't see a way to do that. Happy to take another crack at it if you see a way to make it all work.
Thanks a lot, very good catch! I only have a minor question, otherwise we should merge this asap.
@maximilien I think this is really hard to test. Maybe one could simulate a watch function that closes the channel but also detects if someone reads from that closed channel afterwards (waiting some time) and then fails the test if this is the case. Not sure if this is easy to implement; I would merge the PR without it, too.
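One possible shape for such a test, sketched under assumptions: `waitForReady` is a placeholder for the real Wait call under test, and instead of detecting reads from the closed channel directly, the test closes a fake watcher's result channel up front and asserts that the wait gives up well before its timeout rather than spinning until it.

```go
package wait

import (
	"testing"
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

// TestWaitReturnsOnClosedChannel is a sketch only; waitForReady is a
// hypothetical helper standing in for the real Wait call.
func TestWaitReturnsOnClosedChannel(t *testing.T) {
	fake := watch.NewFake()
	fake.Stop() // closes the fake watcher's result channel

	done := make(chan error, 1)
	go func() {
		done <- waitForReady(fake, 30*time.Second) // hypothetical helper
	}()

	select {
	case err := <-done:
		if err == nil {
			t.Fatal("expected an error when the watch channel is closed")
		}
	case <-time.After(2 * time.Second):
		t.Fatal("wait did not return promptly after the channel was closed")
	}
}
```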
Let me add a few thoughts after digging into the watch API a bit.
That's a neat fix. Thanks for pointing that out. The Go docs are pretty clear there.
Indeed you're correct here as well. With the fix for the watcher race condition we actually lost the ability to recreate the watcher. https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes
As pointed out by @markusthoemmes, I therefore think the right fix for the race condition is to leverage `resourceVersion` in our watch request. I've done a bit of testing and it seems to behave as expected; even with a synthetic timeout postponing the watcher creation, no events are lost. That'd make room to bring back the correct retry behaviour when the channel is closed. As a good example and inspiration, client-go's RetryWatcher is done the same way.
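A rough sketch of that direction using client-go's `RetryWatcher` (the `watchFromVersion` helper and `doWatch` parameter are placeholders for whatever the client would actually use to create the watch):

```go
package wait

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"
)

// watchFromVersion wraps a watch-creating call in a RetryWatcher, which
// re-establishes the watch (and its result channel) whenever the server
// closes it, resuming from the given resourceVersion so events in between
// are not missed.
func watchFromVersion(ctx context.Context, resourceVersion string,
	doWatch func(ctx context.Context, options metav1.ListOptions) (watch.Interface, error)) (watch.Interface, error) {

	lw := &cache.ListWatch{
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			return doWatch(ctx, options)
		},
	}
	return watchtools.NewRetryWatcher(resourceVersion, lw)
}
```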
We're doing more testing with the change I made, and it's definitely not enough to get the correct behavior, though it does solve the CPU and memory issues. It seems like the watch really needs to be recreated on closure. @dsimansk I've tried a different approach and reverted the November fix for the race. I then made some simpler changes to fix the issue. It works so far. Happy to share the changes. However, it sounds like you may already have something in the works.
@ericmillin I don't have the complete code yet, but you can take a look here. Hopefully I can put it together next week and have a PR ready to gather more thoughts on it. Anyway, we should get this PR merged once @rhuss's questions are resolved. :)
Hi @ericmillin, I think we should get this fix into the next release, which is scheduled for next Tuesday, and then we can still continue to work on the resource version syncing. That's what we agreed on in the last client WG call. Would you be able to resolve the conflicts? If not, no problem, we would then take over.
@rhuss Sorry for going incommunicado; I was out of town. I can resolve the conflicts, but you may need to take over if you want this merged by Tuesday. I'm still awaiting approval for the CLA. The version syncing sounds brilliant. While my changes resolved some of the issues, we were still seeing unexpected behavior, which is likely caused by the fact that the logic exits when the channel is closed rather than retrying. Because we had to release changes for an app using v0.19.x, I did a quick-and-dirty patch on a fork. The changes are a big improvement but still have problems accurately tracking state. @duglin has done a lot of manual testing and can provide more details if needed. This is my quick-and-dirty patch, which reverts to the pre-November logic and takes a different approach to solving the race: ericmillin@a523953. It uses a channel returned by `Waiting() <-chan bool`.
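The linked patch isn't reproduced here, so the following is only a guess at what that kind of signal could look like; every name other than `Waiting() <-chan bool` is invented for the illustration. The idea: a wrapper closes a channel once the watch is in place, so the caller can block until then before triggering the change it wants to observe.

```go
package wait

import "k8s.io/apimachinery/pkg/watch"

// signallingWatcher is a hypothetical wrapper, not code from the linked
// patch: callers block on Waiting() until the underlying watch has been
// established, avoiding a race between watcher creation and the update.
type signallingWatcher struct {
	watch.Interface
	waiting chan bool
}

// Waiting returns a channel that is closed once the watch is in place.
func (s *signallingWatcher) Waiting() <-chan bool {
	return s.waiting
}

func newSignallingWatcher(start func() (watch.Interface, error)) (*signallingWatcher, error) {
	sw := &signallingWatcher{waiting: make(chan bool)}
	w, err := start()
	if err != nil {
		return nil, err
	}
	sw.Interface = w
	close(sw.waiting) // signal: safe to perform the change being watched for
	return sw, nil
}
```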
@ericmillin no worries, I will resolve the conflict now and then merge this PR for release 0.23. @dsimansk please pick up after the release to improve on the discussed idea of syncing on revisions.
All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only `@googlebot I consent.`
Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s) and set the `cla` label manually.
ℹ️ Googlers: Go here for more info.
The following is the coverage report on the affected files.
@googlebot I consent.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: rhuss. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.