Improving retry logic in remote to handle more cases #111
remote/remote.go
Outdated
switch transportErr.StatusCode {
case http.StatusNotFound, http.StatusUnauthorized:
	return true
}
Suggested change:
- switch transportErr.StatusCode {
- case http.StatusNotFound, http.StatusUnauthorized:
- 	return true
- }
+ if transportErr.StatusCode == http.StatusNotFound || transportErr.StatusCode == http.StatusUnauthorized {
+ 	return true
+ }
I think that we usually use switch-case when we should act differently for each result.
You can also have only one "big" if condition for all of this transportErr, but I'm not sure whether it'll be more readable. I don't have a strong opinion on whether it should be one big if or two nested ifs.
Honestly, looking at both options, the if statement looks more readable to me.
Sorry if this was already covered, but has it been confirmed that we're actually seeing 401 and 404, and that retrying the requests makes these go away?
I'm asking because 40x errors are usually caused by the client, not by the server or by network interference.
Well, the only way I managed to reproduce the error mentioned by @jabrown85 in ticket 1069 was to have the newV1Image method return an empty image, and that happens when we get a 404, a 401, or an error with the message "no child with platform". I am not very happy about retrying on 40x, but otherwise I'm not sure how to address the issue.
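For readers following the thread, the retry behavior under discussion can be sketched roughly as below. This is only an illustration under stated assumptions: statusError is a hypothetical stand-in for the registry transport error (it is not the go-containerregistry type), and fetchWithRetry is an invented helper name, not the code in remote/remote.go.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// statusError is a hypothetical stand-in for a registry transport error.
// It only carries the HTTP status code of the failed request.
type statusError struct {
	StatusCode int
}

func (e *statusError) Error() string {
	return fmt.Sprintf("unexpected status code %d", e.StatusCode)
}

// is40x reports whether err carries a 401 or 404 status code,
// using the single "if" form suggested in the review.
func is40x(err error) bool {
	var se *statusError
	if !errors.As(err, &se) {
		return false
	}
	return se.StatusCode == http.StatusNotFound || se.StatusCode == http.StatusUnauthorized
}

// fetchWithRetry (hypothetical helper) calls fetch once and then retries
// up to maxRetries more times, but only while the error is a 401 or 404.
// It returns the number of attempts made and the last error.
func fetchWithRetry(maxRetries int, fetch func() error) (int, error) {
	attempts := 0
	var err error
	for i := 0; i <= maxRetries; i++ {
		attempts++
		err = fetch()
		if err == nil || !is40x(err) {
			return attempts, err
		}
	}
	return attempts, err
}

func main() {
	calls := 0
	attempts, err := fetchWithRetry(3, func() error {
		calls++
		if calls <= 2 {
			// Simulate two transient 404 responses before success.
			return &statusError{StatusCode: http.StatusNotFound}
		}
		return nil
	})
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Note that any error other than 401 or 404 is returned immediately in this sketch, which mirrors the concern raised above that 40x responses are the unusual case to retry.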
I see. Well, works for me 🙂
@yaelharel I just added the suggestion about the if statement
Great! Thanks!
@micahyoung do you have any thoughts here about the 40x errors?
(Moving my comment to inline)
Signed-off-by: Juan Bustamante <[email protected]>
…g the unit tests Signed-off-by: Juan Bustamante <[email protected]>
remote/remote_test.go
Outdated
server = mockServer.Server()
u, err := url.Parse(server.URL)

h.AssertNil(t, err)

repoName = u.Hostname() + ":" + u.Port() + "/" + repo
I think you can create a helper function for this in order to prevent code duplication.
I did; I put the function in the testhelpers.go file.
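A helper of the kind requested here might look roughly like the sketch below. The name repoNameFromServerURL is hypothetical and is not the function that was added to testhelpers.go; the sketch only factors out the host:port/repo construction shown in the quoted test code.

```go
package main

import (
	"fmt"
	"net/url"
)

// repoNameFromServerURL (hypothetical name) builds a "host:port/repo"
// reference from a test server URL, as the duplicated test code did inline.
func repoNameFromServerURL(serverURL, repo string) (string, error) {
	u, err := url.Parse(serverURL)
	if err != nil {
		return "", err
	}
	return u.Hostname() + ":" + u.Port() + "/" + repo, nil
}

func main() {
	// The URL below is an example value, standing in for server.URL.
	name, err := repoNameFromServerURL("http://127.0.0.1:5000", "org/retry-unauthorized")
	if err != nil {
		panic(err)
	}
	fmt.Println(name)
}
```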
remote/remote_test.go
Outdated
defer server.Close()
_, err := remote.NewImage(repoName, authn.DefaultKeychain, remote.WithPreviousImage(repoName))

h.AssertNil(t, err)
h.AssertEq(t, mockServer.ActualCount(), 3)
Same comment as https://github.com/buildpacks/imgutil/pull/111/files#r631156024.
I put the helper method in the same remote_test.go file because I need to access the remote package. Is that ok?
testhelpers/testhelpers.go
Outdated
@@ -409,3 +410,14 @@ func checkResponseError(r io.Reader) error {

	return nil
}

func SetUpMockServer(t *testing.T, repo string, statusCode, failedCount int) (*MockServer, string) {
What do you think about moving this function to mock_registry.go?
I thought about it at first, but then I was not sure, and because testhelpers holds other methods that are reused in other places, I decided to put it here. Still, I don't think it will harm anyone if we move it to mock_registry.go.
Made a few more suggestions for readability. There is one subtle bug to fix (see #111 (comment)) but overall this is looking good!
I wonder if google/go-containerregistry#1019 is relevant to this PR.
From my perspective this looks good. I left a couple of nits :) I do think it would be good to know what the overlap is between what this PR is trying to accomplish and google/go-containerregistry#1019. Could we remove some of the logic from our side?
Looks good to me!
I think the logic implemented in that PR could affect the EOF logic implemented on our side.
We can remove it, but I think we can do that later. The worst case I see is that an EOF is thrown and handled for us, so this retry logic is never executed; it doesn't harm anyone if we keep it. That's my opinion.
Minor questions/nits
if i != maxRetries {
	continue // retry
}
if len(transportErr.Errors) > 0 {
I am not familiar with transport.Error - what is the purpose of the guard len(transportErr.Errors) > 0? I see it existed before, but was it guarding against a failure of transportErr.StatusCode? The history of the file makes it look like we used to have this guard to protect transportErr.Errors[0].Code - but since we aren't doing that anymore, I think we could remove this check entirely.
switch transportErr.StatusCode {
case http.StatusNotFound, http.StatusUnauthorized:
if transportErr, ok := err.(*transport.Error); ok && is40x(transportErr) {
if i != maxRetries {
What is the cost of retrying? If an image is truly not on a registry, what kind of impact are we introducing to that situation?
If I am not wrong, @natalieparellano estimated that the following fix was saving us 300 milliseconds on each retry. Since we are retrying 3 times, I believe the impact is about 900 milliseconds of overhead if the image actually doesn't exist in the registry.
Are we concerned about that? I feel like I want to be able to send in something like WithExpectedToExist() 😄 so that the consumer of this function can add on the behavior. In the case of an app image in analyzer, for instance, I wouldn't want to retry on 404s since it is likely this is a first build situation. But for fetching a run image, I would expect it to exist.
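One way to picture the opt-in idea floated here is a functional option that callers pass only when they expect the image to exist. Everything in this sketch is hypothetical: WithExpectedToExist is just the name proposed in the comment, and imageOptions, ImageOption, and retriesForNotFound are invented for illustration; none of them is an existing imgutil API.

```go
package main

import "fmt"

// imageOptions holds caller preferences (hypothetical type).
type imageOptions struct {
	expectedToExist bool
}

// ImageOption mutates imageOptions (hypothetical functional option type).
type ImageOption func(*imageOptions)

// WithExpectedToExist marks the image as one the caller expects to find,
// for example a run image, so that a 404 is worth retrying.
func WithExpectedToExist() ImageOption {
	return func(o *imageOptions) {
		o.expectedToExist = true
	}
}

// retriesForNotFound returns how many extra attempts a 404 would get:
// maxRetries when the caller opted in, zero otherwise.
func retriesForNotFound(maxRetries int, opts ...ImageOption) int {
	cfg := imageOptions{}
	for _, opt := range opts {
		opt(&cfg)
	}
	if cfg.expectedToExist {
		return maxRetries
	}
	return 0
}

func main() {
	// App image on a likely first build: no opt-in, so no retries on 404.
	fmt.Println(retriesForNotFound(3))
	// Run image that should exist: opt in to retries.
	fmt.Println(retriesForNotFound(3, WithExpectedToExist()))
}
```

With this shape the default stays cheap for images that may legitimately be missing, and only callers that opt in pay the retry overhead discussed above.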
})
})

when("manifest API in registry returns status code 401", func() {
It would be nice to have a test where we fail, then succeed, to show that it works, asserting that we call twice but actually succeed.
It could be a good test case, let me try to implement it.
Wait, @jabrown85, the current test is actually doing that.
mockServer, repoName = h.SetUpMockServer(t, "org/retry-unauthorized", http.StatusUnauthorized, 2)
The mock server is initialized to fail 2 times; on the third invocation it will respond with a good response.
Then
assertExpectedTries(t, mockServer, repoName, 3)
verifies that the remote path was called 3 times, and inside that verification I check that the method executed successfully:
func assertExpectedTries(t *testing.T, mockServer *h.MockServer, repoName string, expectedCount int) {
	_, err := remote.NewImage(repoName, authn.DefaultKeychain, remote.WithPreviousImage(repoName))
	h.AssertNil(t, err) // the method succeeded
	h.AssertEq(t, mockServer.ActualCount(), expectedCount)
}
Maybe I can save how many times the bad response was returned and assert against that value? That number can also be calculated from the configured failure count and the total number of invocations.
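The "fail N times, then succeed" mock described above can be approximated with the standard library alone. The sketch below is not the project's MockServer; flakyServer, getWithRetry, and run are hypothetical names, and the client is a plain http.Get loop rather than remote.NewImage.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// flakyServer (hypothetical) answers the first failCount requests with
// statusCode and every later request with 200 OK. The returned counter
// records how many requests the server has seen.
func flakyServer(statusCode, failCount int) (*httptest.Server, *int64) {
	var count int64
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt64(&count, 1) <= int64(failCount) {
			w.WriteHeader(statusCode)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	return srv, &count
}

// getWithRetry issues GET requests until one returns 200 OK or
// maxAttempts is reached, and returns the last status code seen.
func getWithRetry(url string, maxAttempts int) (int, error) {
	status := 0
	for i := 0; i < maxAttempts; i++ {
		resp, err := http.Get(url)
		if err != nil {
			return 0, err
		}
		resp.Body.Close()
		status = resp.StatusCode
		if status == http.StatusOK {
			break
		}
	}
	return status, nil
}

// run wires the two together and reports the final status plus the
// number of requests the server received.
func run(statusCode, failCount, maxAttempts int) (int, int64, error) {
	srv, count := flakyServer(statusCode, failCount)
	defer srv.Close()
	status, err := getWithRetry(srv.URL, maxAttempts)
	return status, atomic.LoadInt64(count), err
}

func main() {
	// Two 401 responses, then success on the third request.
	status, seen, err := run(http.StatusUnauthorized, 2, 3)
	fmt.Println("final status:", status, "requests seen:", seen, "err:", err)
}
```

Asserting on both the final status and the request count covers the two properties discussed in this thread: the call eventually succeeds, and it took exactly the expected number of tries.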
ah, ok. I didn't read that the next time would be successful. I think you are good to go
@jjbustamante do you mind if I close this PR? IIRC from Jesse's comment here, we wanted to step back and re-evaluate the approach to avoid adding retry seconds for images that we expect might not exist.
No problem at all
References
Context
A previous Pull Request #96 was merged recently to add some retry logic when an EOF is thrown. The logic was extended to handle the cases in the ticket #1069 in Pack.
Description