Improving retry logic in remote to handle more cases #111

Closed
wants to merge 13 commits into from

jjbustamante
Member

@jjbustamante jjbustamante commented May 5, 2021

References

Context

A previous pull request, #96, was merged recently to add retry logic when an EOF is returned. This PR extends that logic to handle the cases described in ticket #1069 in Pack.

Description

  • The PR adds unit tests to verify the behavior of the retry logic implemented in the remote.NewV1Image method.
  • A small bug was fixed: when there was no problem communicating with the Docker registry, the request was still retried two more times (a break was added to avoid this).

@jjbustamante jjbustamante requested a review from a team as a code owner May 5, 2021 16:48
remote/remote.go Outdated
Comment on lines 220 to 223
switch transportErr.StatusCode {
case http.StatusNotFound, http.StatusUnauthorized:
	return true
}
Contributor

Suggested change
- switch transportErr.StatusCode {
- case http.StatusNotFound, http.StatusUnauthorized:
- 	return true
- }
+ if transportErr.StatusCode == http.StatusNotFound || transportErr.StatusCode == http.StatusUnauthorized {
+ 	return true
+ }

I think we usually use switch-case when we need to act differently for each result.
You could also have a single "big" if condition covering all of the transportErr checks, but I'm not sure it would be more readable. I don't have a strong opinion on whether it should be one big if or two nested ifs.

Member Author

Honestly, looking at both options, the if statement looks more readable to me.

Copy link

Sorry if this was already covered, but has it been confirmed that we're actually seeing 401 and 404 responses, and that retrying the requests makes them go away?

I'm asking because 40x errors are usually caused by the client, not by the server or by network interference.

Member Author

Well, the only way I managed to reproduce the error mentioned by @jabrown85 in ticket 1069 was to have the newV1Image method return an empty image, which happens when we get a 404, a 401, or an error with the message "no child with platform". I am not very happy about retrying on 40x, but otherwise I am not sure how to address the issue.

Copy link

I see. Well, works for me 🙂

Member Author

@yaelharel I just added the suggestion about the if statement

Contributor

Great! Thanks!

Member

@micahyoung do you have any thoughts here about the 40x errors?

Member

@micahyoung micahyoung May 7, 2021

(Moving my comment to inline)

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.

Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

Signed-off-by: Juan Bustamante <[email protected]>
Comment on lines 538 to 543
server = mockServer.Server()
u, err := url.Parse(server.URL)

h.AssertNil(t, err)

repoName = u.Hostname() + ":" + u.Port() + "/" + repo
Contributor

I think you can create a helper function for this to avoid code duplication.
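
A helper along these lines might look like the sketch below. The name repoNameFor is hypothetical and is not the helper that was eventually added in the PR; it only wraps the URL parsing and string concatenation from the snippet under review.

```go
package main

import (
	"fmt"
	"net/url"
)

// repoNameFor (hypothetical name) builds "host:port/repo" from the URL of a
// test server, mirroring the duplicated lines in the test.
func repoNameFor(serverURL, repo string) (string, error) {
	u, err := url.Parse(serverURL)
	if err != nil {
		return "", err
	}
	return u.Hostname() + ":" + u.Port() + "/" + repo, nil
}

func main() {
	name, err := repoNameFor("http://127.0.0.1:5000", "org/retry")
	if err != nil {
		panic(err)
	}
	fmt.Println(name)
}
```

Each test case could then call the helper once instead of repeating the parsing steps.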

Member Author

I did; I put the function in the testhelpers.go file.

Comment on lines 547 to 551
defer server.Close()
_, err := remote.NewImage(repoName, authn.DefaultKeychain, remote.WithPreviousImage(repoName))

h.AssertNil(t, err)
h.AssertEq(t, mockServer.ActualCount(), 3)
Contributor

Member Author

@jjbustamante jjbustamante May 12, 2021

I put the helper method in the same remote_test.go file because I need to access the remote package. Is that OK?

@@ -409,3 +410,14 @@ func checkResponseError(r io.Reader) error {

return nil
}

func SetUpMockServer(t *testing.T, repo string, statusCode, failedCount int) (*MockServer, string) {
Contributor

What do you think about moving this function to mock_registry.go?

Member Author

I thought about it at first, but then I was not sure, and because testhelpers has other methods that are reused in other places, I decided to put it here. But I don't think it will do any harm if we move it to mock_registry.go.

@natalieparellano
Member

Made a few more suggestions for readability. There is one subtle bug to fix (see #111 (comment) ) but overall this is looking good!

@natalieparellano
Member

I wonder if google/go-containerregistry#1019 is relevant to this PR.

@natalieparellano
Member

From my perspective this looks good. I left a couple of nits :)

I do think it would be good to know what is the overlap between what this PR is trying to accomplish and google/go-containerregistry#1019. Could we remove some of the logic from our side?

Contributor

@yaelharel yaelharel left a comment

Looks good to me!

@jjbustamante
Member Author

From my perspective this looks good. I left a couple of nits :)

I do think it would be good to know what is the overlap between what this PR is trying to accomplish and google/go-containerregistry#1019. Could we remove some of the logic from our side?

I think the logic implemented in that PR could affect the EOF logic implemented on our side.

if err == io.EOF && i != maxRetries {
	continue // retry
}

We can remove it, but I think we can do that later. The worst case I see is that an EOF is thrown and handled for us upstream, so this retry logic is never executed. It doesn't do any harm to keep it. That's my opinion.

Contributor

@jabrown85 jabrown85 left a comment

Minor questions/nits

if i != maxRetries {
continue // retry
}
if len(transportErr.Errors) > 0 {
Contributor

I am not familiar with the transport.Error - what is the purpose of the guard len(transportErr.Errors) > 0? I see it existed before, but was it guarding a failure of transportErr.StatusCode? The history of the files makes it look like we used to have this guard to protect transportErr.Errors[0].Code - but since we aren't doing that anymore I think we could remove this check entirely.

switch transportErr.StatusCode {
case http.StatusNotFound, http.StatusUnauthorized:
if transportErr, ok := err.(*transport.Error); ok && is40x(transportErr) {
if i != maxRetries {
Contributor

What is the cost of retrying? If an image is truly not on a registry, what kind of impact are we introducing to that situation?

Member Author

If I am not wrong, @natalieparellano estimated that the following fix was saving us 300 milliseconds on each retry. Since we are retrying 3 times, I believe the impact is about 900 milliseconds of overhead if the image actually doesn't exist in the registry.

Contributor

Are we concerned about that? I feel like I want to be able to send something in like WithExpectedToExist() 😄 so that the consumer of this function can add on the behavior. In the case of an app image in analyzer, for instance, I wouldn't want to retry on 404s since it is likely this is a first build situation. But for fetching a run image, I would expect it to exist.

})
})

when("manifest API in registry returns status code 401", func() {
Contributor

It would be nice to have a test where we fail and then succeed, to show that it works: assert that we call twice but actually succeed.

Member Author

It could be a good test case, let me try to implement it.

Member Author

Wait, @jabrown85, the current test is actually doing that.

mockServer, repoName = h.SetUpMockServer(t, "org/retry-unauthorized", http.StatusUnauthorized, 2): the mock server is initialized to fail 2 times, and on the third invocation it responds with a good response.

Then

assertExpectedTries(t, mockServer, repoName, 3) verifies that the remote path was called 3 times, and inside that verification I am checking that the method executed successfully.

func assertExpectedTries(t *testing.T, mockServer *h.MockServer, repoName string, expectedCount int) {
	_, err := remote.NewImage(repoName, authn.DefaultKeychain, remote.WithPreviousImage(repoName))
	h.AssertNil(t, err) // the method succeeded
	h.AssertEq(t, mockServer.ActualCount(), expectedCount)
}

Maybe I can record how many times the bad response was returned and assert against that value? That number can also be calculated from the configured number of failures and the total number of invocations.

Contributor

ah, ok. I didn't read that the next time would be successful. I think you are good to go

@natalieparellano
Member

@jjbustamante do you mind if I close this PR? IIRC from Jesse's comment here we wanted to step back and re-evaluate the approach to avoid adding retry seconds for images that we expect might not exist.

@jjbustamante
Member Author

@jjbustamante do you mind if I close this PR? IIRC from Jesse's comment here we wanted to step back and re-evaluate the approach to avoid adding retry seconds for images that we expect might not exist.

No problem at all

7 participants