entrypoint: in case of step command failure, write postfile #687
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: vdemeester. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
test/taskrun_test.go (outdated)

```go
	tb.Command("/bin/sh"), tb.Args("-c", "exit 1"),
),
tb.Step("world", "busybox",
	tb.Command("/bin/sh"), tb.Args("-c", "sleep 20s"),
```
Added this to detect if this step is executed or not 😅
Force-pushed from d86f71b to fa47d21
Nice!

/lgtm

/hold

Putting on hold for a review from @bobcatfish @pivotal-nader-ziada

/test pull-tekton-pipeline-integration-tests
great PR @vdemeester 👍
some questions for my understanding
```go
t.Logf("Waiting for TaskRun in namespace %s to fail", namespace)
if err := WaitForTaskRunState(c, "failing-taskrun", TaskRunFailed("failing-taskrun"), "TaskRunFailed"); err != nil {
	t.Errorf("Error waiting for TaskRun to finish: %s", err)
```
It would be nice to check the status of the TaskRun after it failed, and make sure step 2 has a failure and step 3 didn't run. Before this fix, it was failing eventually by hitting the default timeout; if we can, we should check that the timeout is not the failure reason.
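A sketch of what that check could look like; `stepState` and `checkFailure` are hypothetical stand-ins for the real TaskRun step-status types, just to illustrate the assertions:

```go
package main

import "fmt"

// stepState is a hypothetical, simplified stand-in for a TaskRun's
// per-step container status.
type stepState struct {
	Name     string
	ExitCode int
	Started  bool
}

// checkFailure asserts that the step at failedIdx terminated with a
// non-zero exit code and that no later step was started.
func checkFailure(steps []stepState, failedIdx int) error {
	if steps[failedIdx].ExitCode == 0 {
		return fmt.Errorf("step %q should have failed", steps[failedIdx].Name)
	}
	for _, s := range steps[failedIdx+1:] {
		if s.Started {
			return fmt.Errorf("step %q should not have run", s.Name)
		}
	}
	return nil
}

func main() {
	steps := []stepState{
		{Name: "hello", ExitCode: 0, Started: true},
		{Name: "failing", ExitCode: 1, Started: true},
		{Name: "world", Started: false},
	}
	fmt.Println(checkFailure(steps, 1)) // <nil>
}
```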
Right, good point. I should validate that all steps after the `exit 1` are terminated and in the same error state 👼

That raises a question: in the case of steps being "skipped" because a previous one failed, should we return an exit code other than the default 1, to be able to detect such cases?
If we do that, do we end up with confusion if a step exits with a non-0/1 exit code on purpose? I'd lean towards using `Reason` or `Message` rather than `ExitCode`.
🤔 It won't change the status of the `TaskRun`. It would only be useful to detect that some of the containers didn't have to execute. We could also "copy" the exit code of the failed process to the later ones.
not sure what the best option is, would be nice to know it didn't execute, anything here makes sense? http://tldp.org/LDP/abs/html/exitcodes.html
We discussed quickly with @abayer on Slack:

- `exit 0` on skipped containers (the ones after the failure)
- updating the TaskRun status to point to the failed step
- updating the steps status with some skipped messages for the steps that were skipped
- exit 0 on skipped containers (the ones after the failure) 👍
- updating the TaskRun status to point to the failed step 🤔 but is that an API change to add that? It already has an array of step statuses which should contain the same info
- updating the steps status with some skipped messages for the steps that were skipped 🤔 I wonder if we are overthinking this. If a step fails, that should be clear enough about why the TaskRun failed
Ok, I'll go for exit code 0 here (and make sure the TaskRun status references the correct step as a failure); we have time to think about skip messages & co later on, in a follow-up.
```diff
 } else if !os.IsNotExist(err) {
-	log.Fatalf("Waiting for %q: %v", file, err)
+	return fmt.Errorf("Waiting for %q: %v", file, err)
```
is this now going to return an error right away and make it not wait?
It doesn't change the behavior, as `log.Fatalf` would have called `os.Exit` (and killed the process). We just "control" where the kill happens 😅
```go
	Entrypoint: *ep,
	WaitFile:   *waitFile,
	PostFile:   *postFile,
	Args:       flag.Args(),
	Waiter:     &RealWaiter{},
	Runner:     &RealRunner{},
	PostWriter: &RealPostWriter{},
}.Go()
}
if err := e.Go(); err != nil {
```
Does this block of code make sense inside the `Go` func?
I initially thought of putting it there, but it meant it was harder to "test" the `Go` function, as it would call `os.Exit`.
fair enough :)
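The trade-off discussed here can be sketched as follows (a simplified illustration with hypothetical types, not the actual entrypoint code): keeping `os.Exit` in `main` and having the `Go()`-style function return an error keeps the function testable:

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// entrypointer is a hypothetical stand-in for the real struct wired up
// in main with Waiter, Runner, and PostWriter.
type entrypointer struct{ fail bool }

// Go runs the step. Returning an error (instead of calling log.Fatalf,
// which would call os.Exit) leaves the caller in control of process
// exit, so tests can call Go() without being killed.
func (e entrypointer) Go() error {
	if e.fail {
		return errors.New("step command failed")
	}
	return nil
}

func main() {
	e := entrypointer{fail: len(os.Args) > 1}
	if err := e.Go(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1) // the only place the process is killed
	}
}
```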
Force-pushed from fa47d21 to ba0a03b
Force-pushed from ba0a03b to 9cd0521
@vdemeester So for now at least, we're not making any changes to the …

Indeed, for now, no change in there 😉

/hold cancel

Excellent. =)

/lgtm

nice work @vdemeester 🎉

/hold

I forgot to remove a piece of code 😹
The entrypoint package wraps the step commands and executes them. This allows us to run a pod's containers in a given order. In a step, the entrypoint binary waits for the previous step's file to be present before executing the actual command.

Before this change, if a command failed (`exit 1` or something), entrypoint would not write a file, and thus the whole pod would be stuck running (all the next steps would wait forever). This fixes that by always writing the post-file, and by making the *waiter* a bit smarter:

- it will now look for a `{postfile}.err` to detect whether the previous step failed or not.
- if the previous step failed, it will fail too, without executing the step commands.

Signed-off-by: Vincent Demeester <[email protected]>
Force-pushed from 9cd0521 to 8fa004b
/hold cancel

/lgtm

/test pull-tekton-pipeline-integration-tests
Previously, we were never closing idle connections, leading to issues described in #687. This commit adds a fixed 2-minute timeout for idle connections; later we can also add other timeouts, as well as allow users to change the timeout values. I verified this manually by building on a base image with a shell and then verifying that the number of open connections eventually goes down, unlike before.

Signed-off-by: Dibyo Mukherjee <[email protected]>
Changes

The entrypoint package wraps the step commands and executes them. This allows us to run a pod's containers in a given order. In a step, the entrypoint binary waits for the previous step's file to be present before executing the actual command.

Before this change, if a command failed (`exit 1` or something), entrypoint would not write a file, and thus the whole pod would be stuck running (all the next steps would wait forever).

This fixes that by always writing the post-file, and by making the waiter a bit smarter:

- it will now look for a `{postfile}.err` to detect whether the previous step failed or not.
- if the previous step failed, it will fail too, without executing the step commands.

Closes #682

cc @pivotal-nader-ziada

I need to update `pkg/entrypoint` unit tests though 👼

Signed-off-by: Vincent Demeester [email protected]
Submitter Checklist

These are the criteria that every PR should meet; please check them off as you review them:

- [ ] Includes docs (if user facing)

See the contribution guide for more details.