-
Notifications
You must be signed in to change notification settings - Fork 558
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[teamd]: Clean teamd process if LAG creation fails #2888
Conversation
Signed-off-by: Lawrence Lee <[email protected]>
Thanks for fix, can you please add a unit test to cover? |
Signed-off-by: Lawrence Lee <[email protected]>
4a1b951
to
706b5e6
Compare
Signed-off-by: Lawrence Lee <[email protected]>
cfgmgr/teammgr.cpp
Outdated
cmd << TEAMD_CMD << " -k -t " << shellquote(alias); | ||
EXEC_WITH_ERROR_THROW(cmd.str(), res); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be causing issues. IIUC if the process actually wasn't running, then teamd -k -t PortChannel.1
would fail. So in the case where addLag()
fails, but no teamd ...
running, then teamd -k -t PortChannel.
would fail, as a result an exception will be thrown, could you check?
cfgmgr/teammgr.cpp
Outdated
try | ||
{ | ||
pid = static_cast<pid_t>(std::stoul(res, nullptr, 10)); | ||
SWSS_LOG_INFO("Read port channel %s pid %d", alias.c_str(), pid); | ||
} | ||
catch (const std::exception &e) | ||
{ | ||
SWSS_LOG_ERROR("Failed to read port channel %s pid: %s", alias.c_str(), e.what()); | ||
return false; | ||
} | ||
|
||
try | ||
{ | ||
std::stringstream cmd; | ||
cmd << "kill -TERM " << pid; | ||
EXEC_WITH_ERROR_THROW(cmd.str(), res); | ||
} | ||
catch (const std::exception &e) | ||
{ | ||
SWSS_LOG_ERROR("Failed to send SIGTERM to port channel %s pid %d: %s", alias.c_str(), pid, e.what()); | ||
return false; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to do this? It should be part of the teamd -k -t PortChannel1
IIUC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, since teamd -k
can block for some time as per sonic-net/sonic-buildimage#8071
try | ||
{ | ||
std::stringstream cmd; | ||
cmd << "cat " << shellquote("/var/run/teamd/" + alias + ".pid"); | ||
EXEC_WITH_ERROR_THROW(cmd.str(), res); | ||
} | ||
catch (const std::exception &e) | ||
{ | ||
SWSS_LOG_NOTICE("Failed to remove non-existent port channel %s pid...", alias.c_str()); | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this a bit more, IMO this is internal implementation details of teamd
, it's better not assume the location and naming of the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Teammgrd already reads this file in cleanTeamProcesses
, and I don't think there's another easy way to get the teamd PID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cleanTeamProcesses
has most part of the code. Is it good to reuse the same code by passing pid? if no pid passed, then original flow. Just a thought
Signed-off-by: Lawrence Lee <[email protected]>
Signed-off-by: Lawrence Lee <[email protected]>
Signed-off-by: Lawrence Lee <[email protected]>
Signed-off-by: Lawrence Lee <[email protected]>
cfgmgr/teammgr.cpp
Outdated
cmd << TEAMD_CMD << " -k -t " << shellquote(alias); | ||
EXEC_WITH_ERROR_THROW(cmd.str(), res); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, can I suggest to the following:
bool TeamMgr::removeLag(const string &alias, bool assertSuccess)
{
...
if (asserSuccess) {
EXEC_WITH_ERROR_THROW(cmd.str(), res);
} else {
exec( cmd.str(), res );
}
...
}
In the case when the addLag()
fails, removeLag()
will be called with assertSuccess=false
, otherwise the code path stays the same. IMO this would address the issue while still keeping the addLag()
and removeLag()
symmetrical... my 2cents.
try | ||
{ | ||
std::stringstream cmd; | ||
cmd << "cat " << shellquote("/var/run/teamd/" + alias + ".pid"); | ||
EXEC_WITH_ERROR_THROW(cmd.str(), res); | ||
} | ||
catch (const std::exception &e) | ||
{ | ||
SWSS_LOG_NOTICE("Failed to remove non-existent port channel %s pid...", alias.c_str()); | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cleanTeamProcesses
has most part of the code. Is it good to reuse the same code by passing pid? if no pid passed, then original flow. Just a thought
@judyjoseph , would you take a look at the PR? |
* [teamd]: Clean teamd process if LAG creation fails
This reverts commit c7e5f10.
…onic-net#2888)"" This reverts commit 7097cf2.
@theasianpianist can you help create PR for 202205 branch? This change cannot be cherry-picked cleanly. |
* [teamd]: Clean teamd process if LAG creation fails
#2932 PR here, will monitor the PR checks |
…2932) What I did 202205 cherry-pick for [teamd]: Clean teamd process if LAG creation fails #2888 During LAG creation, if the teamd process fails for any reason, cleanup any leftover teamd processes for the LAG alias Call kill directly in removeLag instead of using teamd -k Why I did it If the teamd process times out for any reason, subsequent attempts to create a LAG may fail if there is an orphaned process left running As detailed in Removal of LAG with no members takes around 1 sec on 202012 sonic-buildimage#8071, team -k may block for some time. Signed-off-by: Lawrence Lee <[email protected]>
What I did
kill
directly inremoveLag
instead of usingteamd -k
Why I did it
team -k
may block for some time.How I verified it
Details if related