Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[teamd]: Clean teamd process if LAG creation fails #2888

Merged
merged 9 commits into from
Sep 13, 2023

Conversation

theasianpianist
Copy link
Contributor

What I did

  • During LAG creation, if the teamd process fails for any reason, cleanup any leftover teamd processes for the LAG alias
  • Call kill directly in removeLag instead of using teamd -k

Why I did it

How I verified it

Details if related

@prsunny
Copy link
Collaborator

prsunny commented Aug 24, 2023

Thanks for fix, can you please add a unit test to cover?

Comment on lines 661 to 662
cmd << TEAMD_CMD << " -k -t " << shellquote(alias);
EXEC_WITH_ERROR_THROW(cmd.str(), res);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be causing issues. IIUC if the process actually wasn't running, then teamd -k -t PortChannel.1 would fail. So in the case where addLag() fails, but no teamd ... running, then teamd -k -t PortChannel. would fail, as a result an exception will be thrown, could you check?

Comment on lines 676 to 698
try
{
pid = static_cast<pid_t>(std::stoul(res, nullptr, 10));
SWSS_LOG_INFO("Read port channel %s pid %d", alias.c_str(), pid);
}
catch (const std::exception &e)
{
SWSS_LOG_ERROR("Failed to read port channel %s pid: %s", alias.c_str(), e.what());
return false;
}

try
{
std::stringstream cmd;
cmd << "kill -TERM " << pid;
EXEC_WITH_ERROR_THROW(cmd.str(), res);
}
catch (const std::exception &e)
{
SWSS_LOG_ERROR("Failed to send SIGTERM to port channel %s pid %d: %s", alias.c_str(), pid, e.what());
return false;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to do this? It should be part of the teamd -k -t PortChannel1 IIUC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, since teamd -k can block for some time as per sonic-net/sonic-buildimage#8071

Comment on lines +664 to +674
try
{
std::stringstream cmd;
cmd << "cat " << shellquote("/var/run/teamd/" + alias + ".pid");
EXEC_WITH_ERROR_THROW(cmd.str(), res);
}
catch (const std::exception &e)
{
SWSS_LOG_NOTICE("Failed to remove non-existent port channel %s pid...", alias.c_str());
return false;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this a bit more, IMO this is internal implementation details of teamd, it's better not assume the location and naming of the file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Teammgrd already reads this file in cleanTeamProcesses, and I don't think there's another easy way to get the teamd PID.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleanTeamProcesses has most part of the code. Is it good to reuse the same code by passing pid? if no pid passed, then original flow. Just a thought

Comment on lines 661 to 663
cmd << TEAMD_CMD << " -k -t " << shellquote(alias);
EXEC_WITH_ERROR_THROW(cmd.str(), res);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, can I suggest to the following:

bool TeamMgr::removeLag(const string &alias, bool assertSuccess)
{
...
if (asserSuccess) {
    EXEC_WITH_ERROR_THROW(cmd.str(), res);
} else {
    exec( cmd.str(), res );
}
   ...
}

In the case when the addLag() fails, removeLag() will be called with assertSuccess=false, otherwise the code path stays the same. IMO this would address the issue while still keeping the addLag() and removeLag() symmetrical... my 2cents.

Comment on lines +664 to +674
try
{
std::stringstream cmd;
cmd << "cat " << shellquote("/var/run/teamd/" + alias + ".pid");
EXEC_WITH_ERROR_THROW(cmd.str(), res);
}
catch (const std::exception &e)
{
SWSS_LOG_NOTICE("Failed to remove non-existent port channel %s pid...", alias.c_str());
return false;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleanTeamProcesses has most part of the code. Is it good to reuse the same code by passing pid? if no pid passed, then original flow. Just a thought

@prsunny
Copy link
Collaborator

prsunny commented Sep 7, 2023

@judyjoseph , would you take a look at the PR?

@prsunny prsunny merged commit 13ef25b into sonic-net:master Sep 13, 2023
StormLiangMS pushed a commit that referenced this pull request Sep 21, 2023
* [teamd]: Clean teamd process if LAG creation fails
StormLiangMS added a commit that referenced this pull request Sep 22, 2023
theasianpianist added a commit to theasianpianist/sonic-swss that referenced this pull request Sep 29, 2023
prsunny pushed a commit that referenced this pull request Oct 3, 2023
* Revert "Revert "[teamd]: Clean teamd process if LAG creation fails (#2888)""
This reverts commit 7097cf2.

During LAG creation, if the teamd process fails for any reason, cleanup any leftover teamd processes for the LAG alias
Call kill directly in removeLag instead of using teamd -k
@yxieca
Copy link
Contributor

yxieca commented Oct 12, 2023

@theasianpianist can you help create PR for 202205 branch? This change cannot be cherry-picked cleanly.

theasianpianist added a commit to theasianpianist/sonic-swss that referenced this pull request Oct 12, 2023
* [teamd]: Clean teamd process if LAG creation fails
@theasianpianist
Copy link
Contributor Author

@theasianpianist can you help create PR for 202205 branch? This change cannot be cherry-picked cleanly.

#2932 PR here, will monitor the PR checks

yxieca pushed a commit that referenced this pull request Nov 3, 2023
…2932)

What I did

202205 cherry-pick for [teamd]: Clean teamd process if LAG creation fails #2888
During LAG creation, if the teamd process fails for any reason, cleanup any leftover teamd processes for the LAG alias
Call kill directly in removeLag instead of using teamd -k

Why I did it

If the teamd process times out for any reason, subsequent attempts to create a LAG may fail if there is an orphaned process left running
As detailed in Removal of LAG with no members takes around 1 sec on 202012  sonic-buildimage#8071, team -k may block for some time.

Signed-off-by: Lawrence Lee <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants