Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[teamd]: Clean teamd process if LAG creation fails #2888

Merged
merged 9 commits into from
Sep 13, 2023
38 changes: 38 additions & 0 deletions cfgmgr/teammgr.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -307,6 +307,8 @@ void TeamMgr::doLagTask(Consumer &consumer)
{
if (addLag(alias, min_links, fallback, fast_rate) == task_need_retry)
{
// If LAG creation fails, we need to clean up any potentially orphaned teamd processes
removeLag(alias);
it++;
continue;
}
Expand Down Expand Up @@ -654,10 +656,46 @@ bool TeamMgr::removeLag(const string &alias)

stringstream cmd;
string res;
pid_t pid;

cmd << TEAMD_CMD << " -k -t " << shellquote(alias);
EXEC_WITH_ERROR_THROW(cmd.str(), res);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be causing issues. IIUC if the process actually wasn't running, then teamd -k -t PortChannel.1 would fail. So in the case where addLag() fails, but no teamd ... running, then teamd -k -t PortChannel. would fail, as a result an exception will be thrown, could you check?


try
{
std::stringstream cmd;
cmd << "cat " << shellquote("/var/run/teamd/" + alias + ".pid");
EXEC_WITH_ERROR_THROW(cmd.str(), res);
}
catch (const std::exception &e)
{
SWSS_LOG_NOTICE("Failed to remove non-existent port channel %s pid...", alias.c_str());
return false;
}
Comment on lines +661 to +671

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this a bit more, IMO this is internal implementation details of teamd, it's better not assume the location and naming of the file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Teammgrd already reads this file in cleanTeamProcesses, and I don't think there's another easy way to get the teamd PID.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleanTeamProcesses has most part of the code. Is it good to reuse the same code by passing pid? if no pid passed, then original flow. Just a thought


try
{
pid = static_cast<pid_t>(std::stoul(res, nullptr, 10));
SWSS_LOG_INFO("Read port channel %s pid %d", alias.c_str(), pid);
}
catch (const std::exception &e)
{
SWSS_LOG_ERROR("Failed to read port channel %s pid: %s", alias.c_str(), e.what());
return false;
}

try
{
std::stringstream cmd;
cmd << "kill -TERM " << pid;
EXEC_WITH_ERROR_THROW(cmd.str(), res);
}
catch (const std::exception &e)
{
SWSS_LOG_ERROR("Failed to send SIGTERM to port channel %s pid %d: %s", alias.c_str(), pid, e.what());
return false;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to do this? It should be part of the teamd -k -t PortChannel1 IIUC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, since teamd -k can block for some time as per sonic-net/sonic-buildimage#8071

SWSS_LOG_NOTICE("Stop port channel %s", alias.c_str());

return true;
Expand Down