-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Removal of LAG with no members takes around 1 sec on 202012 #8071
Labels
Comments
5 tasks
qiluo-msft
pushed a commit
to sonic-net/sonic-swss
that referenced
this issue
Sep 17, 2021
…o PID. (#1841) This PR is intended to fix LAGs cleanup degradation caused by python2.7 -> python3 migration. The approach is to replace `teamd -k -t` call with the raw `SIGTERM` and add PID alive check. This will make sure the `teammgrd` is stopped only after all managed processes are being killed. resolves: sonic-net/sonic-buildimage#8071 **What I did** * Replaced `teamd -k -t` call with raw `SIGTERM` * Added PID alive check **Why I did it** * To fix LAGs cleanup timeout issue caused by python2.7 -> python3 upgrade **How I verified it** 1. Configure 64 LAG RIFs 2. Reload config
nazariig
added a commit
to nazariig/sonic-swss
that referenced
this issue
Sep 17, 2021
…c-net#1841) This PR is intended to fix LAGs cleanup degradation caused by python2.7 -> python3 migration. The approach is to replace `teamd -k -t` call with the raw `SIGTERM` and add PID alive check. This will make sure the `teammgrd` is stopped only after all managed processes are being killed. resolves: sonic-net/sonic-buildimage#8071 **What I did** * Replaced `teamd -k -t` call with raw `SIGTERM` * Added PID alive check **Why I did it** * To fix LAGs cleanup timeout issue caused by python2.7 -> python3 upgrade **How I verified it** 1. Configure 64 LAG RIFs 2. Reload config
qiluo-msft
pushed a commit
to sonic-net/sonic-swss
that referenced
this issue
Sep 20, 2021
This PR is intended to fix LAGs cleanup degradation caused by python2.7 -> python3 migration. The approach is to replace `teamd -k -t` call with the raw `SIGTERM` and add PID alive check. This will make sure the `teammgrd` is stopped only after all managed processes are being killed. resolves: sonic-net/sonic-buildimage#8071 **What I did** * Replaced `teamd -k -t` call with raw `SIGTERM` * Added PID alive check **Why I did it** * To fix LAGs cleanup timeout issue caused by python2.7 -> python3 upgrade **How I verified it** 1. Configure 64 LAG RIFs 2. Reload config **Details if related** * N/A
judyjoseph
pushed a commit
to sonic-net/sonic-swss
that referenced
this issue
Sep 27, 2021
…o PID. (#1841) This PR is intended to fix LAGs cleanup degradation caused by python2.7 -> python3 migration. The approach is to replace `teamd -k -t` call with the raw `SIGTERM` and add PID alive check. This will make sure the `teammgrd` is stopped only after all managed processes are being killed. resolves: sonic-net/sonic-buildimage#8071 **What I did** * Replaced `teamd -k -t` call with raw `SIGTERM` * Added PID alive check **Why I did it** * To fix LAGs cleanup timeout issue caused by python2.7 -> python3 upgrade **How I verified it** 1. Configure 64 LAG RIFs 2. Reload config
raphaelt-nvidia
pushed a commit
to raphaelt-nvidia/sonic-swss
that referenced
this issue
Oct 5, 2021
…o PID. (sonic-net#1841) This PR is intended to fix LAGs cleanup degradation caused by python2.7 -> python3 migration. The approach is to replace `teamd -k -t` call with the raw `SIGTERM` and add PID alive check. This will make sure the `teammgrd` is stopped only after all managed processes are being killed. resolves: sonic-net/sonic-buildimage#8071 **What I did** * Replaced `teamd -k -t` call with raw `SIGTERM` * Added PID alive check **Why I did it** * To fix LAGs cleanup timeout issue caused by python2.7 -> python3 upgrade **How I verified it** 1. Configure 64 LAG RIFs 2. Reload config
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Recently we have faced one very strange degradation in LAGs deinit process which requires your attention.
It takes place in 202012 while works perfectly fine in 201911.
The bug is caused by teamd.service long deinit time:
Jun 29 12:19:05.007799 sonic NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel57
Jun 29 12:19:05.023175 sonic NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel58 admin:0 oper:0 addr:1c:34:da:a1:96:80 ifindex:269 master:0 type:team
Jun 29 12:19:05.024577 sonic NOTICE swss#portsyncd: :- onMsg: nlmsg type:17 key:PortChannel58 admin:0 oper:0 addr:1c:34:da:a1:96:80 ifindex:269 master:0 type:team
Jun 29 12:19:05.927276 sonic INFO teamd#supervisord 2021-06-29 12:19:05,924 INFO reaped unknown pid 453 (exit status 0)
Jun 29 12:19:05.928010 sonic INFO teamd#supervisord 2021-06-29 12:19:05,926 INFO waiting for supervisor-proc-exit-listener, rsyslogd, teammgrd to die
Jun 29 12:19:06.024504 sonic NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel58
Jun 29 12:19:06.040217 sonic NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel59 admin:0 oper:0 addr:1c:34:da:a1:96:80 ifindex:270 master:0 type:team
Jun 29 12:19:06.044142 sonic NOTICE swss#portsyncd: :- onMsg: nlmsg type:17 key:PortChannel59 admin:0 oper:0 addr:1c:34:da:a1:96:80 ifindex:270 master:0 type:team
Jun 29 12:19:06.931980 sonic INFO teamd#supervisord 2021-06-29 12:19:06,930 INFO reaped unknown pid 461 (exit status 0)
This results in teammgrd is stopped before the cleanup is completed:
Jun 29 12:19:07.368929 sonic INFO dockerd[693]: time="2021-06-29T12:19:07.366572636Z" level=info msg="Container ed1dbc48e9acb7f01984c7e26e6c36a0496b906b2f38bcc092b8680c32dcde65 failed to exit within 60 seconds of signal 15 - using the force"
In general, it takes approximately ~ 1 sec to remove LAG with no members:
root@sonic:/# /usr/bin/teamd -r -t PortChannel8 -c '{"device":"PortChannel8","hwaddr":"1c:34:da:a1:96:80","runner":{"active":true,"name":"lacp"}}' -L /var/warmboot/teamd/ -g -d
Using team device "PortChannel8".
Using PID file "/var/run/teamd/PortChannel8.pid"
This program is not intended to be run as root.
root@sonic:/# time teamd -t PortChannel8 -k
real 0m1.020s
user 0m0.006s
sys 0m0.006s
The current teammgrd implementation uses teamd -k LAG cleanup interface:
https://github.com/Azure/sonic-swss/blob/202012/cfgmgr/teammgr.cpp#L116
bool TeamMgr::removeLag(const string &alias)
{
SWSS_LOG_ENTER();
}
void TeamMgr::cleanTeamProcesses()
{
SWSS_LOG_ENTER();
SWSS_LOG_NOTICE("Cleaning up LAGs during shutdown...");
for (const auto& it: m_lagList)
{
//This will call team -k kill -t which internally send SIGTERM
removeLag(it);
}
}
After some more investigation, it appears that the LAG remove call (teamd -k - t <port_channel_name>) is a blocking operation,
which consumes extra time because of supervisord is executed with python3.
Things work fine on 201911, where supervisord is executed with python2.7.
It seems, that the new python3 version somehow interferes teamd -k call,
which invokes libdaemon daemon_pid_file_kill_wait function, which uses kill system call to manage the process and select system call to deal with a timeout.
https://github.com/Distrotech/libdaemon/blob/master/libdaemon/dpid.c#L182
int daemon_pid_file_kill_wait(int s, int m) {
pid_t pid;
time_t t;
}
Here is the steps to reproduce:
Build teamd docker for 202012:
make target/docker-teamd.gz
Load teamd docker:
docker load < target/docker-teamd.gz
Compose a Dockerfile:
FROM docker-teamd:latest
RUN apt-get update && apt-get install python2.7 python-pip -y
RUN apt-get remove python3 python3-minimal -y
RUN pip2 install supervisor setuptools
Build teamd docker for python2.7:
docker build -t docker-teamd:p2.7 -f Dockerfile .
Test teamd python2.7 docker:
docker rm -f team_2.7
docker run --privileged -t -d --cap-add NET_ADMIN --name team_2.7 docker-teamd:p2.7
docker exec -ti team_3.7 bash
root@9742295dc210:/# ps -aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 26748 13780 pts/0 Ss+ Jul01 0:21 /usr/bin/python /usr/local/bin/supervisord
root 532 0.0 0.0 3996 3192 pts/1 Ss 15:29 0:00 bash
root 703 0.0 0.0 7640 2692 pts/1 R+ 15:51 0:00 ps -aux
root@9742295dc210:/# teamd -r -t PortChannel6 -c '{"device":"PortChannel6","hwaddr":"1c:34:da:a1:96:80","runner":{"active":true,"name":"lacp"}}' -L /var/warmboot/teamd/ -g -d
Using team device "PortChannel6".
Using PID file "/var/run/teamd/PortChannel6.pid"
This program is not intended to be run as root.
root@9742295dc210:/# time teamd -k -t PortChannel6
real 0m0.108s
user 0m0.004s
sys 0m0.004s
docker rm -f team_3.7
docker run --privileged -t -d --cap-add NET_ADMIN --name team_3.7 docker-teamd:latest
docker exec -ti team_3.7 bash
root@9cd2d740705f:/# rm /etc/supervisor/critical_processes
root@9cd2d740705f:/# supervisorctl stop rsyslogd dependent-startup supervisor-proc-exit-listener teammgrd teamsyncd tlm_teamd
root@9cd2d740705f:/# ps -aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.5 0.0 32288 24144 pts/0 Ss+ 12:57 0:00 /usr/bin/python3 /usr/local/bin/supervisord
root 27 0.0 0.0 3988 3260 pts/1 Ss 12:57 0:00 bash
root 36 0.0 0.0 7640 2708 pts/1 R+ 12:59 0:00 ps -aux
root@9cd2d740705f:/# teamd -r -t PortChannel6 -c '{"device":"PortChannel6","hwaddr":"1c:34:da:a1:96:80","runner":{"active":true,"name":"lacp"}}' -L /var/warmboot/teamd/ -g -d
Using team device "PortChannel6".
Using PID file "/var/run/teamd/PortChannel6.pid"
This program is not intended to be run as root.
root@9cd2d740705f:/# time teamd -k -t PortChannel6
real 0m1.010s
user 0m0.004s
sys 0m0.004s
Profiler output:
root@9742295dc210:/# apt-get install linux-perf
root@9742295dc210:/# teamd -r -t PortChannel6 -c '{"device":"PortChannel6","hwaddr":"1c:34:da:a1:96:80","runner":{"active":true,"name":"lacp"}}' -L /var/warmboot/teamd/ -g -d
Using team device "PortChannel6".
Using PID file "/var/run/teamd/PortChannel6.pid"
This program is not intended to be run as root.
root@9742295dc210:/# perf_4.19 record teamd -k -t PortChannel6
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.016 MB perf.data (39 samples) ]
root@9742295dc210:/# perf_4.19 report --stdio -g none --show-cpu-utilization
To display the perf.data header info, please use --header/--header-only options.
Total Lost Samples: 0
Samples: 39 of event 'cycles:ppp'
Event count (approx.): 10648759
Overhead sys usr Command Shared Object Symbol
........ ........ ........ ....... ................. .......................................
Summary:
python2.7:
root@9742295dc210:/# ./run.sh | grep -a spin
Using team device "PortChannel6".
Using PID file "/var/run/teamd/PortChannel6.pid"
This program is not intended to be run as root.
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.000 MB (null) ]
python3:
root@9cd2d740705f:/# ./run.sh | grep -a spin
Using team device "PortChannel6".
Using PID file "/var/run/teamd/PortChannel6.pid"
This program is not intended to be run as root.
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.000 MB (null) ]
2.84% 2.84% 0.00% teamd [kernel.kallsyms] [k] _raw_spin_lock
IMHO, the culprit is python2.7 -> python3 migration.
P.S: teamd -k call fails constantly with a timeout when python2.7/python3 is used as docker entrypoint and no issues observed with bash
The text was updated successfully, but these errors were encountered: